Picoquic test coverage now at 90%

04 Dec 2024

A couple of weeks ago, I got a message from researchers who had found a couple of pretty nasty bugs in Picoquic. They were quickly fixed, but they gave me a bad feeling. In theory, the core functions of Picoquic were well tested, and bugs of this kind had been eliminated a long time ago. But the code had been through quite a bit of churn as more functionality, such as multipath, ack frequency and Web Transport, was implemented. Expectations had also changed. Was the testing still sufficient? In a word, no. Measurements using gcov and gcovr showed that for the main libraries, test coverage stood at 84.5% of lines and 72.5% of branches. Some parts were well tested, but many of the newer additions had only minimal coverage. The two bugs reported by the researchers were in the HTTP3 parsing of queries, in code that was initially meant as a quick demonstration but that most users ended up using as is. It was time to suspend further development of Web Transport, Media over Quic or P2P Quic until the code quality improved.

The effort concentrated on testing three main libraries: the core QUIC protocol implementation, the HTTP3 and Web Transport code, and the logging code.

In total, that's 32,479 lines of code. I used the gcov results to pick the worst offending areas one by one. For each of them, I wrote additional tests to reach the uncovered code, and sometimes managed to remove existing code that was not useful in our scenarios; at the end of the exercise, the code was 345 lines smaller, or 1.07%. Altogether, I submitted 12 PRs over the course of 3 weeks before hitting the goal of at least 90% line coverage and 75% branch coverage. The list is available in GitHub issue 1781 in the picoquic repository, summarized in this table:

Step                                    Cover lines   Cover functions   Cover branches
As of 2024/11/20                        84.5%         89.5%             72.5%
Siduck removal                          84.8%         89.8%             72.8%
Test getters and setters in the API     85.4%         91.6%             73.4%
More frame parsing / formatting tests   86.2%         92.0%             74.3%
Fix the spinbit tests                   86.2%         92.1%             74.3%
Add tests for BBRv1                     86.8%         92.7%             74.9%
Add tests for TLS API                   87.3%         93.4%             75.2%
H3Zero Client Data tests                87.6%         93.6%             75.6%
Test of transport parameter code        87.7%         93.8%             75.7%
Demo client tests                       88.2%         93.9%             76.2%
Test of logging functions               89.6%         95.7%             76.9%
Test of Web Transport & capsule         89.8%         95.8%             77.0%
Remove unused packet code               89.9%         95.8%             77.2%
Prune and test utility functions        90.1%         96.1%             77.3%

I tried to avoid writing tests that merely targeted the coverage metric. In some cases, that's unavoidable. For example, the API includes a set of "getter" and "setter" functions, so applications can check variables without depending on the internals of the protocol implementation. These are simple functions, but they do count in the coverage numbers, so I ended up writing a bunch of simple tests for them. In general, though, it is better to drive the test through a high-level API: the test then exercises not only the targeted branch, but also the impact of that branch's behavior on the rest of the code.
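
As an illustration, here is roughly what one of those accessor tests boils down to. This is a minimal sketch with hypothetical names, not the actual picoquic API:

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical context and accessors, standing in for the real API. */
    typedef struct st_quic_ctx_t {
        uint64_t idle_timeout_ms;
    } quic_ctx_t;

    static void quic_set_idle_timeout(quic_ctx_t* ctx, uint64_t ms)
    {
        ctx->idle_timeout_ms = ms;
    }

    static uint64_t quic_get_idle_timeout(const quic_ctx_t* ctx)
    {
        return ctx->idle_timeout_ms;
    }

    /* Minimal round-trip test: set a value, then read it back.
     * Returns 0 on success, -1 on failure, the usual C test idiom. */
    static int idle_timeout_accessor_test(void)
    {
        quic_ctx_t ctx = { 0 };
        quic_set_idle_timeout(&ctx, 30000);
        return (quic_get_idle_timeout(&ctx) == 30000) ? 0 : -1;
    }

    int main(void)
    {
        int ret = idle_timeout_accessor_test();
        printf("idle_timeout_accessor_test: %s\n", (ret == 0) ? "pass" : "fail");
        return (ret == 0) ? 0 : 1;
    }

A test like this adds little beyond the coverage numbers, which is exactly why driving the test through a high-level API, where the branch is exercised in context, is preferable when one exists.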

I kept the exercise focused on writing tests, changing the code as little as possible. Inspecting the code, it is obvious that some areas need improvement and possibly refactoring, particularly the way the congestion control algorithms are written (see issue 1794) and the structure of the HTTP3 and Web Transport code (see issue 1795). But doing test improvement and code refactoring at the same time is risky, and could take much longer than expected.

The coverage of branches at the end of exercise remains at 77.3%. Many of the uncovered branches can only be triggered by a failure in memory allocation. There are known techniques for testing that, essentially replacing malloc by a test version, but I need to find a way to integrate this special version of malloc in the existing stress tests. An example would be driving the load on a server, accepting that some connections to the server will fail, but expecting that at least some will succeed and that the server will return to normal when the load diminishes. The problem is to isolate the malloc failures so they happen on the server but not on the client, knowing that for our tests client and server are running the same Quic code in the same process. Suggestions are welcome.
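
For reference, the replacement technique usually looks something like the sketch below. The names and structure are hypothetical, not picoquic's actual test harness; the point is the arming flag that confines failures to server-side processing:

    #include <stdlib.h>
    #include <stdint.h>

    /* Hypothetical fault-injection state, not the actual picoquic harness. */
    static uint64_t alloc_calls = 0;   /* allocations seen since last reset */
    static uint64_t fail_on_call = 0;  /* 0 = never fail; N = fail the Nth call */
    static int faults_armed = 0;       /* set only while server code is running */

    /* Wrapper that the code under test calls instead of malloc. */
    void* test_malloc(size_t size)
    {
        if (faults_armed && ++alloc_calls == fail_on_call) {
            return NULL; /* simulate memory exhaustion */
        }
        return malloc(size);
    }

    /* Run one server-side processing step with the Nth allocation failing.
     * Arming the faults only around the server step keeps the client,
     * which runs the same code in the same process, unaffected. */
    void server_step_with_fault(uint64_t nth_alloc)
    {
        alloc_calls = 0;
        fail_on_call = nth_alloc;
        faults_armed = 1;
        /* ... process one incoming packet or timer event on the server ... */
        faults_armed = 0;
    }

Sweeping the failure point from the first allocation upward then exercises every allocation site on that code path, which is the usual systematic fault-injection loop.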

Did I find bugs? Of course. At the beginning of the exercise, I was concerned that merely driving the test metric might not improve the code all that much. In practice, it did. I found many small bugs, but also a few big ones, the kind that lead to crashes or loops. Overall, the exercise convinced me that not enforcing these test metrics earlier was a mistake!

Comments

If you want to start or join a discussion on this post, the simplest way is to send a toot on the Fediverse/Mastodon to @huitema@social.secret-wg.org.