04 Dec 2024
A couple of weeks ago, I got a message from researchers who had found a couple
of pretty nasty bugs in Picoquic. These were quickly fixed, but they gave
me a bad feeling. In theory, the core functions of Picoquic were
well tested, and bugs of this kind should have been fixed a long time ago.
But the code had gone through quite a bit of churn as more functionality,
like multipath, ack frequency and web transport, was implemented.
Expectations also changed. Was the testing still sufficient? In a word, no.
Measurements using gcov
and gcovr
showed that for the main libraries,
test coverage stood at 84.5% of lines and
72.5% of branches. Some parts were well tested, but many of the new
additions had only minimal coverage. The two bugs reported by the researchers
were in the HTTP3 parsing of queries, code that was initially
meant as a quick demonstration but that most
users ended up using as is. It was time to suspend further
development of Web Transport, Media over
Quic or P2P Quic until the code quality had improved.
The effort concentrated on testing three main libraries,
32,479 lines of code in total. I used the results of gcov
to pick the
worst offending areas one by one. For each of those, I wrote additional tests to
reach the uncovered areas, and sometimes managed to remove existing code that
was not useful in our scenario. At the end of the exercise, the code was
345 lines smaller, a 1.07% reduction.
In total, I did 12 PRs over the course of 3 weeks, before
reaching the goal of at least 90% line coverage and 75% branch coverage. The list is available in
GitHub issue 1781
in the picoquic repository, and summarized in this table:
| Step | Cover lines | Cover functions | Cover branches |
|---|---|---|---|
| As of 2024/11/20 | 84.5% | 89.5% | 72.5% |
| Siduck removal | 84.8% | 89.8% | 72.8% |
| Test getters and setters in the API | 85.4% | 91.6% | 73.4% |
| More frame parsing / formatting tests | 86.2% | 92.0% | 74.3% |
| Fix the spinbit tests | 86.2% | 92.1% | 74.3% |
| Add tests for BBRv1 | 86.8% | 92.7% | 74.9% |
| Add tests for TLS API | 87.3% | 93.4% | 75.2% |
| H3Zero Client Data tests | 87.6% | 93.6% | 75.6% |
| Test of transport parameter code | 87.7% | 93.8% | 75.7% |
| Demo client tests | 88.2% | 93.9% | 76.2% |
| Test of logging functions | 89.6% | 95.7% | 76.9% |
| Test of Web Transport & capsule | 89.8% | 95.8% | 77.0% |
| Remove unused packet code | 89.9% | 95.8% | 77.2% |
| Prune and test utility functions | 90.1% | 96.1% | 77.3% |
I tried to avoid writing tests that merely targeted the coverage metric. In some cases, that’s unavoidable. For example, the API includes a set of “getter” and “setter” functions, so applications can check variables without building a dependency on the implementation of the protocol. These are simple functions, but they do count in test coverage, so I ended up writing a bunch of simple tests for them. But in general, it is better to drive the test through a high-level API: that way, the test exercises not only the code in the targeted branch, but also the impact of that behavior on the rest of the system.
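To make the “simple test” case concrete, here is a minimal sketch of the getter/setter round-trip style of test, in plain C. The `demo_config_t` structure and the accessor names are hypothetical stand-ins, not the actual picoquic API:

```c
#include <stdint.h>

/* Hypothetical accessor pair, standing in for the kind of getter and
 * setter that an API exposes; these names are illustrative only. */
typedef struct st_demo_config_t {
    uint64_t idle_timeout_ms;
} demo_config_t;

static void demo_set_idle_timeout(demo_config_t* cfg, uint64_t ms)
{
    cfg->idle_timeout_ms = ms;
}

static uint64_t demo_get_idle_timeout(const demo_config_t* cfg)
{
    return cfg->idle_timeout_ms;
}

/* Round-trip test: set a value, read it back, verify they match.
 * Trivial, but it puts the accessor lines in the coverage report.
 * Returns 0 on success, -1 on failure. */
static int idle_timeout_accessor_test(void)
{
    demo_config_t cfg = { 0 };
    demo_set_idle_timeout(&cfg, 30000);
    return (demo_get_idle_timeout(&cfg) == 30000) ? 0 : -1;
}
```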
I kept the exercise focused on writing tests, changing the code as little as possible. Inspecting the code, it is obvious that some areas need improvement and possibly refactoring, particularly the way congestion control algorithms are written (see issue 1794) and the structure of the HTTP3 and Web Transport code (see issue 1795). But doing test improvement and code refactoring at the same time is risky, and could take much longer than expected.
Branch coverage at the end of the exercise remains at 77.3%. Many of the uncovered branches can only be triggered by a failure in memory allocation. There are known techniques for testing that, essentially replacing malloc with a test version, but I need to find a way to integrate this special version of malloc into the existing stress tests. An example would be driving the load on a server, accepting that some connections to the server will fail, but expecting that at least some will succeed and that the server will return to normal once the load diminishes. The problem is to isolate the malloc failures so they happen on the server but not on the client, since in our tests the client and the server run the same Quic code in the same process. Suggestions are welcome.
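For what it’s worth, here is a minimal sketch of the kind of fault-injecting allocator I have in mind, assuming the code under test can route its allocations through a replaceable function. The names below are illustrative, not the actual picoquic API:

```c
#include <stdlib.h>

/* Illustrative fault-injecting allocator; not actual picoquic API.
 * The harness arms failures only while server code is running, so
 * client allocations in the same process are spared. */
static int alloc_fault_armed = 0;     /* inject a fault only when set */
static int alloc_fail_countdown = 0;  /* which armed allocation fails */

void test_malloc_arm(int countdown)
{
    alloc_fault_armed = 1;
    alloc_fail_countdown = countdown;
}

void test_malloc_disarm(void)
{
    alloc_fault_armed = 0;
}

void* test_malloc(size_t size)
{
    if (alloc_fault_armed && --alloc_fail_countdown <= 0) {
        alloc_fault_armed = 0; /* one-shot: fail once, then recover */
        return NULL;           /* simulated allocation failure */
    }
    return malloc(size);
}
```

In a simulated test where client and server share a process, the test loop could call `test_malloc_arm()` just before stepping the server context and `test_malloc_disarm()` before stepping the client, so that only server-side allocations ever fail.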
Did I find bugs? Of course. At the beginning of the exercise, I was concerned that merely driving the test metric might not improve the code all that much. But in practice it did: many small bugs, but also a few big ones, the kind that leads to crashes or loops. Overall, it convinced me that not enforcing these test metrics earlier was a mistake!
If you want to start or join a discussion on this post, the simplest way is to send a toot on the Fediverse/Mastodon to @huitema@social.secret-wg.org.