The list of past posts including those previously published on Wordpress is available here.
A couple weeks ago, I got a message from researchers who had found a couple
of pretty nasty bugs in Picoquic. These were quickly fixed, but they gave
me a bad feeling. In theory, the core functions of Picoquic were
well tested, and bugs of this kind were fixed a long time ago.
But the code went through quite a bit of churn, as more functionality
like multipath, ack frequency and web transport were implemented.
Expectations also changed. Was the testing still sufficient? In a word, no.
The measurement using gcov
and gcovr
showed that for the main libraries,
the test coverage stood at 84.5% of lines and
72.5% of branches. Some parts were well tested, but many of the new
additions had only minimal coverage. The two bugs found by the researchers
were found in HTTP3 parsing of queries, in HTTP3 code that was initially
meant as a quick demonstration but that most
users ended up using it as is. It was now time to suspend further
development of Web Transport, Media over
Quic or P2P Quic until the code quality has improved.
The effort concentrated on testing three main libraries:
In total, that’s 32,479 lines of code. I used the results of gcov
to pick the
worst offending areas one by one. For each of those, I wrote additional tests to
reach the uncovered areas, and sometimes managed to remove existing code that
was not useful in our scenario – at the end of the exercise, the code was
345 lines smaller, or 1.07%.
In total, I did 12 PRs over the course of 3 weeks, before
hitting a goal of at least 90% lines and 75% branches. The list is available in
GitHub issue 1781
in the picoquic depot, summarized in this table:
Cover lines | Cover functions | Cover branches | |
---|---|---|---|
As of 2024/11/20 | 84.5% | 89.5% | 72.5% |
Siduck removal | 84.8% | 89.8% | 72.8% |
Test getters and setters in the API | 85.4% | 91.6% | 73.4% |
More frame parsing / formatting tests | 86.2% | 92.0% | 74.3% |
Fix the spinbit tests | 86.2% | 92.1% | 74.3% |
Add tests for BBRv1 | 86.8% | 92.7% | 74.9% |
Add tests for TLS API | 87.3% | 93.4% | 75.2% |
H3Zero Client Data tests | 87.6% | 93.6% | 75.6% |
Test of transport parameter code | 87.7% | 93.8% | 75.7% |
Demo client tests | 88.2% | 93.9% | 76.2% |
Test of logging functions | 89.6% | 95.7% | 76.9% |
Test of Web Transport & capsule | 89.8% | 95.8% | 77.0% |
Remove unused packet code | 89.9% | 95.8% | 77.2% |
Prune and test utility functions | 90.1% | 96.1% | 77.3% |
I tried to avoid writing tests that merely targeted the coverage metric. In some cases, that’s unavoidable. For example, the API includes a set of “getter” and “setter” functions, so applications can check variables without building a dependency on the implementation of the protocol. These are simple functions, but they do count in test coverage, so I ended up writing a bunch of simple tests for them. But in general, it is better to perform the test using a high level API. It won’t test only what happens when the code in the targeted branch is exercised, it will also test the impact of that behavior.
I kept the exercise focused on writing tests, changing the code as little as possible. Inspecting the code, it is obvious that some areas need improvement and possibly refactoring – particularly the way congestion control algorithms are written, see issue 1794, and the structure of the HTTP3 and Web Transport code, see issue 1795. But doing test improvement and code refactoring at the same time is risky, and could take much longer than expected.
The coverage of branches at the end of exercise remains at 77.3%. Many of the uncovered branches can only be triggered by a failure in memory allocation. There are known techniques for testing that, essentially replacing malloc by a test version, but I need to find a way to integrate this special version of malloc in the existing stress tests. An example would be driving the load on a server, accepting that some connections to the server will fail, but expecting that at least some will succeed and that the server will return to normal when the load diminishes. The problem is to isolate the malloc failures so they happen on the server but not on the client, knowing that for our tests client and server are running the same Quic code in the same process. Suggestions are welcome.
Did I find bugs? Of course. At the beginning of the exercise, I was concerned that merely driving the test metric may not improve the code all that much. But in practice it did. Many small bugs, but also a few big ones, the kind that lead to crashes or loops. Overall, it convinced me that not enforcing these test metrics earlier was a mistake!
This is a story about bugs, but also about testing. On Tuesday November 12, I received an email from a research group investigating how to test QUIC protocol implementations. During their tests they identified two bugs in Picoquic, which both involve parsing HTTP 3 protocol headers. One bug involved reading one byte out of bound, the other dereferencing a NULL parser. These bugs are now fixed in the latest version of Picoquic (version 1.1.28.0, commit ce9bdb709657c10e8b2eb0233a3d31c365a2a07c).
The team found these bugs using a QUIC protocol specific fuzzer, QUICTester/QUIC-Fuzz, which is described very briefly as a grey-box mutation-based fuzzer. They will not provide more details for now, and I understand that. Academic researchers like them have to build a record of publication. You would not want to conduct a long research and have reviewer number 2 reject your submission “because the results are already well known”. I would thus refrain from speculating how they built it. My only comment is that their tool is working, and that they could effectively test implementations of encrypted protocols.
I used simpler techniques to reproduce the bug in my implementation. For each bug, I wrote or improved a unit test that directly fuzzed the internal API responsible for the bug – a much simpler task than testing a whole protocol implementation. Once I got a stable way to reproduce the bug, the fixes were relatively simple: tightening the check of bounds during the parsing of QPack messages and prevent the out of bound access, and better controlling memory allocation when receiving HTTP3 messages from the peer, to prevent dereferencing a NULL pointer.
Of course, I should have done this kind of API fuzzing and rooted out these bugs long ago. The Picoquic test suite does include systematic tests like that for the QUIC level messages, but the HTTP3 level tests were not quite as extensive. My preferred metaphor is that exploring the code for potential bugs is like exploring a dark basement with a flashlight. Each test in the test suite shines light on a small part of the basement, finding pockets of dirt and the nests of undesirable critters. But it takes very many flash lights before everything is explored. And it takes writing many more tests until most issues are rooted out. All help for that is welcome!
Marten Seemann and I just published an entry on his blog, A p2p Vision for QUIC. This is a technical argument, explaining how a series of small improvements to the QUIC standards would suffice to make QUIC a great peer-to-peer (P2P) protocol. I personally love it, because the Internet suffers greatly from excessive centralization, with big platforms acting as gatekeepers.
Currently, pretty much all user activities on the Internet are mediated by “services in the cloud”. Take for example something as simple as a security camera in your home. People want to check the camera images when they are outside the home. They cannot connect to the camera or even a home server directly, because the home network is “behind a NAT” – and very possibly behind multiple layers of NAT, given the scarcity of IPv4 addresses. The practical solution is to export the images from the camera to a “service in the cloud”, and to let user contact that service from their cellphone. It works, but it has obvious issues for privacy, security, or even continuity of service: if the cloud service stops, the camera gets bricked.
The proposed QUIC architecture would provide a generic solution using the Masque standard for relaying QUIC, combine with QUIC NAT traversal extensions to migrate connections from “using the relay” to “directly P2P”, as shown in the following diagram.
+------------------------------+
| Step 1: phone contacts |
v the camera through +---|---+
+-----------+-+ the relay | Phone |
| Masque ^ | +---|---+
| Relay | | |
+-----------|-+ +-------------------------+
| | Step 2: phone and camera cooperate
+-----|----|-+ and establish a "hole" in the NAT.
| NAT | | |
+-----|----|-+ Step 3: traffic flows directly
| | between phone and camera without
+-|----|-+ burdening the relay
| Camera |
+--------+
The standards to achieve that are almost ready. The relaying of incoming connections is described in the UDP Listen, already adopted in the Masque Working Group. A draft submitted to the IETF QUIC Working Group describes the Using QUIC to Traverse NAT, and another describes QUIC Address Discovery. I have already implemented the Address Discovery draft in picoquic, and intend to promptly implement the other two drafts.
Using QUIC has many advantages. The communications are encrypted end to end, even when they are going through the relay, which has nice security and privacy benefits. The relay is uses standards and is not dependent on the application; if a relay goes out of business, just pick another one. In the weird cases where the NAT traversal fails, the application still works.
There are of course many more applications of P2P possible, such as for example direct video calls between two homes, video games and more. The vision is clear: we can do all those applications using QUIC.
The first investigation of loopback performance focused on the randomness of the RTT measurements on loopback, and the fix for BBR was to not make congestion control decisions based on delays if the RTT were small. After that, the results “looked good”, although there was quite a bit of variability in the results. The tests were made by sending 1GB of data. Looking at traces, it seemed that many connections were starting to slow down at the end of the 1GB transfer. I thus repeated the experiments, this time sending 10GB of data instead of 1GB. That was quite instructive: the data rate of a Cubic connection dropped by a factor 2!
The figure 1 shows the result. Each of the lines correspond to 16 connections: downloading either 1GB or 10GB between a picoquic client and a picoquic server on my laptop (Dell XPS 16, Intel(R) Core(TM) Ultra 9 185H 2.50 GHz, Windows 11 23H2) for each of BBR and Cubic, using the PERF application protocol. The Cubic line for 10 GB transfers shows data rates varying from 1.3 to 1.8 Gbps, a stark contrast with the 1,7 to 3.5 Gbps observed on the 1GB downloads.
Checking the statistics for the various connections did not show any excessive rate of packet retransmission, but it did show that the client sent only about 50,000 acknowledgements per connection, which translates to an acknowledgement every 1.1ms. That does not compare well with the average RTT of 65 microseconds. These long intervals between acknowledgements could be traced to the “ACK Frequency” implementation: the value of the “Min ACK delay” was set to 1ms, and the value of the “max gap” was set to 1/4th of the congestion window. Detailed traces could show that this congestion window was initially quite large, which explained why ACKs were so infrequent. Infrequent ACKs means that the Cubic code was not getting feedback quickly enough, and reacted by slowing down over time.
I tried a series of more or less complex fixes, but the simplest and most effective one was to cap the ack gap value to 32 if the minimum RTT was smaller than 4 times the ACK delay. As you can see in figure 2, the effect on the Cubic performance was drastic. The average data rate jumped to 3.4 Gbps, and the variability was much reduced, from 3.3 to 3.5 Gbps. The effect on BBR was also positive, but less marked.
An ACK gap of 32 corresponds to about 100 microseconds between ACK in the Cubic trials. This is still much larger than the min RTT, but close to the average smoothed RTT. Trying to get more frequent acknowledgements was counter productive, probably because the processing of more acknowledgements increased the CPU overhead. Trying higher values like 64 was also counterproductive, because Cubic would slow down the connection.
This experiment and the previous one outlined that “very high bandwidth very low latency” connections are a special case. The code is close to saturating the CPU, which forces a reduction in the ACK rate, with the average delay between acknowledgements often larger than the actual min. The variations of delay measurement for an application level transport are quite high, which forces us to not use delay measurements to detect queues. The main control variable is then the rate of packet losses, which ended up between 0.4 and 0.5% in the experiments.
It would be nice if the interface could implement ECN markings, so congestion control could react to increased queues before losing packets. Failing that, it might make sense to monitor the “bottleneck data rate”, not necessarily to deduce a pacing rate like BBR does, but at least to stop increasing the congestion window if the data rate is not increasing. But that’s for further research.
I did a quick experiment last week, measuring performance of picoquic on a loopback interface. The goal was to look at software performance, but I wanted to tease out how much of the performance depends on CPU consumption and how much depends on control, in particular congestion control. I compared connections using Cubic and my implementation of BBRv3. I observed that BBRv3 was significantly slower than Cubic. The investigation pointed out the dependency of BBRv3 on RTT measurements, which in application level implementations contain a lot of jitter caused by process scheduling during calls to OS APIs. This blog presents the problem, the analysis, the first fixes, and the open question of bandwidth measurement in low latency networks.
The result of the loopback tests are displayed in figure 1. I ran 16 QUIC connections between a picoquic client and a picoquic server on my laptop (Dell XPS 16, Intel(R) Core(TM) Ultra 9 185H 2.50 GHz, Windows 11 23H2) for each of BBR and Cubic, downloading 1GB from the server using the PERF application protocol. The figure shows the observed values of the data rate in Mbps, 16 numbers for each congestion control protocol, organized for each protocol in a line from lowest to highest. The first tests showed that BBR is markedly slower, averaging 1.7Gbps versus 3 Gbps for Cubic. The BBR data rates are much more variable, between 1.1 and 2.4 Gbps, versus between 2.6 and 3.8 Gbps for Cubic.
To analyze the results, I had to add “in memory” logging to the picoquic stack. The current logging writes on disk, which impacts performance, and thus cannot easily be used to understand performance at high speed. I instrumented the code to log a bunch of variables in a large table in memory. That requires allocating a couple 100MB at program start for logging one connection, so it is not something that should be done in production. But it does work on single connections, and the impact on perf is OK. I dug into 2 connections: one using Cubic, getting 3Gbps on loopback on Windows; another using BBR and getting 2Gbps.
The Cubic connection is easy to understand: Cubic basically goes out of the way. The CWND is always much larger than the bytes in transit, the pacing rate is very high. The sender ends up CPU bound. On the other hand, the analysis of the BBR logs showed that the sending rate was clearly limited by the pacing rate set by BBR.
The figure 2 shows the evolution of the pacing rate (dotted orange line) and the measured throughput (dark green line) over 4 seconds of connection. The figure also shows the RTT measurements (downsampled from 700000 to about 20000) in microseconds at the bottom of the graph, and the evolution of the BBR internal state as a light blue line. The graph shows that throughput is clearly limited by the pacing rate, but that the pacing rate varies widely. This was surprising, because BBR v3 normally alternates between probing phases and long periods in “cruise” state. The graph also shows the variations of the RTT, with RTT spikes present at about the same times as the transitions in the pacing rate.
The figure 3 displays an histogram of the RTT samples throughout the connection. The values are widely dispersed, from a minimum of 20 microseconds to outliers up to 1400 microseconds. Congestion controls based on delay measurements assume that increases in delay indicate queues building up in the network, and should be taken as a signal to slow down the connection. However, the graph on figure 2 does not show a clear correlation between sending rate and queue buildup.
Picoquic runs in the application context. Receiving and acknowledgement requires waiting for an event until the asynchronous receive call is complete, and then receiving the data. Previous measurements show that the delay between raising an event on a thread and return from waiting on another is usually very short, maybe a few microseconds, but can sometime be delays by tens of microseconds, or maybe hundreds. These delays will happen randomly. If the path capacity is a few hundred megabits or less, these random delays are much lower than the delays caused by transmission queues. But in our case, these delays can have a nasty effect.
The BBRv3 implementation in picoquic includes improvements to avoid queue buildups and maintain the responsiveness of real time application, see this draft. Two of the proposed improvements are “exiting startup early if the delays increase” and “exiting the probeBW UP state early if the delay increases”. This works quite well on most long distance Internet connection, but in our case, they become “exit the probe state randomly because the measure of delay increased randomly”. Not good.
The solution was to disable these two “delay based” tests in BBR if the minimum delay of the path is very short. As a first test, we define “very short” as “less than 128 microseconds”. We may have to review that value in the future, but figure 4 shows that disabling the two tests results in a marked improvement. Figure 4 is identical to figure 1, except for an extra line showing 16 tests with our improved version of BBR. This version achieves data rates between 1.4 and 3.5 Gbps, and is on average 50% faster than the unmodified version. Lots of results for changing just two lines of code, but as usual the problem is to find out which lines…
The results are improved, but BBR is still about 15% slower than Cubic in this scenario. I suspect that this is because the randomness in delay measurements also causes randomness in the bandwidth measurements, and that this randomness somehow causes slow-downs. In figure 2, we see periods in which BBR is in “cruise” state, yet the pacing rate suddenly drops. This could be caused by BBR receiving an abnormally slow measurement and dropping the pacing rate accordingly. The next bandwidth measurements are limited by that new pacing rate, so the rate has effectively “drifted down”. Another problem that we will need to investigate…