This is a story about bugs, but also about testing. On Tuesday November 12, I received an email from a research group investigating how to test QUIC protocol implementations. During their tests they identified two bugs in Picoquic, both of which involve parsing HTTP3 protocol headers. One bug involved reading one byte out of bounds, the other dereferencing a NULL pointer. These bugs are now fixed in the latest version of Picoquic (version 1.1.28.0, commit ce9bdb709657c10e8b2eb0233a3d31c365a2a07c).
The team found these bugs using a QUIC protocol specific fuzzer, QUICTester/QUIC-Fuzz, which is described very briefly as a grey-box mutation-based fuzzer. They will not provide more details for now, and I understand that. Academic researchers like them have to build a record of publication. You would not want to conduct a long research project and have reviewer number 2 reject your submission “because the results are already well known”. I will thus refrain from speculating about how they built it. My only comment is that their tool works, and that with it they can effectively test implementations of encrypted protocols.
I used simpler techniques to reproduce the bugs in my implementation. For each bug, I wrote or improved a unit test that directly fuzzed the internal API responsible for the bug – a much simpler task than testing a whole protocol implementation. Once I had a stable way to reproduce the bugs, the fixes were relatively simple: tightening the bounds checks during the parsing of QPack messages to prevent the out-of-bounds access, and better controlling memory allocation when receiving HTTP3 messages from the peer, to prevent dereferencing a NULL pointer.
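To illustrate the approach (this is not the actual picoquic test code), here is a minimal sketch of such an API-level fuzz test in C, assuming a hypothetical parser entry point `h3_parse_qpack_header()`:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical parser entry point; the real picoquic/h3zero API differs. */
int h3_parse_qpack_header(const uint8_t* bytes, size_t length);

/* Tiny deterministic PRNG so a failing input can be reproduced from its seed. */
static uint64_t fuzz_rand(uint64_t* state)
{
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    return *state;
}

/* Feed the parser short random buffers, including empty and truncated ones,
 * so out-of-bounds reads show up under ASAN or in a debugger. */
int qpack_parser_fuzz_test(void)
{
    uint64_t seed = 0xdeadbeefULL;
    uint8_t buffer[64];

    for (int i = 0; i < 100000; i++) {
        size_t length = (size_t)(fuzz_rand(&seed) % sizeof(buffer));
        for (size_t j = 0; j < length; j++) {
            buffer[j] = (uint8_t)fuzz_rand(&seed);
        }
        /* The parser must reject malformed input without reading past 'length'. */
        (void)h3_parse_qpack_header(buffer, length);
    }
    return 0;
}
```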
Of course, I should have done this kind of API fuzzing and rooted out these bugs long ago. The Picoquic test suite does include systematic tests like that for the QUIC level messages, but the HTTP3 level tests were not quite as extensive. My preferred metaphor is that exploring the code for potential bugs is like exploring a dark basement with a flashlight. Each test in the test suite shines light on a small part of the basement, finding pockets of dirt and the nests of undesirable critters. But it takes very many flashlights before everything is explored. And it takes writing many more tests until most issues are rooted out. All help for that is welcome!
Marten Seemann and I just published an entry on his blog, A p2p Vision for QUIC. This is a technical argument, explaining how a series of small improvements to the QUIC standards would suffice to make QUIC a great peer-to-peer (P2P) protocol. I personally love it, because the Internet suffers greatly from excessive centralization, with big platforms acting as gatekeepers.
Currently, pretty much all user activities on the Internet are mediated by “services in the cloud”. Take for example something as simple as a security camera in your home. People want to check the camera images when they are outside the home. They cannot connect to the camera or even a home server directly, because the home network is “behind a NAT” – and very possibly behind multiple layers of NAT, given the scarcity of IPv4 addresses. The practical solution is to export the images from the camera to a “service in the cloud”, and to let users contact that service from their cellphones. It works, but it has obvious issues for privacy, security, and even continuity of service: if the cloud service stops, the camera gets bricked.
The proposed QUIC architecture would provide a generic solution using the Masque standard for relaying QUIC, combined with QUIC NAT traversal extensions to migrate connections from “using the relay” to “directly P2P”, as shown in the following diagram.
```
      +------------------------------+
      | Step 1: phone contacts       |
      v the camera through           +---+---+
+-----+-------+      the relay       | Phone |
|   Masque    |                      +---+---+
|   Relay     |                          |
+-----+-------+          +---------------+
      ^                  |   Step 2: phone and camera cooperate
+-----+------------------+--+  and establish a "hole" in the NAT.
|     |       NAT        |  |
+-----+------------------+--+  Step 3: traffic flows directly
      |                  |     between phone and camera without
    +-+------------------+-+   burdening the relay
    |        Camera        |
    +----------------------+
```
The standards to achieve that are almost ready. The relaying of incoming connections is described in the UDP Listen draft, already adopted by the Masque Working Group. A draft submitted to the IETF QUIC Working Group describes Using QUIC to Traverse NATs, and another describes QUIC Address Discovery. I have already implemented the Address Discovery draft in picoquic, and intend to promptly implement the other two drafts.
Using QUIC has many advantages. The communications are encrypted end to end, even when they are going through the relay, which has nice security and privacy benefits. The relay uses standards and is not dependent on the application; if a relay goes out of business, just pick another one. In the weird cases where the NAT traversal fails, the application still works.
There are of course many more possible P2P applications, such as direct video calls between two homes, video games, and more. The vision is clear: we can do all those applications using QUIC.
The first investigation of loopback performance focused on the randomness of the RTT measurements on loopback, and the fix for BBR was to not make congestion control decisions based on delays if the RTT was small. After that, the results “looked good”, although there was quite a bit of variability. The tests were made by sending 1GB of data. Looking at traces, it seemed that many connections were starting to slow down at the end of the 1GB transfer. I thus repeated the experiments, this time sending 10GB of data instead of 1GB. That was quite instructive: the data rate of a Cubic connection dropped by a factor of 2!
Figure 1 shows the result. Each line corresponds to 16 connections, downloading either 1GB or 10GB between a picoquic client and a picoquic server on my laptop (Dell XPS 16, Intel(R) Core(TM) Ultra 9 185H 2.50 GHz, Windows 11 23H2), for each of BBR and Cubic, using the PERF application protocol. The Cubic line for 10GB transfers shows data rates varying from 1.3 to 1.8 Gbps, a stark contrast with the 1.7 to 3.5 Gbps observed on the 1GB downloads.
Checking the statistics for the various connections did not show any excessive rate of packet retransmission, but it did show that the client sent only about 50,000 acknowledgements per connection, which translates to an acknowledgement every 1.1ms. That does not compare well with the average RTT of 65 microseconds. These long intervals between acknowledgements could be traced to the “ACK Frequency” implementation: the value of the “Min ACK delay” was set to 1ms, and the value of the “max gap” was set to 1/4th of the congestion window. Detailed traces showed that this congestion window was initially quite large, which explained why ACKs were so infrequent. Infrequent ACKs mean that the Cubic code was not getting feedback quickly enough, and it reacted by slowing down over time.
I tried a series of more or less complex fixes, but the simplest and most effective one was to cap the ACK gap value at 32 if the minimum RTT was smaller than 4 times the ACK delay. As you can see in figure 2, the effect on the Cubic performance was drastic. The average data rate jumped to 3.4 Gbps, and the variability was much reduced, from 3.3 to 3.5 Gbps. The effect on BBR was also positive, but less marked.
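For illustration, here is a minimal sketch of that kind of cap; the names are hypothetical and the real picoquic ACK frequency logic is more involved:

```c
#include <stdint.h>

#define ACK_GAP_CAP 32

/* Sketch of the capping logic: derive the ACK gap from the congestion window,
 * but clamp it when the path RTT is so small that a CWND-based gap would
 * starve the sender of feedback. Names are illustrative, not the picoquic API. */
uint64_t compute_ack_gap(uint64_t cwnd, uint64_t mtu, uint64_t min_rtt_us, uint64_t ack_delay_us)
{
    /* Default: one ACK per 1/4th of the congestion window, counted in packets. */
    uint64_t ack_gap = (mtu > 0) ? (cwnd / mtu) / 4 : 2;

    if (ack_gap < 2) {
        ack_gap = 2;
    }
    if (min_rtt_us < 4 * ack_delay_us && ack_gap > ACK_GAP_CAP) {
        ack_gap = ACK_GAP_CAP; /* cap at 32 on very low latency paths */
    }
    return ack_gap;
}
```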
An ACK gap of 32 corresponds to about 100 microseconds between ACKs in the Cubic trials. This is still much larger than the min RTT, but close to the average smoothed RTT. Trying to get more frequent acknowledgements was counterproductive, probably because processing more acknowledgements increased the CPU overhead. Trying higher values like 64 was also counterproductive, because Cubic would slow down the connection.
This experiment and the previous one showed that “very high bandwidth, very low latency” connections are a special case. The code is close to saturating the CPU, which forces a reduction in the ACK rate, with the average delay between acknowledgements often larger than the actual min RTT. The variations of delay measurements for an application level transport are quite high, which forces us not to use delay measurements to detect queues. The main control variable is then the rate of packet losses, which ended up between 0.4 and 0.5% in the experiments.
It would be nice if the interface could implement ECN markings, so congestion control could react to increased queues before losing packets. Failing that, it might make sense to monitor the “bottleneck data rate”, not necessarily to deduce a pacing rate like BBR does, but at least to stop increasing the congestion window if the data rate is not increasing. But that’s for further research.
I did a quick experiment last week, measuring performance of picoquic on a loopback interface. The goal was to look at software performance, but I wanted to tease out how much of the performance depends on CPU consumption and how much depends on control, in particular congestion control. I compared connections using Cubic and my implementation of BBRv3. I observed that BBRv3 was significantly slower than Cubic. The investigation pointed out the dependency of BBRv3 on RTT measurements, which in application level implementations contain a lot of jitter caused by process scheduling during calls to OS APIs. This blog presents the problem, the analysis, the first fixes, and the open question of bandwidth measurement in low latency networks.
The results of the loopback tests are displayed in figure 1. I ran 16 QUIC connections between a picoquic client and a picoquic server on my laptop (Dell XPS 16, Intel(R) Core(TM) Ultra 9 185H 2.50 GHz, Windows 11 23H2) for each of BBR and Cubic, downloading 1GB from the server using the PERF application protocol. The figure shows the observed values of the data rate in Mbps, 16 numbers for each congestion control protocol, organized for each protocol in a line from lowest to highest. The first tests showed that BBR is markedly slower, averaging 1.7 Gbps versus 3 Gbps for Cubic. The BBR data rates are also much more variable, between 1.1 and 2.4 Gbps, versus between 2.6 and 3.8 Gbps for Cubic.
To analyze the results, I had to add “in memory” logging to the picoquic stack. The current logging writes to disk, which impacts performance, and thus cannot easily be used to understand performance at high speed. I instrumented the code to log a bunch of variables in a large table in memory. That requires allocating a couple hundred megabytes at program start for logging one connection, so it is not something that should be done in production. But it does work on single connections, and the impact on performance is acceptable. I dug into two connections: one using Cubic, getting 3 Gbps on loopback on Windows; another using BBR and getting 2 Gbps.
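As an illustration of the idea (not the actual picoquic instrumentation), a minimal sketch of such preallocated in-memory logging could look like this:

```c
#include <stdint.h>
#include <stdlib.h>

/* One fixed-size record per sample; the fields are illustrative. */
typedef struct st_perf_log_sample_t {
    uint64_t time_us;
    uint64_t cwnd;
    uint64_t pacing_rate;
    uint64_t rtt_sample_us;
    uint64_t bytes_in_transit;
} perf_log_sample_t;

typedef struct st_perf_log_t {
    perf_log_sample_t* samples;
    size_t capacity;
    size_t count;
} perf_log_t;

/* Preallocate the whole table up front (a few million 40-byte samples is a
 * couple hundred megabytes) so logging never allocates or touches the disk
 * on the hot path. */
int perf_log_init(perf_log_t* log, size_t capacity)
{
    log->samples = (perf_log_sample_t*)malloc(capacity * sizeof(perf_log_sample_t));
    log->capacity = (log->samples == NULL) ? 0 : capacity;
    log->count = 0;
    return (log->samples == NULL) ? -1 : 0;
}

/* Appending a sample is just a bounds check and a struct copy. */
void perf_log_append(perf_log_t* log, const perf_log_sample_t* sample)
{
    if (log->count < log->capacity) {
        log->samples[log->count++] = *sample;
    }
}
```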
The Cubic connection is easy to understand: Cubic basically gets out of the way. The CWND is always much larger than the bytes in transit, and the pacing rate is very high. The sender ends up CPU bound. On the other hand, the analysis of the BBR logs showed that the sending rate was clearly limited by the pacing rate set by BBR.
Figure 2 shows the evolution of the pacing rate (dotted orange line) and the measured throughput (dark green line) over 4 seconds of the connection. The figure also shows the RTT measurements (downsampled from 700,000 to about 20,000 samples) in microseconds at the bottom of the graph, and the evolution of the BBR internal state as a light blue line. The graph shows that throughput is clearly limited by the pacing rate, but that the pacing rate varies widely. This was surprising, because BBR v3 normally alternates between probing phases and long periods in “cruise” state. The graph also shows the variations of the RTT, with RTT spikes present at about the same times as the transitions in the pacing rate.
Figure 3 displays a histogram of the RTT samples throughout the connection. The values are widely dispersed, from a minimum of 20 microseconds to outliers up to 1400 microseconds. Congestion controls based on delay measurements assume that increases in delay indicate queues building up in the network, and should be taken as a signal to slow down the connection. However, the graph in figure 2 does not show a clear correlation between sending rate and queue buildup.
Picoquic runs in the application context. Receiving an acknowledgement requires waiting for an event until the asynchronous receive call completes, and then receiving the data. Previous measurements show that the delay between raising an event on one thread and returning from the wait on another is usually very short, maybe a few microseconds, but can sometimes stretch to tens of microseconds, or maybe hundreds. These delays happen randomly. If the path capacity is a few hundred megabits per second or less, these random delays are much lower than the delays caused by transmission queues. But in our case, they can have a nasty effect.
The BBRv3 implementation in picoquic includes improvements to avoid queue buildups and maintain the responsiveness of real time applications, see this draft. Two of the proposed improvements are “exiting startup early if the delays increase” and “exiting the probeBW UP state early if the delay increases”. This works quite well on most long distance Internet connections, but in our case, they become “exit the probe state randomly because the measured delay increased randomly”. Not good.
The solution was to disable these two “delay based” tests in BBR if the minimum delay of the path is very short. As a first test, we define “very short” as “less than 128 microseconds”. We may have to review that value in the future, but figure 4 shows that disabling the two tests results in a marked improvement. Figure 4 is identical to figure 1, except for an extra line showing 16 tests with our improved version of BBR. This version achieves data rates between 1.4 and 3.5 Gbps, and is on average 50% faster than the unmodified version. Lots of results for changing just two lines of code, but as usual the problem is to find out which lines…
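In spirit, the change amounts to a guard like the following sketch, with hypothetical names standing in for the actual picoquic BBR variables:

```c
#include <stdint.h>

/* Threshold below which delay measurements are considered too noisy to act on.
 * The 128 microsecond value is the first guess discussed above. */
#define BBR_DELAY_NOISE_THRESHOLD_US 128

/* Sketch: only let the "exit startup on delay increase" and
 * "exit probeBW UP on delay increase" heuristics fire when the path's
 * minimum RTT is large enough for delay variations to be meaningful. */
int bbr_delay_signal_is_usable(uint64_t path_min_rtt_us)
{
    return path_min_rtt_us >= BBR_DELAY_NOISE_THRESHOLD_US;
}

/* Example use inside the startup exit decision (illustrative, not the real code). */
int bbr_should_exit_startup_on_delay(uint64_t path_min_rtt_us, int delay_increase_detected)
{
    return bbr_delay_signal_is_usable(path_min_rtt_us) && delay_increase_detected;
}
```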
The results are improved, but BBR is still about 15% slower than Cubic in this scenario. I suspect that this is because the randomness in delay measurements also causes randomness in the bandwidth measurements, and that this randomness somehow causes slow-downs. In figure 2, we see periods in which BBR is in “cruise” state, yet the pacing rate suddenly drops. This could be caused by BBR receiving an abnormally slow measurement and dropping the pacing rate accordingly. The next bandwidth measurements are limited by that new pacing rate, so the rate has effectively “drifted down”. Another problem that we will need to investigate…
I just fixed another bad bug in Picoquic. The bug, found by Asakura Mizu, relates to unbounded storage of TLS stream data.
The issue dates to the decision back in 2017 to embed TLS in QUIC, and to send the TLS data in a series of “crypto streams”. Initially, the crypto stream was just a data stream with the number 0, but after discussion we modeled the TLS stream as a set of 3 “crypto streams”, one for each of the three stages of the TLS handshake: initial mode, handshake mode, and application mode.
The crypto streams are a calque of the data streams, with two key differences: they use their own “CRYPTO_HS” frames, which are similar to but different from the stream data frames, and there is no flow control. Instead of flow control, Section 7.5 of [RFC9000](https://www.rfc-editor.org/rfc/rfc9000.html) suggests limiting the “data received in out-of-order CRYPTO frames”. The idea is that data received “in order” can be fed directly to the TLS controller, and thus does not need to be buffered, so the limit only applies to “out of order” data. This is a bit more lenient than simply limiting the overall size of TLS messages, but not by much: if a TLS message is split into several packets and the first one is lost, the amount of out of order data will be more or less the size of the message.
Somehow, I missed the memo and did not implement this “crypto flow control” in Picoquic. Thanks to Asakura Mizu for reminding me! The fix was easy: add a check of the amount of out of order data in the code, then add a test in the test suite to verify it. To do that test I had to use error injection, which should have been simple but needed a fix. Of course I made a mistake when writing this fix, but the fuzz test in the test suite caught the error. The usual.
And then, I found out that the 4096 bytes limit suggested in Section 7.5 of [RFC9000](https://www.rfc-editor.org/rfc/rfc9000.html) caused one existing test to fail. That test verifies that the code can abide by the amplification limit of 3 specified in RFC 9000 even if the server’s first flight is very large, 8KB in my test, requiring 8 packets. The test anticipates the large messages that may be needed if we handle the large keys of Post Quantum algorithms. The test also includes injection of packet losses, and thus almost 8KB of “out of order crypto data”, which showed that the 4KB limit was too low. I bumped it to 16KB in the code, and that works now.
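Conceptually, the guard is a one-liner; here is a hedged sketch with made-up names rather than the actual picoquic structures:

```c
#include <stdint.h>

#define CRYPTO_OUT_OF_ORDER_LIMIT 16384 /* bumped from the 4096 suggested in RFC 9000 */

/* Sketch of the check applied when a CRYPTO frame cannot be delivered in order
 * and must be buffered: if the total buffered amount would exceed the limit,
 * the connection should be closed with CRYPTO_BUFFER_EXCEEDED instead of
 * allocating more memory. */
int crypto_stream_can_buffer(uint64_t bytes_already_buffered, uint64_t frame_length)
{
    return (bytes_already_buffered + frame_length) <= CRYPTO_OUT_OF_ORDER_LIMIT;
}
```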
I can’t help thinking that this “maximum amount of out of order TLS data” should be exposed in the QUIC protocol. The current state is that any host picks a limit, but that means senders do not know exactly how much data they can send. It leads to a game of handshake roulette: send more data than the peer’s limit and it will work very well most of the time, but cause a connection failure if the wrong packet is lost. A sender that wants to be safe should treat the limit as a flow control window, initially set to the lowest common denominator (4096 bytes) and opening progressively as packets are acknowledged. That will work, but it will be very slow. Maybe we should have a transport parameter announcing the local value of the window. It would not help the very first packets, but it would help everything else. In particular, if the largest TLS message is the server’s first flight, knowing the client’s flow control limit would help sending that flight quickly and safely.
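A small sketch of that conservative sender strategy, again with hypothetical names: treat the peer’s unknown buffer as a window anchored at the highest acknowledged crypto offset.

```c
#include <stdint.h>

#define ASSUMED_PEER_CRYPTO_WINDOW 4096 /* lowest common denominator from RFC 9000 */

/* Sketch: never have more than the assumed window of crypto data outstanding
 * beyond what the peer has acknowledged in order. As acknowledgements arrive,
 * 'acked_offset' advances and more of the first flight can be sent.
 * Assumes sent_offset >= acked_offset. */
uint64_t crypto_bytes_allowed_now(uint64_t acked_offset, uint64_t sent_offset)
{
    uint64_t in_flight = sent_offset - acked_offset;
    return (in_flight >= ASSUMED_PEER_CRYPTO_WINDOW) ? 0 :
        ASSUMED_PEER_CRYPTO_WINDOW - in_flight;
}
```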