Christian Huitema's Latest Posts

Cloudy sky, waves on the sea, the sun is 
shining

The list of past posts including those previously published on Wordpress is available here.

Picoquic test coverage now at 90%

Posted on 04 Dec 2024

A couple weeks ago, I got a message from researchers who had found a couple of pretty nasty bugs in Picoquic. These were quickly fixed, but they gave me a bad feeling. In theory, the core functions of Picoquic were well tested, and bugs of this kind were fixed a long time ago. But the code went through quite a bit of churn, as more functionality like multipath, ack frequency and web transport were implemented. Expectations also changed. Was the testing still sufficient? In a word, no. The measurement using gcov and gcovr showed that for the main libraries, the test coverage stood at 84.5% of lines and 72.5% of branches. Some parts were well tested, but many of the new additions had only minimal coverage. The two bugs found by the researchers were found in HTTP3 parsing of queries, in HTTP3 code that was initially meant as a quick demonstration but that most users ended up using it as is. It was now time to suspend further development of Web Transport, Media over Quic or P2P Quic until the code quality has improved.

The effort concentrated on testing three main libraries:

In total, that’s 32,479 lines of code. I used the results of gcov to pick the worst offending areas one by one. For each of those, I wrote additional tests to reach the uncovered areas, and sometimes managed to remove existing code that was not useful in our scenario – at the end of the exercise, the code was 345 lines smaller, or 1.07%. In total, I did 12 PRs over the course of 3 weeks, before hitting a goal of at least 90% lines and 75% branches. The list is available in GitHub issue 1781 in the picoquic depot, summarized in this table:

  Cover lines Cover functions Cover branches
As of 2024/11/20 84.5% 89.5% 72.5%
Siduck removal 84.8% 89.8% 72.8%
Test getters and setters in the API 85.4% 91.6% 73.4%
More frame parsing / formatting tests 86.2% 92.0% 74.3%
Fix the spinbit tests 86.2% 92.1% 74.3%
Add tests for BBRv1 86.8% 92.7% 74.9%
Add tests for TLS API 87.3% 93.4% 75.2%
H3Zero Client Data tests 87.6% 93.6% 75.6%
Test of transport parameter code 87.7% 93.8% 75.7%
Demo client tests 88.2% 93.9% 76.2%
Test of logging functions 89.6% 95.7% 76.9%
Test of Web Transport & capsule 89.8% 95.8% 77.0%
Remove unused packet code 89.9% 95.8% 77.2%
Prune and test utility functions 90.1% 96.1% 77.3%

I tried to avoid writing tests that merely targeted the coverage metric. In some cases, that’s unavoidable. For example, the API includes a set of “getter” and “setter” functions, so applications can check variables without building a dependency on the implementation of the protocol. These are simple functions, but they do count in test coverage, so I ended up writing a bunch of simple tests for them. But in general, it is better to perform the test using a high level API. It won’t test only what happens when the code in the targeted branch is exercised, it will also test the impact of that behavior.

I kept the exercise focused on writing tests, changing the code as little as possible. Inspecting the code, it is obvious that some areas need improvement and possibly refactoring – particularly the way congestion control algorithms are written, see issue 1794, and the structure of the HTTP3 and Web Transport code, see issue 1795. But doing test improvement and code refactoring at the same time is risky, and could take much longer than expected.

The coverage of branches at the end of exercise remains at 77.3%. Many of the uncovered branches can only be triggered by a failure in memory allocation. There are known techniques for testing that, essentially replacing malloc by a test version, but I need to find a way to integrate this special version of malloc in the existing stress tests. An example would be driving the load on a server, accepting that some connections to the server will fail, but expecting that at least some will succeed and that the server will return to normal when the load diminishes. The problem is to isolate the malloc failures so they happen on the server but not on the client, knowing that for our tests client and server are running the same Quic code in the same process. Suggestions are welcome.

Did I find bugs? Of course. At the beginning of the exercise, I was concerned that merely driving the test metric may not improve the code all that much. But in practice it did. Many small bugs, but also a few big ones, the kind that lead to crashes or loops. Overall, it convinced me that not enforcing these test metrics earlier was a mistake!

Thanks to QUICTester/QUIC-Fuzz for finding 2 bugs in Picoquic

Posted on 17 Nov 2024

This is a story about bugs, but also about testing. On Tuesday November 12, I received an email from a research group investigating how to test QUIC protocol implementations. During their tests they identified two bugs in Picoquic, which both involve parsing HTTP 3 protocol headers. One bug involved reading one byte out of bound, the other dereferencing a NULL parser. These bugs are now fixed in the latest version of Picoquic (version 1.1.28.0, commit ce9bdb709657c10e8b2eb0233a3d31c365a2a07c).

The team found these bugs using a QUIC protocol specific fuzzer, QUICTester/QUIC-Fuzz, which is described very briefly as a grey-box mutation-based fuzzer. They will not provide more details for now, and I understand that. Academic researchers like them have to build a record of publication. You would not want to conduct a long research and have reviewer number 2 reject your submission “because the results are already well known”. I would thus refrain from speculating how they built it. My only comment is that their tool is working, and that they could effectively test implementations of encrypted protocols.

I used simpler techniques to reproduce the bug in my implementation. For each bug, I wrote or improved a unit test that directly fuzzed the internal API responsible for the bug – a much simpler task than testing a whole protocol implementation. Once I got a stable way to reproduce the bug, the fixes were relatively simple: tightening the check of bounds during the parsing of QPack messages and prevent the out of bound access, and better controlling memory allocation when receiving HTTP3 messages from the peer, to prevent dereferencing a NULL pointer.

Of course, I should have done this kind of API fuzzing and rooted out these bugs long ago. The Picoquic test suite does include systematic tests like that for the QUIC level messages, but the HTTP3 level tests were not quite as extensive. My preferred metaphor is that exploring the code for potential bugs is like exploring a dark basement with a flashlight. Each test in the test suite shines light on a small part of the basement, finding pockets of dirt and the nests of undesirable critters. But it takes very many flash lights before everything is explored. And it takes writing many more tests until most issues are rooted out. All help for that is welcome!

Going Peer to Peer with QUIC

Posted on 25 Oct 2024

Marten Seemann and I just published an entry on his blog, A p2p Vision for QUIC. This is a technical argument, explaining how a series of small improvements to the QUIC standards would suffice to make QUIC a great peer-to-peer (P2P) protocol. I personally love it, because the Internet suffers greatly from excessive centralization, with big platforms acting as gatekeepers.

Currently, pretty much all user activities on the Internet are mediated by “services in the cloud”. Take for example something as simple as a security camera in your home. People want to check the camera images when they are outside the home. They cannot connect to the camera or even a home server directly, because the home network is “behind a NAT” – and very possibly behind multiple layers of NAT, given the scarcity of IPv4 addresses. The practical solution is to export the images from the camera to a “service in the cloud”, and to let user contact that service from their cellphone. It works, but it has obvious issues for privacy, security, or even continuity of service: if the cloud service stops, the camera gets bricked.

The proposed QUIC architecture would provide a generic solution using the Masque standard for relaying QUIC, combine with QUIC NAT traversal extensions to migrate connections from “using the relay” to “directly P2P”, as shown in the following diagram.

            +------------------------------+
            |    Step 1: phone contacts    |
            v    the camera through    +---|---+
+-----------+-+  the relay             | Phone |
| Masque    ^ |                        +---|---+
| Relay     | |                            |
+-----------|-+  +-------------------------+
            |    |    Step 2: phone and camera cooperate
      +-----|----|-+  and establish a "hole" in the NAT.
      | NAT |    | |
      +-----|----|-+  Step 3: traffic flows directly
            |    |    between phone and camera without
          +-|----|-+  burdening the relay
          | Camera |
          +--------+

The standards to achieve that are almost ready. The relaying of incoming connections is described in the UDP Listen, already adopted in the Masque Working Group. A draft submitted to the IETF QUIC Working Group describes the Using QUIC to Traverse NAT, and another describes QUIC Address Discovery. I have already implemented the Address Discovery draft in picoquic, and intend to promptly implement the other two drafts.

Using QUIC has many advantages. The communications are encrypted end to end, even when they are going through the relay, which has nice security and privacy benefits. The relay is uses standards and is not dependent on the application; if a relay goes out of business, just pick another one. In the weird cases where the NAT traversal fails, the application still works.

There are of course many more applications of P2P possible, such as for example direct video calls between two homes, video games and more. The vision is clear: we can do all those applications using QUIC.

Loopback ACK delays cause Cubic slowdowns

Posted on 23 Oct 2024

The first investigation of loopback performance focused on the randomness of the RTT measurements on loopback, and the fix for BBR was to not make congestion control decisions based on delays if the RTT were small. After that, the results “looked good”, although there was quite a bit of variability in the results. The tests were made by sending 1GB of data. Looking at traces, it seemed that many connections were starting to slow down at the end of the 1GB transfer. I thus repeated the experiments, this time sending 10GB of data instead of 1GB. That was quite instructive: the data rate of a Cubic connection dropped by a factor 2!

figure 1: Comparing loopback data
 rate for 1GB and 10GB

The figure 1 shows the result. Each of the lines correspond to 16 connections: downloading either 1GB or 10GB between a picoquic client and a picoquic server on my laptop (Dell XPS 16, Intel(R) Core(TM) Ultra 9 185H 2.50 GHz, Windows 11 23H2) for each of BBR and Cubic, using the PERF application protocol. The Cubic line for 10 GB transfers shows data rates varying from 1.3 to 1.8 Gbps, a stark contrast with the 1,7 to 3.5 Gbps observed on the 1GB downloads.

Checking the statistics for the various connections did not show any excessive rate of packet retransmission, but it did show that the client sent only about 50,000 acknowledgements per connection, which translates to an acknowledgement every 1.1ms. That does not compare well with the average RTT of 65 microseconds. These long intervals between acknowledgements could be traced to the “ACK Frequency” implementation: the value of the “Min ACK delay” was set to 1ms, and the value of the “max gap” was set to 1/4th of the congestion window. Detailed traces could show that this congestion window was initially quite large, which explained why ACKs were so infrequent. Infrequent ACKs means that the Cubic code was not getting feedback quickly enough, and reacted by slowing down over time.

I tried a series of more or less complex fixes, but the simplest and most effective one was to cap the ack gap value to 32 if the minimum RTT was smaller than 4 times the ACK delay. As you can see in figure 2, the effect on the Cubic performance was drastic. The average data rate jumped to 3.4 Gbps, and the variability was much reduced, from 3.3 to 3.5 Gbps. The effect on BBR was also positive, but less marked.

figure 2: Loopback data rate before and after capping
ACK gap to 32

An ACK gap of 32 corresponds to about 100 microseconds between ACK in the Cubic trials. This is still much larger than the min RTT, but close to the average smoothed RTT. Trying to get more frequent acknowledgements was counter productive, probably because the processing of more acknowledgements increased the CPU overhead. Trying higher values like 64 was also counterproductive, because Cubic would slow down the connection.

This experiment and the previous one outlined that “very high bandwidth very low latency” connections are a special case. The code is close to saturating the CPU, which forces a reduction in the ACK rate, with the average delay between acknowledgements often larger than the actual min. The variations of delay measurement for an application level transport are quite high, which forces us to not use delay measurements to detect queues. The main control variable is then the rate of packet losses, which ended up between 0.4 and 0.5% in the experiments.

It would be nice if the interface could implement ECN markings, so congestion control could react to increased queues before losing packets. Failing that, it might make sense to monitor the “bottleneck data rate”, not necessarily to deduce a pacing rate like BBR does, but at least to stop increasing the congestion window if the data rate is not increasing. But that’s for further research.

Random loopback delays cause BBR slowdowns.

Posted on 13 Oct 2024

I did a quick experiment last week, measuring performance of picoquic on a loopback interface. The goal was to look at software performance, but I wanted to tease out how much of the performance depends on CPU consumption and how much depends on control, in particular congestion control. I compared connections using Cubic and my implementation of BBRv3. I observed that BBRv3 was significantly slower than Cubic. The investigation pointed out the dependency of BBRv3 on RTT measurements, which in application level implementations contain a lot of jitter caused by process scheduling during calls to OS APIs. This blog presents the problem, the analysis, the first fixes, and the open question of bandwidth measurement in low latency networks.

figure 1: loopback data rate using BBR and Cubic

The result of the loopback tests are displayed in figure 1. I ran 16 QUIC connections between a picoquic client and a picoquic server on my laptop (Dell XPS 16, Intel(R) Core(TM) Ultra 9 185H 2.50 GHz, Windows 11 23H2) for each of BBR and Cubic, downloading 1GB from the server using the PERF application protocol. The figure shows the observed values of the data rate in Mbps, 16 numbers for each congestion control protocol, organized for each protocol in a line from lowest to highest. The first tests showed that BBR is markedly slower, averaging 1.7Gbps versus 3 Gbps for Cubic. The BBR data rates are much more variable, between 1.1 and 2.4 Gbps, versus between 2.6 and 3.8 Gbps for Cubic.

To analyze the results, I had to add “in memory” logging to the picoquic stack. The current logging writes on disk, which impacts performance, and thus cannot easily be used to understand performance at high speed. I instrumented the code to log a bunch of variables in a large table in memory. That requires allocating a couple 100MB at program start for logging one connection, so it is not something that should be done in production. But it does work on single connections, and the impact on perf is OK. I dug into 2 connections: one using Cubic, getting 3Gbps on loopback on Windows; another using BBR and getting 2Gbps.

The Cubic connection is easy to understand: Cubic basically goes out of the way. The CWND is always much larger than the bytes in transit, the pacing rate is very high. The sender ends up CPU bound. On the other hand, the analysis of the BBR logs showed that the sending rate was clearly limited by the pacing rate set by BBR.

figure 2: Traces of a BBR loopback 
connection

The figure 2 shows the evolution of the pacing rate (dotted orange line) and the measured throughput (dark green line) over 4 seconds of connection. The figure also shows the RTT measurements (downsampled from 700000 to about 20000) in microseconds at the bottom of the graph, and the evolution of the BBR internal state as a light blue line. The graph shows that throughput is clearly limited by the pacing rate, but that the pacing rate varies widely. This was surprising, because BBR v3 normally alternates between probing phases and long periods in “cruise” state. The graph also shows the variations of the RTT, with RTT spikes present at about the same times as the transitions in the pacing rate.

figure 3: RTT histogram for loopback connection

The figure 3 displays an histogram of the RTT samples throughout the connection. The values are widely dispersed, from a minimum of 20 microseconds to outliers up to 1400 microseconds. Congestion controls based on delay measurements assume that increases in delay indicate queues building up in the network, and should be taken as a signal to slow down the connection. However, the graph on figure 2 does not show a clear correlation between sending rate and queue buildup.

Picoquic runs in the application context. Receiving and acknowledgement requires waiting for an event until the asynchronous receive call is complete, and then receiving the data. Previous measurements show that the delay between raising an event on a thread and return from waiting on another is usually very short, maybe a few microseconds, but can sometime be delays by tens of microseconds, or maybe hundreds. These delays will happen randomly. If the path capacity is a few hundred megabits or less, these random delays are much lower than the delays caused by transmission queues. But in our case, these delays can have a nasty effect.

The BBRv3 implementation in picoquic includes improvements to avoid queue buildups and maintain the responsiveness of real time application, see this draft. Two of the proposed improvements are “exiting startup early if the delays increase” and “exiting the probeBW UP state early if the delay increases”. This works quite well on most long distance Internet connection, but in our case, they become “exit the probe state randomly because the measure of delay increased randomly”. Not good.

figure 4: Comparing updated BBR to original
BBR and Cubic

The solution was to disable these two “delay based” tests in BBR if the minimum delay of the path is very short. As a first test, we define “very short” as “less than 128 microseconds”. We may have to review that value in the future, but figure 4 shows that disabling the two tests results in a marked improvement. Figure 4 is identical to figure 1, except for an extra line showing 16 tests with our improved version of BBR. This version achieves data rates between 1.4 and 3.5 Gbps, and is on average 50% faster than the unmodified version. Lots of results for changing just two lines of code, but as usual the problem is to find out which lines…

The results are improved, but BBR is still about 15% slower than Cubic in this scenario. I suspect that this is because the randomness in delay measurements also causes randomness in the bandwidth measurements, and that this randomness somehow causes slow-downs. In figure 2, we see periods in which BBR is in “cruise” state, yet the pacing rate suddenly drops. This could be caused by BBR receiving an abnormally slow measurement and dropping the pacing rate accordingly. The next bandwidth measurements are limited by that new pacing rate, so the rate has effectively “drifted down”. Another problem that we will need to investigate…