Christian Huitema's Latest Posts

The list of past posts including those previously published on Wordpress is available here.

QUIC, Multipath and Congestion Control

Posted on 24 Mar 2025

The IETF draft defining the multipath extension for QUIC is almost ready, but there was an interesting debate during the IETF meeting in Bangkok about its status. One of the reasons is that we have only a few implementations, and not all that much deployment in production. The other reason is that congestion control for multipath is still somewhat experimental. The draft is carefully worded to only use stable references, merely following RFC9000 and stating that there should be an independent congestion controller for each path, meaning that a QUIC multipath connection with N paths will be equivalent to as many single path connections. That seems reasonable, but it is in fact a bit of a low bar. For example, RFC6356 sets three more ambitious goals:

- Goal 1 (Improve Throughput): a multipath flow should perform at least as well as a single path flow would on the best of the paths available to it.
- Goal 2 (Do no harm): a multipath flow should not take up more capacity from any of the paths it uses than if it were a single flow using only that path.
- Goal 3 (Balance congestion): a multipath flow should move as much traffic as possible off its most congested paths, subject to meeting the first two goals.

These goals are implicitly adopted in the latest specification for Multipath TCP in RFC8684.

The QUIC extension for multipath only mentions a subset of these goals, stating that congestion control must be per-path, as specified in RFC9000. The guiding principle is the same as Goal 2 of RFC6356: ensuring that a QUIC multipath connection using multiple paths will not use more resources than a set of independent QUIC connections running on each of these paths.

We could have a philosophical discussion about what constitutes “fairness” in multipath. On one hand, we could consider that a connection between two endpoints A and B should be “fair” versus any other connection between these points. If it can use Wi-Fi and 5G at the same time, that means it should not be using more resources than the better of Wi-Fi and 5G. On the other hand, users can argue that they are paying for access to both networks, and should be able to use the sum of the resources provided by these two networks. The first approach requires users to be cooperative, the second accepts that users will be selfish. In practice, it is probably better to engineer the network as if users were selfish – because indeed they will be. But even if we do accept selfishness, we still have at least two congestion control issues: shared bottlenecks and simultaneous backups.

Suppose that a QUIC connection uses two paths, defined by two sets of source and destination IP addresses. In practice, QUIC paths are initiated by clients, and the two paths are likely to use two different client IP addresses, but a single server address. We could easily find configurations in which the server-side link is the bottleneck. In that case, the congestion controllers on the two paths could be playing a zero-sum game: if one increases its sending rate, the other one will experience congestion and will need to slow down.

The counter-productive effects of this kind of shared path competition can be mitigated by “coupled congestion control” algorithms. The idea is to detect that multiple paths are using the same bottleneck, for example by noticing that congestion signals like packet losses, ECN marks or delay increases are happening simultaneously on these paths. When that happens, it might be wise for the congestion controllers on each path to cooperate, for example by increasing their sending rates more slowly than if they were alone.
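
As a rough illustration, here is a minimal sketch of that coupling idea, with hypothetical structures, names and thresholds (this is not picoquic's actual code): when two paths report congestion signals within a short window of each other, assume a shared bottleneck for a while and grow their windows more cautiously.

```c
/* Minimal sketch of shared-bottleneck detection by correlated congestion
 * signals. Structures, names and thresholds are purely illustrative;
 * this is not picoquic's actual code. */
#include <stdint.h>
#include <stdlib.h>

#define COUPLING_WINDOW_US  25000    /* signals closer than 25 ms count as simultaneous */
#define COUPLING_HOLD_US  2000000    /* treat paths as coupled for 2 s afterwards */

typedef struct {
    uint64_t last_congestion_us; /* time of last loss, ECN-CE mark or delay spike */
    uint64_t coupled_until_us;   /* while now < this, grow the window more slowly */
    uint64_t cwnd;               /* congestion window in bytes, assumed > 0 */
} path_cc_t;

/* Called when path `i` sees a congestion signal at time `now_us`. */
void on_congestion_signal(path_cc_t *paths, size_t n_paths, size_t i, uint64_t now_us)
{
    paths[i].last_congestion_us = now_us;
    for (size_t j = 0; j < n_paths; j++) {
        if (j != i && now_us - paths[j].last_congestion_us < COUPLING_WINDOW_US) {
            /* Two paths signaled congestion at about the same time:
             * assume a shared bottleneck for a while. */
            paths[i].coupled_until_us = now_us + COUPLING_HOLD_US;
            paths[j].coupled_until_us = now_us + COUPLING_HOLD_US;
        }
    }
}

/* Reno-style additive increase, halved while the path looks coupled. */
void on_ack_received(path_cc_t *path, uint64_t bytes_acked, uint64_t mtu, uint64_t now_us)
{
    uint64_t increase = bytes_acked * mtu / path->cwnd;
    if (now_us < path->coupled_until_us) {
        increase /= 2; /* grow more cautiously when sharing a bottleneck */
    }
    path->cwnd += increase;
}
```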

The experimental algorithm proposed in RFC6356 is an example of that approach. It is a variation of the New Reno algorithm. Instead of always increasing the congestion window for a flow by “one packet per RTT per path”, it would use the minimum of that and something like “one packet per RTT across all paths”. (The actual formula is more complex, see the RFC text for details.)
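
For reference, the per-ACK window increase of the RFC6356 “Linked Increases Algorithm” on subflow i is the following, with cwnd_total the sum of the congestion windows of all subflows:

```latex
% RFC 6356: for each ACK received on subflow i, increase cwnd_i by
\Delta\,\mathrm{cwnd}_i = \min\!\left(
    \frac{\alpha \cdot \mathrm{bytes\_acked} \cdot \mathrm{MSS}_i}{\mathrm{cwnd\_total}},\;
    \frac{\mathrm{bytes\_acked} \cdot \mathrm{MSS}_i}{\mathrm{cwnd}_i}
\right)

% with the coupling factor alpha defined as
\alpha = \mathrm{cwnd\_total} \cdot
    \frac{\max_i\left(\mathrm{cwnd}_i / \mathrm{rtt}_i^{\,2}\right)}
         {\left(\sum_i \mathrm{cwnd}_i / \mathrm{rtt}_i\right)^{2}}
```

The second term is the ordinary New Reno increase; the first term caps the aggregate growth of all subflows, which is what couples them.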

I have two issues with that: it seems a bit too conservative, and in any case it will only be efficient on those paths where New Reno is adequate. Most Internet connections use more modern congestion control algorithms than New Reno, such as Cubic or BBR, and we would have to develop an equivalent coupling algorithm for these algorithms. We can also notice that the formula slows the congestion window increase even if the paths are not coupled, which is clearly not what selfish users would want. In short, this is still a research issue, which explains why the QUIC multipath draft does not mandate any particular solution.

Another issue with multipath is the “simultaneous backup” problem. Many multipath configurations aim for redundancy rather than load sharing. They maintain an active path and a backup path, sending most of the traffic on the active path, and only switching to the backup path if they detect an issue. For example, they would use a Wi-Fi connection until it breaks, and then automatically start sending data on a 5G connection. The problem happens when multiple connections on multiple devices do that at the same time. They were all using the same Wi-Fi network, they all detect an issue at about the same time, so they all switch to using the 5G network around the same time. That’s the classic “thundering herd” problem – an instant surge of traffic causing immediate congestion as all these connections compete for the same 5G radio frequencies.

The “thundering herd” problem will solve itself eventually, as all connections notice the congestion and reduce their sending rate, but it would be nice if we avoided the packet losses and increased delays when these “simultaneous backups” happen. The classic solutions are to introduce random delays before backing up, and to probe bandwidth cautiously after a backup, as sketched below. Again, this is largely a research issue. My main recommendation would be for networks to implement something like L4S (see RFC9330), so that each connection receives “congestion experienced” ECN marks and reacts quickly.
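
Here is a minimal sketch of those two classic mitigations, with purely illustrative names and constants (not picoquic's API): delay the switch to the backup path by a random jitter, and restart that path's congestion controller in a conservative state.

```c
/* Minimal sketch of the two classic mitigations: random jitter before
 * switching to the backup path, and a cautious restart of congestion
 * control on that path. Names and constants are illustrative, not
 * picoquic's API. */
#include <stdint.h>
#include <stdlib.h>

#define MAX_FAILOVER_JITTER_US 200000      /* spread failovers over up to 200 ms */
#define INITIAL_WINDOW_BYTES  (10 * 1500)  /* restart from a small initial window */

typedef struct {
    uint64_t cwnd;       /* congestion window in bytes */
    uint64_t ssthresh;   /* slow start threshold in bytes */
    int      is_active;  /* 1 if this path is currently carrying traffic */
} backup_path_t;

/* Called when the active path is declared broken at time `now_us`.
 * Returns the time at which the backup path should start sending. */
uint64_t schedule_failover(backup_path_t *backup, uint64_t now_us)
{
    /* Random jitter de-synchronizes the "thundering herd" of failovers. */
    uint64_t jitter_us = (uint64_t)rand() % MAX_FAILOVER_JITTER_US;

    /* Probe the backup path cautiously: a small initial window and a low
     * slow start threshold keep the first ramp-up gentle. */
    backup->cwnd = INITIAL_WINDOW_BYTES;
    backup->ssthresh = 4 * INITIAL_WINDOW_BYTES;
    backup->is_active = 1;

    return now_us + jitter_us;
}
```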

So, yes, we do have a couple of research issues for congestion control and multipath…

Picoquic test coverage now at 90%

Posted on 04 Dec 2024

A couple of weeks ago, I got a message from researchers who had found a couple of pretty nasty bugs in Picoquic. These were quickly fixed, but they gave me a bad feeling. In theory, the core functions of Picoquic were well tested, and bugs of this kind had been fixed a long time ago. But the code went through quite a bit of churn as more functionality like multipath, ACK frequency and Web Transport was implemented, and expectations also changed. Was the testing still sufficient? In a word, no. Measurements using gcov and gcovr showed that for the main libraries, the test coverage stood at 84.5% of lines and 72.5% of branches. Some parts were well tested, but many of the new additions had only minimal coverage. The two bugs found by the researchers were in the HTTP3 parsing of queries, in code that was initially meant as a quick demonstration but that most users ended up using as is. It was time to suspend further development of Web Transport, Media over QUIC or P2P QUIC until the code quality had improved.

The effort concentrated on testing three main libraries:

In total, that’s 32,479 lines of code. I used the results of gcov to pick the worst offending areas one by one. For each of those, I wrote additional tests to reach the uncovered areas, and sometimes managed to remove existing code that was not useful in our scenario – at the end of the exercise, the code was 345 lines smaller, or 1.07%. In total, I did 12 PRs over the course of 3 weeks, before hitting the goal of at least 90% line coverage and 75% branch coverage. The list is available in GitHub issue 1781 in the picoquic repository, summarized in this table:

Milestone                                Cover lines   Cover functions   Cover branches
As of 2024/11/20                         84.5%         89.5%             72.5%
Siduck removal                           84.8%         89.8%             72.8%
Test getters and setters in the API      85.4%         91.6%             73.4%
More frame parsing / formatting tests    86.2%         92.0%             74.3%
Fix the spinbit tests                    86.2%         92.1%             74.3%
Add tests for BBRv1                      86.8%         92.7%             74.9%
Add tests for TLS API                    87.3%         93.4%             75.2%
H3Zero Client Data tests                 87.6%         93.6%             75.6%
Test of transport parameter code         87.7%         93.8%             75.7%
Demo client tests                        88.2%         93.9%             76.2%
Test of logging functions                89.6%         95.7%             76.9%
Test of Web Transport & capsule          89.8%         95.8%             77.0%
Remove unused packet code                89.9%         95.8%             77.2%
Prune and test utility functions         90.1%         96.1%             77.3%

I tried to avoid writing tests that merely targeted the coverage metric. In some cases, that’s unavoidable. For example, the API includes a set of “getter” and “setter” functions, so applications can check variables without building a dependency on the implementation of the protocol. These are simple functions, but they do count in test coverage, so I ended up writing a bunch of simple tests for them. In general, though, it is better to perform the test through a high level API. Such a test exercises not only the code in the targeted branch, but also the impact of that behavior.

I kept the exercise focused on writing tests, changing the code as little as possible. Inspecting the code, it is obvious that some areas need improvement and possibly refactoring – particularly the way congestion control algorithms are written, see issue 1794, and the structure of the HTTP3 and Web Transport code, see issue 1795. But doing test improvement and code refactoring at the same time is risky, and could take much longer than expected.

The coverage of branches at the end of the exercise remains at 77.3%. Many of the uncovered branches can only be triggered by a failure in memory allocation. There are known techniques for testing that, essentially replacing malloc by a test version, but I need to find a way to integrate this special version of malloc in the existing stress tests. An example would be driving the load on a server, accepting that some connections to the server will fail, but expecting that at least some will succeed and that the server will return to normal when the load diminishes. The problem is to isolate the malloc failures so they happen on the server but not on the client, knowing that in our tests client and server run the same QUIC code in the same process. Suggestions are welcome.
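
For illustration, here is a minimal sketch of such a test allocator, with hypothetical names (this is not picoquic's test framework): allocations go through a wrapper that can be told to fail every Nth call, and an enable flag suggests one way to restrict the failures to the component playing the server role.

```c
/* Minimal sketch of a fault-injecting allocator, with hypothetical names
 * (this is not picoquic's test framework). The code under test would be
 * compiled or configured to call test_malloc() instead of malloc(). */
#include <stdlib.h>

static unsigned long alloc_count = 0;
static unsigned long fail_every = 0;        /* 0 means "never fail" */
static int inject_faults_for_server = 0;    /* only fail server-side allocations */

void test_malloc_configure(unsigned long every, int server_only)
{
    alloc_count = 0;
    fail_every = every;
    inject_faults_for_server = server_only;
}

/* `is_server_side` would be derived from the caller's context, which is
 * the hard part when client and server share the same process. */
void *test_malloc(size_t size, int is_server_side)
{
    alloc_count++;
    if (fail_every != 0 && (alloc_count % fail_every) == 0 &&
        (!inject_faults_for_server || is_server_side)) {
        return NULL; /* simulated out-of-memory */
    }
    return malloc(size);
}
```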

Did I find bugs? Of course. At the beginning of the exercise, I was concerned that merely driving the test metric may not improve the code all that much. But in practice it did. Many small bugs, but also a few big ones, the kind that lead to crashes or loops. Overall, it convinced me that not enforcing these test metrics earlier was a mistake!

Thanks to QUICTester/QUIC-Fuzz for finding 2 bugs in Picoquic

Posted on 17 Nov 2024

This is a story about bugs, but also about testing. On Tuesday, November 12, I received an email from a research group investigating how to test QUIC protocol implementations. During their tests they identified two bugs in Picoquic, both involving the parsing of HTTP3 protocol headers. One bug involved reading one byte out of bounds, the other dereferencing a NULL pointer. These bugs are now fixed in the latest version of Picoquic (version 1.1.28.0, commit ce9bdb709657c10e8b2eb0233a3d31c365a2a07c).

The team found these bugs using a QUIC-specific fuzzer, QUICTester/QUIC-Fuzz, which is described very briefly as a grey-box mutation-based fuzzer. They will not provide more details for now, and I understand that. Academic researchers like them have to build a record of publication. You would not want to conduct lengthy research and have reviewer number 2 reject your submission “because the results are already well known”. I will thus refrain from speculating about how they built it. My only comment is that their tool works, and that they could effectively test implementations of encrypted protocols.

I used simpler techniques to reproduce the bugs in my implementation. For each bug, I wrote or improved a unit test that directly fuzzed the internal API responsible for the bug – a much simpler task than testing a whole protocol implementation. Once I had a stable way to reproduce the bugs, the fixes were relatively simple: tightening the bounds checks during the parsing of QPACK messages to prevent the out-of-bounds access, and better controlling memory allocation when receiving HTTP3 messages from the peer to prevent dereferencing a NULL pointer.
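
The unit-level fuzzing I mean is nothing fancy. A minimal sketch, with a hypothetical parser name standing in for the internal API under test: feed random byte strings of random lengths to the parser and verify, typically under AddressSanitizer, that it never crashes or reads out of bounds.

```c
/* Minimal sketch of a unit-level fuzz test: feed random byte strings to
 * an internal parsing API and check that it never crashes or reads out
 * of bounds (running under AddressSanitizer catches the latter). The
 * parser name is hypothetical, standing in for the internal API under test. */
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical internal API under test: returns 0 on success, -1 on error. */
extern int hypothetical_qpack_parse(const uint8_t *bytes, size_t length);

int fuzz_parser_test(unsigned int seed, int nb_trials)
{
    uint8_t buffer[512];

    srand(seed); /* a fixed seed keeps any failure reproducible */

    for (int i = 0; i < nb_trials; i++) {
        size_t length = (size_t)(rand() % (int)sizeof(buffer));
        for (size_t j = 0; j < length; j++) {
            buffer[j] = (uint8_t)(rand() & 0xff);
        }
        /* The parser is allowed to reject the input, but it must not
         * crash or read past `length`. */
        (void)hypothetical_qpack_parse(buffer, length);
    }
    return 0;
}
```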

Of course, I should have done this kind of API fuzzing and rooted out these bugs long ago. The Picoquic test suite does include systematic tests like that for the QUIC level messages, but the HTTP3 level tests were not quite as extensive. My preferred metaphor is that exploring the code for potential bugs is like exploring a dark basement with a flashlight. Each test in the test suite shines light on a small part of the basement, finding pockets of dirt and the nests of undesirable critters. But it takes very many flashlights before everything is explored, and it takes writing many more tests until most issues are rooted out. All help with that is welcome!

Going Peer to Peer with QUIC

Posted on 25 Oct 2024

Marten Seemann and I just published an entry on his blog, A p2p Vision for QUIC. This is a technical argument, explaining how a series of small improvements to the QUIC standards would suffice to make QUIC a great peer-to-peer (P2P) protocol. I personally love it, because the Internet suffers greatly from excessive centralization, with big platforms acting as gatekeepers.

Currently, pretty much all user activities on the Internet are mediated by “services in the cloud”. Take for example something as simple as a security camera in your home. People want to check the camera images when they are outside the home. They cannot connect to the camera or even a home server directly, because the home network is “behind a NAT” – and very possibly behind multiple layers of NAT, given the scarcity of IPv4 addresses. The practical solution is to export the images from the camera to a “service in the cloud”, and to let users contact that service from their cellphones. It works, but it has obvious issues for privacy, security, and even continuity of service: if the cloud service stops, the camera gets bricked.

The proposed QUIC architecture would provide a generic solution using the Masque standard for relaying QUIC, combined with QUIC NAT traversal extensions to migrate connections from “using the relay” to “directly P2P”, as shown in the following diagram.

            +------------------------------+
            |    Step 1: phone contacts    |
            v    the camera through    +---|---+
+-----------+-+  the relay             | Phone |
| Masque    ^ |                        +---|---+
| Relay     | |                            |
+-----------|-+  +-------------------------+
            |    |    Step 2: phone and camera cooperate
      +-----|----|-+  and establish a "hole" in the NAT.
      | NAT |    | |
      +-----|----|-+  Step 3: traffic flows directly
            |    |    between phone and camera without
          +-|----|-+  burdening the relay
          | Camera |
          +--------+

The standards to achieve that are almost ready. The relaying of incoming connections is described in the UDP Listen draft, already adopted by the Masque Working Group. A draft submitted to the IETF QUIC Working Group describes Using QUIC to Traverse NAT, and another describes QUIC Address Discovery. I have already implemented the Address Discovery draft in picoquic, and intend to promptly implement the other two drafts.

Using QUIC has many advantages. The communications are encrypted end to end, even when they are going through the relay, which has nice security and privacy benefits. The relay uses standards and is not dependent on the application; if a relay goes out of business, just pick another one. In the weird cases where NAT traversal fails, the application still works through the relay.

There are of course many more possible P2P applications, such as direct video calls between two homes, video games, and more. The vision is clear: we can do all of those applications using QUIC.

Loopback ACK delays cause Cubic slowdowns

Posted on 23 Oct 2024

The first investigation of loopback performance focused on the randomness of the RTT measurements on loopback; the fix for BBR was to not make congestion control decisions based on delays when the RTT is very small. After that, the results “looked good”, although there was quite a bit of variability. The tests were made by sending 1GB of data. Looking at traces, it seemed that many connections were starting to slow down at the end of the 1GB transfer. I thus repeated the experiments, this time sending 10GB of data instead of 1GB. That was quite instructive: the data rate of a Cubic connection dropped by a factor of 2!

Figure 1: Comparing loopback data rate for 1GB and 10GB

Figure 1 shows the results. Each line corresponds to 16 connections, downloading either 1GB or 10GB between a picoquic client and a picoquic server on my laptop (Dell XPS 16, Intel(R) Core(TM) Ultra 9 185H 2.50 GHz, Windows 11 23H2), for each of BBR and Cubic, using the PERF application protocol. The Cubic line for 10GB transfers shows data rates varying from 1.3 to 1.8 Gbps, a stark contrast with the 1.7 to 3.5 Gbps observed on the 1GB downloads.

Checking the statistics for the various connections did not show any excessive rate of packet retransmission, but it did show that the client sent only about 50,000 acknowledgements per connection, which translates to an acknowledgement every 1.1ms. That does not compare well with the average RTT of 65 microseconds. These long intervals between acknowledgements could be traced to the “ACK Frequency” implementation: the value of the “min ACK delay” was set to 1ms, and the value of the “max gap” was set to 1/4th of the congestion window. Detailed traces showed that this congestion window was initially quite large, which explained why ACKs were so infrequent. Infrequent ACKs mean that the Cubic code was not getting feedback quickly enough, and reacted by slowing down over time.

I tried a series of more or less complex fixes, but the simplest and most effective one was to cap the ACK gap value at 32 if the minimum RTT was smaller than 4 times the ACK delay. As you can see in figure 2, the effect on the Cubic performance was drastic. The average data rate jumped to 3.4 Gbps, and the variability was much reduced, from 3.3 to 3.5 Gbps. The effect on BBR was also positive, but less marked.
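
A minimal sketch of that ACK gap computation, following the description above (the names are illustrative, not the exact picoquic code):

```c
/* Minimal sketch of the ACK gap computation described above; the names
 * are illustrative and this is not the exact picoquic code. */
#include <stdint.h>

#define ACK_GAP_CAP 32

uint64_t compute_ack_gap(uint64_t cwnd_bytes, uint64_t packet_size_bytes,
                         uint64_t min_rtt_us, uint64_t min_ack_delay_us)
{
    /* Default rule: allow skipping up to 1/4th of the congestion window,
     * expressed in packets (packet_size_bytes assumed non-zero). */
    uint64_t ack_gap = cwnd_bytes / (4 * packet_size_bytes);

    if (ack_gap < 2) {
        ack_gap = 2;
    }

    /* Fix from this post: on very low latency paths, where the min RTT
     * is smaller than 4 times the ACK delay, cap the gap so the sender
     * keeps getting frequent feedback. */
    if (min_rtt_us < 4 * min_ack_delay_us && ack_gap > ACK_GAP_CAP) {
        ack_gap = ACK_GAP_CAP;
    }
    return ack_gap;
}
```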

Figure 2: Loopback data rate before and after capping the ACK gap to 32

An ACK gap of 32 corresponds to about 100 microseconds between ACKs in the Cubic trials. This is still much larger than the min RTT, but close to the average smoothed RTT. Trying to get more frequent acknowledgements was counterproductive, probably because processing more acknowledgements increased the CPU overhead. Trying higher values like 64 was also counterproductive, because Cubic would slow down the connection.
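
As a rough sanity check, assuming packets of about 1440 bytes at the observed 3.4 Gbps:

```latex
% 32 packets of ~1440 bytes at 3.4 Gbps:
\frac{32 \times 1440 \times 8\ \text{bits}}{3.4 \times 10^{9}\ \text{bit/s}} \approx 108\ \mu\text{s}
```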

This experiment and the previous one showed that “very high bandwidth, very low latency” connections are a special case. The code is close to saturating the CPU, which forces a reduction in the ACK rate, with the average delay between acknowledgements often larger than the actual min RTT. The variations of delay measurements for an application level transport are quite high, which forces us to not use delay measurements to detect queues. The main control variable is then the rate of packet losses, which ended up between 0.4 and 0.5% in the experiments.

It would be nice if the loopback interface could implement ECN marking, so congestion control could react to growing queues before losing packets. Failing that, it might make sense to monitor the “bottleneck data rate”, not necessarily to deduce a pacing rate as BBR does, but at least to stop increasing the congestion window if the data rate is not increasing. But that’s for further research.