Christian Huitema's Latest Posts

Cloudy sky, waves on the sea, the sun is 
shining

The list of past posts including those previously published on Wordpress is available here.

The Retire Connection ID stuffing Attack against QUIC

Posted on 12 Mar 2024

Back in December, Marten Seeman found an attack against QUIC. Malevolent clients could exploit the “path validation” mechanism to create large queues of “Path Response” messages, eventually saturating the server”s memory. Back then, it turned out that the picoquic implementation was not vulnerable, because it only responded to the last challenge received. Marten has since found another similar issue, this time exploiting the handling of “NEW CONNECTION ID” frames. And this time, I did have to fix the picoquic code.

QUIC was designed with security in mind. There were lots of reviews, lots of discussions, and the development of formal proofs that the security of QUIC is equivalent to that of TLS 1.3. There are known limit to that security: on path observers can cause the initial handshake to fail before the session keys are fully negotiated; they can “fingerprint” the encrypted traffic and match it to known patterns; on path attackers can drop packets or mess with IP headers. These are very hard issues, the kind that will need serious efforts to resolve – QUIC is not better than TLS1.3 in that regard. But outside of that, most attacks against QUIC are attacks against implementations that did not correctly implement the specification. Marten’s attacks are different, because they are based on the QUIC specification itself, and affect potentially every implementation.

QUIC packet headers include a “connection identifier” that allows node to link packets to connection contexts. The connections ID are created by the QUIC node that will eventually decode them in packet headers: the client uses connection IDs provide by the server in the packets sent to the server, and vice versa. To facilitate that, each node sends “NEW CONNECTION ID” frames to its peer. And to control the use of resource, both nodes tell their peer how many “NEW CONNECTION ID” they are willing to receive – trying to send more than that triggers a protocol violation. When a node does not need an old connection ID anymore, it sends a “retire connection ID” frame, and the peer will provide a new one to replace it.

But there is a catch. The servers and clients need to maintain a list of valid connection identifiers so they can successfully direct packets to connection contexts. In big server farms, this is done by the load balancers, and it is often done using encryption and decryption: the connection ID typically contains bytes encrypting a server identifier and a random number, with a key known to the load balancer, see for example the QUIC-LB draft. These keys are rotated regularly, and when an old key is discarded, the corresponding connection ID need to be discarded and new ones need to be provided. The new connection IDs will be carried in “NEW CONNECTION ID” frames, with an attribute asking to retire connection IDs with a sequence number lower than a “retire prior” value.

This is the mechanism exploited in the attack. The malevolent client will send a series of “NEW CONNECTION ID” frames saying something like “this is connection ID number N, please retire connection ID number N-1”. The server will accept the frame because once the previous connection ID has been retired, the total number of connection ID provided by the peer remains at or below the maximum number allowed. The server will also send a “RETIRE CONNECTION ID” frame confirming that the old connection ID number “N-1” has been retired.

That, by itself, does not sound too bad, but wait. The client sends a series of “NEW CONNECTION ID” frames to force the server to send a series of as many “RETIRE CONNECTION ID” frames. Clients have many other ways to cause the server to send traffic, such as for example requesting web pages. But all other types of traffic are regulated by flow control mechanisms, which limit how much data will be queued. In contrast, there is no limit to the number of “RETIRE CONNECTION ID” frames that could be queued.

Still, these frames are very small, so they would normally not be queued for very long. The attack only works if the client manages to slow down the server, for example by only acknowledging a fraction of the packets that the server sends. This will cause the congestion control mechanisms to reduce the congestion window, dropping eventually to a minimum size like 1 or 2 packets per RTT. Each such packet would carry several hundred “RETIRE CONNECTION ID” frames. The queues will start to build up if the client sends thousands of “NEW CONNECTION ID” frames per RTT. There are certainly network scenarios in which that is doable. In those scenarios, the queues will build up, and the process memory will keep growing. Do that long enough and the server will run out of memory, especially if it is a small server, such as an embedded system.

The discussion above only considers the sized of the queued frames, which is the only effect per specification. But implementation choices may make servers more sensitive to the attack. For example, some servers may keep the memory allocation for a connection ID until the “RETIRE CID” frame has not just been sent, but also acknowledged by the peer. In that case, the attack will cause much larger memory allocations. And then, the code handling connection IDs may be designed for the small number of CID expected, but the attack could increase the size of the table, the handling would be inefficient, and the CPU load will increase.

The attack is not hard to mitigate: just limit the number of “RETIRE CONNECTION ID” frames that the server is willing to queue, detect an anomaly and break the connection if the client sends more than that. But it is yet another example that designing network protocols is hard. Dozens of engineers have reviewed the QUIC specification, yet Marten found the issue three years after the specification was complete. We have seen reviewed the specification again and could not find other similar issues. But I guess that we have to keep looking!

Oh, and if you are building a project using picoquic, the fix for that issue was merged on March 12 at 18:00 UTC. Please update to a recent version!

The long slow path to QUIC multipath

Posted on 28 Feb 2024

I started working on QUIC multipath in 2017, see this old draft. 7 years and 21 drafts later, we have still not converged on a solution. We thought we were close last November with version 6 of the IETF draft, but no, we had to go back to the drawing board.

At this point, my implementation of QUIC in picoquic supports three ways to implement multiparty:

The simple version made just the minimal set of changes to RFC 9000 to allow transmission over multiple paths in parallel. All application data packets were numbered from the same number space, regardless of the transmission path. This leaves the protocol simple, which is very good if we only need to support simple scenarios like “make a transition to the new path before breaking the old one.” On the other hand, if the scenario requires high speed transmission on parallel paths, implementers need to be very careful in the way they handle acknowledgements and detect packet losses. With enough care, the performance is within a few percent of the more complex solutions, but there is still a hole: we would need more work to properly track ECN marks per path. Most developers were convinced that we needed instead a solution that manages multiple number spaces, so we can keep the ACK management and loss discovery algorithms of RFC 9000, without extra complexity.

Using multiple number spaces means that packets sent on a path could be numbered “naturally” after the order of transmissions on that path. That means we could have packets 1, 2, 3, 4, etc. on Path A, and different packets with the same numbers of path B. Loss recovery is simpler, but we have a problem with encryption. QUIC relies on AEAD, AEAD encryption requires unique “nonce” per packet, and RFC 9001 solves that by using the 64 bit packet number of QUIC as a nonce. But if we have different packets with the same numbers, the nonce is not unique anymore and simple tricks can be used to guess the contents of packets with colliding nonce. The solution is to build a 96 bit nonce that combines 64 bits of sequence number and up to 32 bits of path identifier.

The version 6 draft does just that. The design starts by observing that all QUIC packets start with a “Connection Identifier” (CID) — a string of bytes that are unique to a connection. In principle, a CID is used on only one path, if only for privacy reasons. The CIDs are allocated by one node and sent to the other in “New Connection ID” frames, and are identified by a “CID sequence number”. We can use that sequence number to identify a path, but when we do that we are also using it to identify a packet number space. That works, but require having data structure in which packet numbers and ACK management are “per CID”, which is only an approximation of “per path”. It is a pretty good approximation and it works well most of the time, but there is a corner case.

QUIC nodes will often perform “CID renewal” for privacy reasons. Since the CID value is carried in each packet, renewing the CID breaks the easy correlation between an old series of packets with the old CID, and the new series with the new one. That’s a gain for privacy. The draft 6 handles that easily: when a packet arrives with a new CID but the same source and destination IP addresses and port numbers as some previously defined path, we can handle that as “CID renewal”. We will start a new packet sequence number, but we can retain old path properties like round trip time (RTT) measurements and congestion control state.

The paths carrying QUIC packets sometimes incorporate Network Address Translation devices (NAT). These NAT map an incoming IP address and UDP port to different outgoing values, a relation described as “port mapping”. In principle, the port mappings remain stable long enough for most connections, but sometimes they change in the middle of a connection. QUIC nodes detect a mapping change when they receive packets with the same CID as an old path, but a different IP address and port. And then they apply “NAT rebinding” rules to continue the connection while keeping the timing and congestion control information of the path.

But what if the CID renewal happens at the same time as a NAT rebinding? This is not a theoretical question, because nodes will want to renew the CID when there is a significant time gap between the old series of packets and the new one, and time gaps are exactly when NAT may lose the port mappings. The simple “CID based” path identification breaks when NAT rebinding and CID renewal happen at the same time. The new series of packets will be treated as a completely new path, requiring new RTT measurements and reinitialization of the congestion control algorithm. Arguably this does not happen very often, and some rare reinitialization of congestion control is not a very big deal. But still, this is a weakness in the design.

Enter the third proposal, which starts by redefining how CIDs are organized. RFC 9000 organizes a series of CIDs with unique sequence numbers within the connection. The new design organizes the set of CID per path, or rather per potential path. Each CID has a path identifier, and a sequence number that differentiates it from other CIDs in the same path. When doing CID renewal, nodes must replace the CID of the path with another with the same path identifier. The encryption nonce is composed of the packet sequence number and the path identifier. The sequence number is per path instead of per CID. If NAT rebinding happens in the middle of a CID renewal, the path identifier derived from the CID tells us that this is the same path, and the node can retain their path information. It is a nice structure, and many path management functions become simpler.

Of course, this has a cost. Path management is simpler, but CID management is more complex. The proposal has to replace the New Connection ID and Retire Connection ID frames with new “Multipath” variants, and to introduce a mechanism to control the flow of New Connection IDs. There are still options to be discussed, such as whether the CID chosen by the sender and the receiver should have the same path number, or the details of NAT Rebinding. Plus there is the cost to implementers — it took me a week to implement the “unique Path ID” proposal in picoquic, and I probably need to work some more to reach the quality of the previous code.

In the old days, the IETF was prone to find a solution that was good enough, ship it, gather experience, and then revise the standard later to fill in the gaps. If we had followed that process, we could probably have published a QUIC Multipath RFC last year, or maybe the year before — the draft 6 was definitely good enough, and the previous draft was probably OK as well. But we have decided instead to discuss all details before approving the draft. The result will probably be better, although gathering experience sooner would also have helped improve quality. In any case, the process is not quick. As things go, we will be lucky if we converge on a proposal by the next IETF meeting in July 2024, and even November 2025 will be challenging!

QUIC timeouts and Handshake Interop

Posted on 19 Jan 2024

Marten Seemann made a great contribution to QUIC interoperability by setting the QUIC interop runner. The site runs series of interoperability tests between participating QUIC implementations (17 of them when I am writing this) and reports that in a large result matrix. It is a nice complement to the internal tests of the implementations, and it was flagging an interesting issue: the test L1 was failing between ngtcp2 client and picoquic server.

The test codenamed L1 verifies that implementations can successfully establish connections in presence of high packet loss. The test consists of 50 successive connection attempts, followed by the download of a short 1KB document. The connections are run over a network simulation programmed to drop 30% of packets. The test succeeds if all connections succeed and all 50 documents are retrieved.

In the “ngtcp2 to picoquic” tests, all documents were properly downloaded, but the analysis of traffic showed 51 connection attempts instead of the expected 50, and thus the test was marked failing. It took me a while to parse the various logs and understand why this was happening, but it turned out to be a timeout issue. One of the 50 tests ran like this:

Nobody is really at fault here — NGTCP2 behaves exactly as the standard mandates, and it is perfectly legal for the Picoquic server to drop contexts after absence of activity for some period. In fact, servers should to do just that in case of DOS attacks. But explaining to the testers that “we are failing your test because it is too picky” is kind of hard. There was a simpler fix: just configure Picoquic to use longer timers, 180 seconds instead of 30. With that, the context is still present when the finally successful repeat packet arrives. Picouic creates just one connection, and everybody is happy.

But still, Picoquic was using a short handshake timer for a reason: if connections are failing, it makes sense to clean them up quickly. The L1 test between Picoquic client and server was passing despite the short timers, because Picoquic’s loss recovery process is more aggressive than what the standard specifies. The standard specifies a conservative strategy that uses “exponential backoff”, doubling the value of the timer after each failure, for the following timeline:

Time (standard) Number Timeout(ms)
0 1 300
300 2 600
900 3 1200
2100 4 2400
4500 5 4800
9300 6 9600
18700 7 19200
37900 8 38400
76300 9 76800

Picoquic deviates from that strategy, as discussed in Suspending the Exponential Backoff. The timeline is much more aggressive:

Time (picoquic) Number Timeout(ms)  
0 1 250  
250 2 250 Not doubling on first PTO
500 3 500  
1000 4 1000  
2000 5 1875 Cap timer to 1/16th of 30s timer
3875 6 1875  
5750 7 1875  
7625 8 1875  
9500 9 1875  

After configuring the handshake timer to 180 seconds, the Picoquic sequence is still more aggressive than the standard, but the difference is smaller:

Time (picoquic) Number Timeout(ms)  
0 1 250  
250 2 250 Not doubling on first PTO
500 3 500  
1000 4 1000  
2000 5 2000  
6000 6 4000  
10000 7 8000  
18000 8 11250 Cap timer to 1/16th of 180s timer
29250 9 11250  

In our test, it seems that not being much more aggressive than the peer did result in the behavior that the testers expected. In real life, I think that the intuitions developed in the previous blog still hold. It is just that for the test, we have to please the protocol police…

Optimizing QUIC performance

Posted on 12 Dec 2023

I am following the discussions on the IETF mailing lists, and I got engaged in a review of a proposal by Fred templin to better implement packet fragmentation and reassembly at the IPv6 layer. I am not a great fan of that proposal, largely because of the many arguments against IP level fragmentation explained in RFC8900 and the security issues described in RFC7739. I think it is a bad idea to try perform segmentation and reassembly outside the protection of encryption. But then, what about the argument that using larger packet sizes, with or without network level fragmentation and reassembly, would improve performance?

Fred Templin reports in his draft that tests with UDP and TCP performance tools show improvements with larger packet sizes, but that the “qperf” tool does not show any such gain for QUIC, because “‘qperf’ limits its packet sizes to 1280 octets.” Fred then concludes that this is the reason why in his benchmarks QUIC is much slower than TCP or raw UDP. For the record, the Quic Perf protocol specification does not mandate a packet size. Many QUIC implementations do implement path MTU discovery, a.k.a., DPLPMTUD and could use large packet sizes. If they did so, would that improve performance?

Some QUIC implementations have tried have tried larger packet sizes. A prominent example is the “litespeed” implementation, which did describe how to improve QUIC performance with DPLPMTUD. They found that increasing the packet size from 1280 to 4096 results in a sizeable performance gain, from 400Mbps to over 600Mbps. But we should qualify: this gains were observed in their specific implementation. Other implementations are using different approaches.

Processing of packets goes through multiple steps such as preparing QUIC headers, processing and sending control frames, copying data frames, encrypting and decrypting data. We analyze the performance of implementations using tools like “flame graph” to measure which components of the software consume the most CPU. In 2019, everybody pretty much assumed that the most onerous component was encryption. We were wrong, and somewhat shocked to observe that 70 to 80% of the CPU load was consumed in the socket layer, in calls to ‘sendto’ or ‘recvfrom’, or maybe ‘sendmsg’ and ‘recvmsg’.

The study by Alessandro Ghedini documented in a [Cloudflare blog] (https://blog.cloudflare.com/accelerating-udp-packet-transmission-for-quic/) was one of the first to show the importance of the UDP Socket API. The blog discusses three successive improvements:

This series of improvements allowed them to drive the performance of their implementation from 640 Mbps to 1.6 Gbps. (The litespeed values were well below that.)

The implementation of QUIC with the highest performance is probably msquic, by a team at Microsoft.
They publish performance reports in a variety of environments, routinely showing data rates of 6 to 7 Gbps. On top of the improvements described by Alessandro Ghedini, the Msquic implementation uses multiple CPU threads, and pays great attention to details of memory allocation, flow control parameters, or cryptographic APIs, as detailed by Nick Banks in this report.

I followed pretty much the same recipes to drive the performance of the picoquic implementation to between 1.5 and 2Gbps. Contrary to Cloudflare, the picoquic implementation relies only on sendmsg and GSO, not sendmmsg – but I spent a good amount of time studying the interation between congestion control, flow control, pacing, and the formation of packet trains. A difference with msquic is that the picoquic implementation uses a single thread, the idea being that deployment that want higher performance can run several servers in parallel, each in its own thread, and balance the incoming load.

Neither Cloudflare nor Microsoft conditions their performance on changing the packet size. One reason is that even with DPLPMTUD, the packet size is limited by the maximum packet size that the receiver accepts – apparently 1476 bytes when the receiver is a Chromium browser. Instead, they rely on sending “packet trains” and minimizing the cost of using the UDP socket API.

Once the implementation starts using packet trains, the size of individual packets matters much less. You do get some benefit from not running the packet composition tasks so often, but these are relatively minor components of the CPU load. There is a small per packet overhead for encryption, but the bulk of the CPU load is proportional to the amount of bytes being encrypted, which means that larger packets only bring modest benefits. The main value is the reduction of per packet overhead: 40 bytes for IPv6, 16 bytes for UDP, maybe 12 bytes for QUIC, 16 bytes for the AEAD checksum, that’s 84 bytes out of 1500 hundred bytes packet, about 6%. It would only be 2.1% of a 4096 bytes packet, a gain in performance of about 4%.

To answer the initial question, I do not believe that a modest performance gain of 4% or even 6% is sufficient to justify the security risk of running fragmentation and reassembly at the IPv6 layer. Fragmentation not protected by encryption open the path for denial of service attacks in because spoofed fragments cannot be easily distinguished from valid fragments. The performance loss from these DOS attacks would be much higher than 6%!

Suspending the exponential backoff

Posted on 27 Nov 2023

I just spent some time revisiting how the picoquic code handles failing links. The QUIC specification in RFC 9002 makes reference to exponential backoff. If a node does not obtain any ACK for some time, the “Probe timeout” (PTO) fires. The node sends an extra packet to “probe” the state of the path, and solicit an ACK. If the ACK arrive, it should acknowledge all the pending packets that have been received, and the stack can start repeating those that were not. If the ACK does not arrive, the node sends a new probe and doubles the timeout. The node will keep waiting and doubling the timeout until it gives up, and consider the connection broken. This is known as the “exponential backoff” algorithm.

The exponential backoff algorithm is considered essential for protecting against congestion. If a path is congested, we want all connection sharing the bottleneck to quickly slow down in order to ease the congestion. That’s fine, but there is an underlying hypothesis that the packet losses are caused by either congestion or line cuts, and that in both cases sending feer packets is good. But we very fequently see other kind of cuts in wireless, for example when a mobile device looses wireless signal temporarily.

Exponential backoff is a bad way to deal with suspension of service. The timeout grows ever larger, and thus the probes for connectivity ever less frequent. On average, the first probe to go through will arrive half a timeout after the suspension stops. Take the example in which the suspension lasts 10 seconds, and the initial timeout is 1s. We will get the following sequence showing a 6 second delay between the repair of the connectivity and the first successful probe.

Time Event
0 connectivity suspended
1 Probe #1, fails
2 Probe #2, fails
4 Probe #3, fails
8 Probe #4, fails
10 connectivity repaired
16 Probe #5, succeeds 6 seconds “too late”

Intuitively, it seems that the first steps of doubling the timeout make sense. At that point the nodes don’t know whether connectivity has failed, or whether packet losses are due to a congestion event that is causing very large delays. Waiting twice of four times longer makes sense, because if there is congestion slowing traffic is the right response. But if no response comes after several trials, the path is probably experiencing a cut or a suspension. In that case, the timeout should be kept reasonably short, so the repair of the path is detected quickly. The question of course is, how short? What would be the reasonable value of a maximum PTO? I argue here that it should be a fraction of the “idle timeout”.

QUIC sessions start by the negotiation of an “idle timeout” between client and server. The idle timeout determines how long each endpoint waits before considering that the connection has failed. It determines how long the endpoints will wait before giving up on the connection. There is no point in computing a PTO that would extend after the idle timeout fires. But the idle timeout also determines the longest suspension that the application will tolerate. Setting the maximum PTO to a fraction of the idle timeout limits the “wasted time” between connectivity repair and first successful probe to a value in line with the maximum suspension that the application will accept.

The inverse of the fraction determines the number of probes sent before the node gives up. Sending too many probes is not desirable, if only because it uses too much resource like battery power. Plausible values could be something like 1/8th or 1/16th, maybe with some safety to guarantee that the timeout is always greater than the original PTO. In the latest version of Picoquic, I picked 1/16th, and all the tests seem to be passing fine, including the tests checking detection of connectivity loss and repair.

I don’t think this going away from full exponential backoff will cause the congestion collapse of the Internet.