Christian Huitema's Latest Posts


The list of past posts, including those previously published on Wordpress, is available here.

QUIC timeouts and Handshake Interop

Posted on 19 Jan 2024

Marten Seemann made a great contribution to QUIC interoperability by setting up the QUIC Interop Runner. The site runs series of interoperability tests between participating QUIC implementations (17 of them as I write this) and reports the results in a large matrix. It is a nice complement to the implementations' internal tests, and it was flagging an interesting issue: the test L1 was failing between the ngtcp2 client and the picoquic server.

The test codenamed L1 verifies that implementations can successfully establish connections in the presence of high packet loss. The test consists of 50 successive connection attempts, each followed by the download of a short 1KB document. The connections run over a network simulation programmed to drop 30% of packets. The test succeeds if all 50 connections succeed and all 50 documents are retrieved.

In the “ngtcp2 to picoquic” tests, all documents were properly downloaded, but the analysis of the traffic showed 51 connection attempts instead of the expected 50, and thus the test was marked as failing. It took me a while to parse the various logs and understand why this was happening, but it turned out to be a timeout issue. In one of the 50 attempts, the client's handshake packets had to be retransmitted for so long that the picoquic server, whose handshake timer was set to 30 seconds, gave up and discarded the connection context; the retransmission that finally arrived was then treated as a brand new connection, which accounts for the 51st attempt.

Nobody is really at fault here: ngtcp2 behaves exactly as the standard mandates, and it is perfectly legal for the picoquic server to drop contexts after an absence of activity for some period. In fact, servers should do just that to protect against DoS attacks. But explaining to the testers that “we are failing your test because it is too picky” is kind of hard. There was a simpler fix: just configure picoquic to use a longer handshake timer, 180 seconds instead of 30. With that, the context is still present when the finally successful repeated packet arrives, picoquic creates just one connection, and everybody is happy.

But still, picoquic was using a short handshake timer for a reason: if connections are failing, it makes sense to clean them up quickly. The L1 test between a picoquic client and a picoquic server was passing despite the short timers, because picoquic’s loss recovery process is more aggressive than what the standard specifies. The standard specifies a conservative strategy that uses “exponential backoff”, doubling the value of the timer after each failure, which gives the following timeline:

Time (ms)   Attempt   Timeout (ms)
    0          1           300
  300          2           600
  900          3          1200
 2100          4          2400
 4500          5          4800
 9300          6          9600
18900          7         19200
38100          8         38400
76500          9         76800
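
For illustration, here is a small Python sketch (my own, not taken from any particular implementation) that reproduces this schedule, assuming each attempt is sent when the previous timer fires:

def standard_backoff(initial_pto_ms=300, attempts=9):
    """Probe schedule with plain exponential backoff: each attempt starts when
    the previous timer fires, and the timer doubles after every failure."""
    schedule, time_ms, timeout_ms = [], 0, initial_pto_ms
    for attempt in range(1, attempts + 1):
        schedule.append((time_ms, attempt, timeout_ms))
        time_ms += timeout_ms
        timeout_ms *= 2
    return schedule

# [(0, 1, 300), (300, 2, 600), (900, 3, 1200), ..., (76500, 9, 76800)]
print(standard_backoff())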

Picoquic deviates from that strategy, as discussed in Suspending the Exponential Backoff. The timeline is much more aggressive:

Time (ms)   Attempt   Timeout (ms)   Notes
    0          1           250
  250          2           250       Not doubling on first PTO
  500          3           500
 1000          4          1000
 2000          5          1875       Timer capped to 1/16th of the 30-second handshake timer
 3875          6          1875
 5750          7          1875
 7625          8          1875
 9500          9          1875

After configuring the handshake timer to 180 seconds, the Picoquic sequence is still more aggressive than the standard, but the difference is smaller:

Time (ms)   Attempt   Timeout (ms)   Notes
    0          1           250
  250          2           250       Not doubling on first PTO
  500          3           500
 1000          4          1000
 2000          5          2000
 4000          6          4000
 8000          7          8000
16000          8         11250       Timer capped to 1/16th of the 180-second handshake timer
27250          9         11250
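
The same sketch, adjusted to skip the doubling on the first PTO and to cap the timer at 1/16th of the handshake timer, reproduces the shape of the two schedules above. This is my own reconstruction for illustration, not the actual picoquic logic:

def capped_backoff(initial_pto_ms=250, handshake_timeout_ms=30_000, attempts=9):
    """Backoff that skips the doubling on the first PTO and caps the timer
    at 1/16th of the handshake timeout."""
    cap_ms = handshake_timeout_ms // 16
    schedule, time_ms, timeout_ms = [], 0, initial_pto_ms
    for attempt in range(1, attempts + 1):
        schedule.append((time_ms, attempt, timeout_ms))
        time_ms += timeout_ms
        if attempt >= 2:                          # the first PTO is not doubled
            timeout_ms = min(timeout_ms * 2, cap_ms)
    return schedule

print(capped_backoff(handshake_timeout_ms=30_000))   # timer capped at 1875 ms
print(capped_backoff(handshake_timeout_ms=180_000))  # timer capped at 11250 ms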

In our test, it seems that being only moderately more aggressive than the standard produced the behavior that the testers expected. In real life, I think that the intuitions developed in the previous blog post still hold. It is just that, for the test, we have to please the protocol police…

Optimizing QUIC performance

Posted on 12 Dec 2023

I am following the discussions on the IETF mailing lists, and I got engaged in a review of a proposal by Fred Templin to better implement packet fragmentation and reassembly at the IPv6 layer. I am not a great fan of that proposal, largely because of the many arguments against IP-level fragmentation explained in RFC 8900 and the security issues described in RFC 7739. I think it is a bad idea to try to perform segmentation and reassembly outside the protection of encryption. But then, what about the argument that using larger packet sizes, with or without network-level fragmentation and reassembly, would improve performance?

Fred Templin reports in his draft that tests with UDP and TCP performance tools show improvements with larger packet sizes, but that the “qperf” tool does not show any such gain for QUIC, because “qperf limits its packet sizes to 1280 octets.” Fred then concludes that this is the reason why, in his benchmarks, QUIC is much slower than TCP or raw UDP. For the record, the QUIC Perf protocol specification does not mandate a packet size. Many QUIC implementations do implement path MTU discovery, a.k.a. DPLPMTUD, and could use larger packet sizes. If they did so, would that improve performance?

Some QUIC implementations have tried larger packet sizes. A prominent example is the “litespeed” implementation, whose team described how to improve QUIC performance with DPLPMTUD. They found that increasing the packet size from 1280 to 4096 bytes results in a sizeable performance gain, from 400 Mbps to over 600 Mbps. But we should qualify: these gains were observed in their specific implementation. Other implementations use different approaches.

Processing of packets goes through multiple steps such as preparing QUIC headers, processing and sending control frames, copying data frames, encrypting and decrypting data. We analyze the performance of implementations using tools like “flame graph” to measure which components of the software consume the most CPU. In 2019, everybody pretty much assumed that the most onerous component was encryption. We were wrong, and somewhat shocked to observe that 70 to 80% of the CPU load was consumed in the socket layer, in calls to ‘sendto’ or ‘recvfrom’, or maybe ‘sendmsg’ and ‘recvmsg’.

The study by Alessandro Ghedini, documented in a Cloudflare blog post (https://blog.cloudflare.com/accelerating-udp-packet-transmission-for-quic/), was one of the first to show the importance of the UDP socket API. The blog discusses three successive improvements:

This series of improvements allowed them to drive the performance of their implementation from 640 Mbps to 1.6 Gbps. (The litespeed values were well below that.)

The implementation of QUIC with the highest performance is probably msquic, by a team at Microsoft.
They publish performance reports in a variety of environments, routinely showing data rates of 6 to 7 Gbps. On top of the improvements described by Alessandro Ghedini, the msquic implementation uses multiple CPU threads, and pays great attention to details such as memory allocation, flow control parameters, and cryptographic APIs, as detailed by Nick Banks in this report.

I followed pretty much the same recipes to drive the performance of the picoquic implementation to between 1.5 and 2 Gbps. Contrary to Cloudflare, the picoquic implementation relies only on sendmsg and GSO, not sendmmsg, but I spent a good amount of time studying the interaction between congestion control, flow control, pacing, and the formation of packet trains. A difference with msquic is that the picoquic implementation uses a single thread, the idea being that deployments that want higher performance can run several servers in parallel, each in its own thread, and balance the incoming load.

Neither Cloudflare nor Microsoft conditions their performance on changing the packet size. One reason is that even with DPLPMTUD, the packet size is limited by the maximum packet size that the receiver accepts – apparently 1476 bytes when the receiver is a Chromium browser. Instead, they rely on sending “packet trains” and minimizing the cost of using the UDP socket API.
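
On Linux, the packet-train approach typically relies on UDP Generic Segmentation Offload (GSO), where a single system call hands the kernel a buffer containing many equally sized packets. The following Python sketch shows the general shape of such a call; it is only an illustration, the SOL_UDP and UDP_SEGMENT values are taken from the Linux kernel headers, and none of this is code from the implementations discussed above:

import socket
import struct

SOL_UDP = 17       # IPPROTO_UDP, used as the cmsg level
UDP_SEGMENT = 103  # Linux GSO option from linux/udp.h

def send_packet_train(sock, dest, packets, segment_size):
    """Hand a whole train of QUIC packets to the kernel in one sendmsg() call.
    The kernel (or the NIC) splits the buffer into UDP datagrams of
    segment_size bytes, so the per-call socket cost is paid once per train."""
    # All packets must be segment_size bytes long, except possibly the last one.
    buf = b"".join(packets)
    ancillary = [(SOL_UDP, UDP_SEGMENT, struct.pack("H", segment_size))]
    return sock.sendmsg([buf], ancillary, 0, dest)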

Once the implementation starts using packet trains, the size of individual packets matters much less. You do get some benefit from not running the packet composition tasks so often, but these are relatively minor components of the CPU load. There is a small per-packet overhead for encryption, but the bulk of the CPU load is proportional to the number of bytes being encrypted, which means that larger packets only bring modest benefits. The main value is the reduction of per-packet overhead: 40 bytes for IPv6, 8 bytes for UDP, maybe 12 bytes for QUIC, 16 bytes for the AEAD checksum. That is 76 bytes out of a 1500-byte packet, about 5%. It would only be about 1.9% of a 4096-byte packet, for a gain in performance of roughly 3%.
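
A quick back-of-the-envelope check of that overhead arithmetic in Python, using the per-packet header sizes listed above:

HEADERS = {"IPv6": 40, "UDP": 8, "QUIC": 12, "AEAD checksum": 16}   # bytes per packet
overhead = sum(HEADERS.values())                                    # 76 bytes

for size in (1500, 4096):
    print(f"{size}-byte packets: {overhead / size:.1%} per-packet overhead")
# 1500-byte packets: 5.1% per-packet overhead
# 4096-byte packets: 1.9% per-packet overhead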

To answer the initial question, I do not believe that a modest performance gain of a few percent is sufficient to justify the security risk of running fragmentation and reassembly at the IPv6 layer. Fragmentation that is not protected by encryption opens the path to denial of service attacks, because spoofed fragments cannot be easily distinguished from valid fragments. The performance loss from these DoS attacks would be much higher than a few percent!

Suspending the exponential backoff

Posted on 27 Nov 2023

I just spent some time revisiting how the picoquic code handles failing links. The QUIC specification in RFC 9002 makes reference to exponential backoff. If a node does not obtain any ACK for some time, the “probe timeout” (PTO) fires. The node sends an extra packet to “probe” the state of the path and solicit an ACK. If the ACK arrives, it acknowledges all the pending packets that were received, and the stack can start repeating those that were not. If the ACK does not arrive, the node sends a new probe and doubles the timeout. The node keeps waiting and doubling the timeout until it gives up and considers the connection broken. This is known as the “exponential backoff” algorithm.

The exponential backoff algorithm is considered essential for protecting against congestion. If a path is congested, we want all connections sharing the bottleneck to quickly slow down in order to ease the congestion. That’s fine, but there is an underlying hypothesis that packet losses are caused by either congestion or line cuts, and that in both cases sending fewer packets is good. But we very frequently see other kinds of cuts in wireless networks, for example when a mobile device temporarily loses its wireless signal.

Exponential backoff is a bad way to deal with such suspensions of service. The timeout grows ever larger, and thus the probes for connectivity become ever less frequent. On average, the first probe to go through will arrive half a timeout after the suspension stops. Take the example in which the suspension lasts 10 seconds and the initial timeout is 1 second. We get the following sequence, showing a 6-second delay between the repair of the connectivity and the first successful probe:

Time (s)   Event
 0         Connectivity suspended
 1         Probe #1, fails
 2         Probe #2, fails
 4         Probe #3, fails
 8         Probe #4, fails
10         Connectivity repaired
16         Probe #5, succeeds, 6 seconds “too late”
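
The same arithmetic in a few lines of Python, assuming the suspension starts at time 0 and each probe is sent twice as far from the start as the previous one:

def detection_delay_s(suspension_duration_s, initial_timeout_s=1.0):
    """How late the first successful probe arrives after connectivity is repaired,
    when the probe times double from the initial timeout."""
    probe_time = initial_timeout_s
    while probe_time < suspension_duration_s:   # probe falls inside the suspension: it fails
        probe_time *= 2
    return probe_time - suspension_duration_s

print(detection_delay_s(10))   # 6.0 -- probes at 1, 2, 4, 8 fail, the one at 16 succeeds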

Intuitively, the first steps of doubling the timeout make sense. At that point the nodes do not know whether connectivity has failed, or whether packet losses are due to a congestion event that is causing very large delays. Waiting twice or four times longer makes sense, because if there is congestion, slowing down the traffic is the right response. But if no response comes after several trials, the path is probably experiencing a cut or a suspension. In that case, the timeout should be kept reasonably short, so that the repair of the path is detected quickly. The question of course is, how short? What would be a reasonable value for the maximum PTO? I argue here that it should be a fraction of the “idle timeout”.

QUIC sessions start with the negotiation of an “idle timeout” between client and server. The idle timeout determines how long an endpoint waits, in the absence of any traffic, before considering that the connection has failed and giving up on it. There is no point in computing a PTO that would extend past the moment when the idle timeout fires. But the idle timeout also determines the longest suspension that the application will tolerate. Setting the maximum PTO to a fraction of the idle timeout limits the “wasted time” between connectivity repair and the first successful probe to a value in line with the maximum suspension that the application will accept.

The inverse of the fraction determines the number of probes sent before the node gives up. Sending too many probes is not desirable, if only because it consumes resources such as battery power. Plausible values could be something like 1/8th or 1/16th, maybe with some safety margin to guarantee that the timeout is always greater than the original PTO. In the latest version of picoquic, I picked 1/16th, and all the tests seem to be passing fine, including the tests checking detection of connectivity loss and repair.

I don’t think that moving away from full exponential backoff will cause the congestion collapse of the Internet.

Update QUIC timers once per RTT

Posted on 22 Jun 2023

In a previous post, I observed that the classic way to compute RTT statistics and retransmission timers does not work well. When acknowledgements are too frequent, the correlation between successive RTT measurements causes the smoothed RTT estimate to track the last values too closely, and the RTT variations to be greatly underestimated. This imprecision affects both the effectiveness of loss recovery and the implementation of delay-based congestion algorithms like BBR. I think this can be fixed by updating the variables just “once per RTT”. In this post, I first describe the once per RTT update algorithm, then report on promising measurements using the simulations in the picoquic test suite.

The once per RTT update algorithm

The goal of the algorithm is to update the smoothed_rtt, rttvar and PTO variables just once per RTT, instead of on every acknowledgement received.
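
As a rough sketch of how such an update can operate (my own simplification for illustration, not the actual picoquic code), samples can be accumulated during a round trip and folded into the estimates once a full RTT has elapsed since the previous update:

class OncePerRttEstimator:
    """Sketch of the idea: fold RTT samples into smoothed_rtt and rttvar only
    once per round trip, instead of on every acknowledgement."""

    def __init__(self, first_sample):
        self.smoothed_rtt = first_sample
        self.rttvar = first_sample / 2
        self.period_max = first_sample
        self.next_update = None   # time at which the next once-per-RTT update is due

    def on_rtt_sample(self, now, sample):
        self.period_max = max(self.period_max, sample)
        if self.next_update is None:
            self.next_update = now + self.smoothed_rtt
        elif now >= self.next_update:
            # Once per RTT, apply the classic filters to one representative sample
            # (here the largest sample of the period; variants differ on this point).
            deviation = abs(self.smoothed_rtt - self.period_max)
            self.rttvar = 0.75 * self.rttvar + 0.25 * deviation
            self.smoothed_rtt = 0.875 * self.smoothed_rtt + 0.125 * self.period_max
            self.period_max = sample
            self.next_update = now + self.smoothed_rtt

The choice of the representative sample (latest, maximum, or something else) is one of the knobs explored in the variants evaluated below.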

Evaluating the once per RTT algorithm

After coding the once per RTT algorithm in picoquic, I ran the picoquic test suite to evaluate the results. The test suite includes a couple hundred simulations of QUIC connections under various conditions. Each of these tests verifies results such as the expected simulated time for completing the simulated transfers, or sometimes a maximum number of retransmission errors or a maximum number of spurious retransmissions. All these tests passed, which means that the new algorithm stayed well within the tolerances set in the test suite. Looking at the details, I saw that:

The big differences happened in tests simulating connection events such as link breaks in multipath scenarios. In these cases, having a better estimate of the RTT helps react sooner. In single-path scenarios, link breaks result in connection failures and the measurements are not very interesting, but the test suite includes “MTU Drop” tests that simulate a change in the path MTU size and measure how long it takes to react to the event and continue the session with the new MTU. This reaction is also very dependent on RTT estimates, and I analyzed it in detail.

Comparison of RTT algorithms in MTU-Drop test

The “MTU Drop” test simulates the download of a 1MB file over a 1Mbps link, with the transmission delay set to 100ms in each direction, for a nominal RTT of 200ms. Initially, the path MTU is set to 1500 bytes. The code simulates the establishment of the connection, then waits for 1 second and verifies that path MTU discovery was performed and that the connection is using a packet size of 1440 bytes. At that point, the path MTU is reset to 1394 bytes. The test verifies that the code notices the drop in size, renegotiates an appropriate path MTU, and completes the download in the expected time.

The goal of the test is to verify the PMTU adaptation, but that adaptation depends on the packet loss recovery algorithm, which itself depends on the computation of the RTT and of the retransmission timer. We evaluate two ways of computing these timers:

When using the once per RTT computation, we evaluate three variants:

The results of the four test runs are summarized in the following table:

RTT method    Completion (s)   Delta time   Losses   Delta losses
Classic            11.3            0.0%       79          0%
Once 4V            10.7           -5.6%       50        -37%
Once MaxRTT        11.5            1.3%       48        -39%
Once 3V            10.5           -7.2%       48        -39%

We see that the “Once 3V” variant performs better than the classic implementation: the download completes faster, and there are significantly fewer packet losses.

It is hard to understand exactly why one specific tuning works better than the others. When looking at “Once 3V” versus “Classic”, it seems that most of the difference happened during the initial “start up” phase. The log of the “Once 3V” trial shows it exiting start-up 631ms after the beginning of the connection, with the smoothed RTT set at 224ms. The “Classic” log shows it exiting start-up 991ms after the start, with the smoothed RTT set at 353ms. By the time the MTU drop is simulated, the “Classic” connection has the smoothed RTT at 407ms, while the “Once 3V” connection has it at 217ms. The larger RTT means that the “Classic” connection takes almost twice as long to detect the packet losses caused by the MTU drop, and recovers almost twice as slowly.

That same reasoning explains why setting the retransmission timer from the max RTT does not work so well. In our samples, the maximum observed RTT happens at the end of start-up, before the RTT stabilizes at lower values. The retransmission timer set from the max RTT is too large, and losses are not detected quickly enough.

How many deviations in the timer?

We thus arrive at the notion that “large retransmission timers slow down recovery”. That is the observation that pushes us to deviate from the original formula:

retransmission timer = smoothed_RTT + 4 * rtt_var + ack_delay

That formula seems pretty arbitrary. It is related to the classic observation that, for a Gaussian distribution, samples more than 4 standard deviations away from the average happen less than 0.01% of the time, which seems a good way to minimize “spurious packet loss detection”. But it is far from clear that the distribution of RTT samples is Gaussian, and the computation of rtt_var is only a vague approximation of a standard deviation. In any case, the use of N times rtt_var is a tradeoff: fewer spurious detections versus longer recovery times from errors.
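
For reference, assuming a Gaussian distribution (which, as noted, RTT samples do not follow), the two-sided tail probabilities for 3 and 4 deviations can be computed as follows:

import math

def two_sided_tail(k):
    """Probability that a Gaussian sample falls more than k standard deviations
    away from the mean, on either side."""
    return math.erfc(k / math.sqrt(2))

print(f"3 deviations: {two_sided_tail(3):.4%}")   # about 0.27%
print(f"4 deviations: {two_sided_tail(4):.4%}")   # about 0.006%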

In QUIC, most loss detections are based on packet number comparisons, not timers. This alone makes spurious loss detections fairly rare. In addition, congestion control algorithms like BBR recover much better from spurious loss detections than old algorithms like Reno. It is probably time to revisit the tradeoff. Our tests show that using a coefficient of 3 rather than 4 speeds up error detection without visible side effects.

Further work

The “once per RTT” algorithm design appears sound, but the evaluations are based on simulations. The next step is to work with other QUIC developers and see whether the algorithm can be validated in large-scale deployments.

QUIC timers don’t work well

Posted on 13 Jun 2023

Last week, the QUIC Working Group was reviewing an erratum for RFC 9002, the description of loss recovery and congestion control for QUIC. There was an error in the description of the algorithm used to compute the variable rttvar, which describes the expected variation of the round-trip time (RTT). The order of the instructions was wrong, leading to underestimating rttvar by 1/8th. That is of course a bug, but the discussion showed that most implementations had already fixed their code. The discussion also generated quite a few comments on the algorithm itself. It dates from 1988, and it makes hidden hypotheses about the distribution and randomness of RTT measurements. These hypotheses are wrong, essentially because RTT measurements are highly correlated, which leads to errors much larger than 1/8th.

RFC 9002 describes the loss recovery and congestion control algorithms recommended for QUIC. It includes the computation of smoothed_rtt and rttvar, using the algorithm described in RFC 6298, which is itself copied from the TCP specification updated in 1988 by Van Jacobson:

rttvar_sample = abs(smoothed_rtt - rtt_sample)
rttvar = 3/4 * rttvar + 1/4 * rttvar_sample
smoothed_rtt = 7/8 * smoothed_rtt + 1/8 * rtt_sample
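
The erratum mentioned above concerns the order of these instructions: if smoothed_rtt is updated before the deviation is measured, the rttvar sample shrinks to 7/8th of its true value. A small Python illustration of the two orders:

def update_rtt_correct(smoothed_rtt, rttvar, sample):
    # Measure the deviation against the old smoothed_rtt, then update it.
    rttvar_sample = abs(smoothed_rtt - sample)
    rttvar = 3/4 * rttvar + 1/4 * rttvar_sample
    smoothed_rtt = 7/8 * smoothed_rtt + 1/8 * sample
    return smoothed_rtt, rttvar

def update_rtt_swapped(smoothed_rtt, rttvar, sample):
    # Updating smoothed_rtt first shrinks the measured deviation by 1/8th.
    smoothed_rtt = 7/8 * smoothed_rtt + 1/8 * sample
    rttvar_sample = abs(smoothed_rtt - sample)
    rttvar = 3/4 * rttvar + 1/4 * rttvar_sample
    return smoothed_rtt, rttvar

print(update_rtt_correct(100.0, 0.0, 200.0))   # (112.5, 25.0)
print(update_rtt_swapped(100.0, 0.0, 200.0))   # (112.5, 21.875), i.e. 7/8th of 25.0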

These variables are then used to compute the Retransmission Time Out (in TCP) or the Probe Time Out (in QUIC), using formulas such as:

PTO = smoothed_rtt + max(4*rttvar, timer_granularity) + max_ack_delay

The formulas have two big advantages: they can be computed in a few instructions, which mattered a lot in 1988, and they provide a better evaluation of retransmission delays than the pre-1988 formulas studied by Lixia Zhang in “Why TCP Timers Don’t Work Well”. But they make a set of hidden hypotheses:

[Figure: RTT timeline of a typical QUIC connection]

In practice, these hypotheses are not verified. Look for example at the graph above, showing the evolution of the RTT in a typical QUIC connection. The RTT measurements do not follow a random distribution governed by a bell curve. There is a strong correlation between successive samples, which causes rttvar to converge to a very small value.

We also see that the smoothed_rtt variable closely tracks the measurements. In this example, the connection was running at about 75Mbps, the RTT was about 140ms, and there were usually more than 100 packets in flight. The client implementation was sending frequent acknowledgements, which means that the exponential filter was averaging the RTT measurements over a fraction of the RTT.

We also see that the RTT measurements vary mostly with the sending rate of the server. We see a succession of phases:

If the server was using Reno, the third phase would exhibit the “saw tooth” pattern caused by the congestion avoidance algorithm. If it was using Cubic, we would see periodic spikes of the RTT at the end of each Cubic “epoch”. But just like with BBR, we would see a high correlation between successive samples. In all these cases, the combination of a short-term smoothed RTT and a minimal rttvar is a poor predictor of the next transmission delay.

In the example, there is obviously no competing connection sharing the same path. This is a pretty common occurrence, but there will be connections in which competition happens. Suppose for example that a new connection starts with identical characteristics. The new connection will experience high RTTs during its start-up phase, and the old connection will also see these higher RTTs. The retransmission timers will not account for such variability, which can easily lead to spurious retransmissions.

I think the timer computation should be fixed to provide robust estimates of the retransmission delays even when acknowledgements are too frequent and RTT samples are correlated. We should also find a way to estimate the maximum plausible delay without making undue hypotheses about the statistical distribution of the delay samples. We should probably replace the current smoothed_rtt and rttvar with a combination of variables, such as:

For this last point, we have to define the period over which the maximum RTT is evaluated. Maintaining a maximum since the beginning of the connection is plausible, but probably leads to over-estimates for connections of long duration. Given that the RTT variations are mostly driven by the congestion control algorithm, the number of RTTs over which to compute statistics should probably depend on the algorithm. For BBR, this means covering a complete cycle, including the probing epochs. For Reno, it probably means covering at least a couple of the “saw teeth”. For Cubic, probably a couple of epochs. A simple solution could be to compute a maximum delay over a rolling series of recent epochs, and to use that as the RTO or PTO.
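
As a sketch of that last suggestion (my own illustration, not an existing picoquic feature), a rolling maximum over the last few congestion control epochs could look like this:

from collections import deque

class RollingMaxRtt:
    """Track the largest RTT seen in each of the last few congestion control
    epochs, and report the maximum across them as a base for the PTO."""

    def __init__(self, epochs=4):
        self.per_epoch_max = deque(maxlen=epochs)
        self.current_epoch_max = 0.0

    def on_rtt_sample(self, rtt):
        self.current_epoch_max = max(self.current_epoch_max, rtt)

    def on_epoch_end(self):
        # Called by the congestion controller at the end of each epoch
        # (a BBR probing cycle, a couple of Reno saw teeth, a few Cubic epochs...).
        self.per_epoch_max.append(self.current_epoch_max)
        self.current_epoch_max = 0.0

    def max_rtt(self):
        return max(self.per_epoch_max, default=self.current_epoch_max)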

Luckily, in most cases, the poor quality of the PTO estimate does not affect QUIC performance too much: most packet losses are detected because further packets were acknowledged first, per section 6.1 of RFC 9002. But having a correct value of the PTO still matters in two cases: if the last packet in an exchange is lost, or if connectivity on a path is lost.

This needs experimenting and collecting results. But first it means that, 35 years after 1988, we should recognize that QUIC timers do not work well.