In a previous post I observed that the classic way to compute RTT statistics and retransmission timers does not work well. When acknowledgements are too frequent, the correlation between successive RTT measurements causes the smoothed RTT estimate to closely track the last values, and the RTT variations to be widely underestimated. This imprecision impacts both the effectiveness of loss recovery and the implementation of delay based congestion algorithms like BBR. I think this can be fixed by updating the variables just “once per RTT”. In this post, I first describe the once per RTT update algorithm, then report on promising measurements using the simulations in the Picoquic test suite.
The goal of the algorithm is to update the smoothed_rtt, rttvar and PTO variables just once per RTT.
The once per RTT update algorithm operates as follows:
smoothed_rtt = kInitialRtt
rttvar = kInitialRtt / 2
rtt_sample_number_in_epoch = 0
rtt_epoch_pn = last_packet_number_sent + 1
rtt_epoch_nb_samples = 0
rtt_epoch_sum_samples = 0
rtt_epoch_sample_max = 0
rtt_epoch_sample_min = UINT64_MAX
smoothed_rtt = latest_rtt
rttvar = latest_rtt / 2
min_rtt = latest_rtt
adjusted_rtt = latest_rtt
if (latest_rtt >= min_rtt + ack_delay):
    adjusted_rtt = latest_rtt - ack_delay
rtt_epoch_nb_samples += 1
rtt_epoch_sum_samples += adjusted_rtt
rtt_epoch_sample_max = max(adjusted_rtt, rtt_epoch_sample_max)
rtt_epoch_sample_min = min(adjusted_rtt, rtt_epoch_sample_min)
min_rtt = min(min_rtt, adjusted_rtt)
rttvar_sample is set to the maximum value of the difference between the smoothed_rtt and the maximum or minimum RTT seen during the epoch.
rttvar_sample = max(abs(rtt_epoch_sample_max - smoothed_rtt),
                    abs(rtt_epoch_sample_min - smoothed_rtt))
rttvar = 3/4 * rttvar + 1/4 * rttvar_sample
rtt_epoch_estimate = rtt_epoch_sum_samples / rtt_epoch_nb_samples
smoothed_rtt = 7/8 * smoothed_rtt + 1/8 * rtt_epoch_estimate
PTO = smoothed_rtt + 3*rttvar + max_ack_delay
Then, reinitialize the per-epoch variables.
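Putting the pieces together, here is a minimal Python sketch of the once per RTT update. Variable names follow the pseudocode above; the kInitialRtt value and the exact trigger for calling on_epoch_end (typically, an ACK covering the first packet sent in the current epoch) are assumptions of this sketch, not a definitive implementation.

```python
K_INITIAL_RTT = 333  # milliseconds, the RFC 9002 default initial RTT

class OncePerRttEstimator:
    def __init__(self):
        self.smoothed_rtt = K_INITIAL_RTT
        self.rttvar = K_INITIAL_RTT / 2
        self.min_rtt = float('inf')
        self.first_sample = True
        self.reset_epoch()

    def reset_epoch(self):
        # Per-epoch accumulators, reinitialized once per RTT.
        self.epoch_nb_samples = 0
        self.epoch_sum_samples = 0
        self.epoch_sample_max = 0
        self.epoch_sample_min = float('inf')

    def on_rtt_sample(self, latest_rtt, ack_delay):
        # Accumulate samples during the epoch; no filter update yet.
        if self.first_sample:
            self.first_sample = False
            self.smoothed_rtt = latest_rtt
            self.rttvar = latest_rtt / 2
            self.min_rtt = latest_rtt
        adjusted_rtt = latest_rtt
        if latest_rtt >= self.min_rtt + ack_delay:
            adjusted_rtt = latest_rtt - ack_delay
        self.epoch_nb_samples += 1
        self.epoch_sum_samples += adjusted_rtt
        self.epoch_sample_max = max(adjusted_rtt, self.epoch_sample_max)
        self.epoch_sample_min = min(adjusted_rtt, self.epoch_sample_min)
        self.min_rtt = min(self.min_rtt, adjusted_rtt)

    def on_epoch_end(self, max_ack_delay=0):
        # Called once per RTT: update the filters, return the PTO.
        rttvar_sample = max(abs(self.epoch_sample_max - self.smoothed_rtt),
                            abs(self.epoch_sample_min - self.smoothed_rtt))
        self.rttvar = 3 / 4 * self.rttvar + 1 / 4 * rttvar_sample
        epoch_estimate = self.epoch_sum_samples / self.epoch_nb_samples
        self.smoothed_rtt = 7 / 8 * self.smoothed_rtt + 1 / 8 * epoch_estimate
        self.reset_epoch()
        return self.smoothed_rtt + 3 * self.rttvar + max_ack_delay  # PTO
```

Feeding three samples of 100, 110 and 90 ms (with zero ack_delay) and closing the epoch leaves smoothed_rtt at 100 ms, whatever the number of samples, because the filter is applied to the epoch average rather than to each sample.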
After coding the once per RTT algorithm in picoquic, I ran the picoquic test suite to evaluate the results. The test suite includes a couple hundred simulations of QUIC connections under various conditions. Each of these tests verifies results such as the expected simulated time for completing the simulated transfers, or sometimes the maximum number of retransmission errors or the maximum number of spurious retransmissions. All these tests passed, which means that the new algorithm was well within the tolerances set for the old one. Looking at the details, I saw that:
The big differences happened with tests simulating connection events such as link breaks in multipath scenarios. In these cases, having a better estimate of the RTT helps react sooner. In single path scenarios, link breaks result in connection failures and the measurements are not very interesting, but the test suite includes “MTU Drop” tests that simulate a change in the path MTU size and measure how long it takes to react to the event and continue the session with the new MTU. This reaction is also very dependent on RTT estimates, and I analyzed it in detail.
The “MTU Drop” test simulates the download of a 1MB file over a 1Mbps link, with the transmission delay set at 100ms in each direction, for a nominal RTT of 200ms. Initially, the path MTU is set to 1500 bytes. The code simulates the establishment of the connection, waits for 1 second, and verifies that path MTU discovery was performed and that the connection is using a packet size of 1440 bytes. At that point, the path MTU is reset to 1394 bytes. The test verifies that the code notices the drop in size, renegotiates an appropriate path MTU, and that the download completes in the expected time.
The goal of the test is to verify the PMTU adaptation, but that adaptation depends on the packet loss recovery algorithm, which itself depends on the computation of the RTT and the retransmission timer. We evaluate two ways of computing these timers: the classic computation specified in RFC 9002, and the once per RTT computation described above.
When using the once per RTT computation, we evaluate three variants:
Once 4V, in which the retransmission timer is computed as specified in RFC 9002, adding 4 times “RTT_VAR” and the ACK delay to the smoothed RTT value
Once MaxRTT, in which the retransmission timer is set to the MaxRTT plus RTT_VAR plus the ACK delay
Once 3V, which is the same as Once 4V, but only adding 3 times “RTT_VAR” instead of 4.
The results of the four test runs are summarized in the following table:
We see that the “Once 3V” performs better than the classic implementation: the download completes faster, and there are significantly fewer packet losses.
It is hard to understand exactly why one specific tuning works better than others. When looking at “Once 3V” versus “Classic”, it seems that most of the difference happened during the initial “start up” phase. The log of the “Once 3V” trial shows it exiting start-up 631ms after the beginning of the connection, with the “smoothed RTT” set at 224ms. The “Classic” log shows it exiting start up 991ms after the start, with the smoothed RTT set at 353ms. By the time the MTU drop is simulated, the “classic” connection has the smoothed RTT set at 407ms, while the “once 3V” connection has the RTT at 217ms. Because of the larger RTT, the “classic” connection takes almost twice as long to detect the packet losses caused by the MTU drop, and recovers almost twice as slowly.
That same reasoning explains why setting the retransmission timer from the Max RTT does not work so well. In our samples, the maximum observed RTT happens at the end of start up, before stabilizing to lower values. The retransmission timer set from Max RTT is too large, and losses are not detected quickly enough.
How many deviations in the timer?
We thus arrive at the notion that “large retransmission timers slow down recovery”. That’s the observation that pushes us to deviate from the original formula:
retransmission timer = smoothed_RTT + 4 * rtt_var + ack_delay
That formula seems pretty arbitrary. It is related to the classic observation that for a Gaussian distribution, observations more than 4 standard deviations away from the average happen less than 0.01% of the time, which seems a good way to minimize “spurious packet loss detection”. But it is far from clear that the distribution of RTT samples is Gaussian, and the computation of rtt_var is only a vague approximation of computing a standard deviation. In any case, the use of N times rtt_var is a tradeoff: fewer spurious detections versus longer recovery time from errors.
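For intuition, the textbook argument behind the “N times rtt_var” choice is easy to check numerically. This is only an illustration of the Gaussian tail probabilities; as noted above, it says nothing about real RTT distributions:

```python
from math import erf, sqrt

def gaussian_tail(n_sigma):
    # One-sided probability that a sample exceeds the mean by more
    # than n_sigma standard deviations, under a Gaussian assumption.
    return (1 - erf(n_sigma / sqrt(2))) / 2

# At 4 sigma the one-sided tail is about 0.003%; at 3 sigma it is
# about 0.13%. Moving from 4V to 3V trades a slightly higher chance
# of spurious detection for faster loss recovery.
print(f"3 sigma: {gaussian_tail(3):.5%}")
print(f"4 sigma: {gaussian_tail(4):.6%}")
```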
In QUIC, most loss detections are based on packet number comparisons, not timers. This, alone, renders spurious loss detections fairly rare. In addition, congestion control algorithms like BBR recover much better from spurious loss detection than old algorithms like RENO. It is probably time to revisit the tradeoff. Our tests show that using a coefficient of 3 rather than 4 speeds up error detection without visible side effect.
The “once per RTT” algorithm design appears sound, but the evaluations are based on simulations. The next step is to work with other QUIC developers and see whether the algorithm can be validated in large scale deployments.
Last week, the QUIC Working Group was reviewing an errata for RFC
9002, the description of loss recovery
and congestion control for QUIC. There was an error in the description of the algorithm
used to compute the variable
rttvar, which describes the expected variation of the
round trip time (RTT). The order of instructions was wrong, leading to underestimating
rttvar by 1/8th. That’s of course a bug, but the discussion showed that most
implementations had already fixed their code. The discussion also generated quite a few
comments on the algorithm itself. It dates from 1988, and it makes hidden hypotheses
about the distribution and randomness of RTT measurements. Those hypotheses are wrong,
essentially because RTT measurements are highly correlated, which leads to much more
than a 1/8th error.
RFC 9002 describes the loss recovery and congestion control algorithm recommended for QUIC. It includes the computation of RTT and RTTVAR, using the algorithm described in RFC 6298, which is itself copied from the TCP specification as updated in 1988 by Van Jacobson:
rttvar_sample = abs(smoothed_rtt - rtt_sample)
rttvar = 3/4 * rttvar + 1/4 * rttvar_sample
smoothed_rtt = 7/8 * smoothed_rtt + 1/8 * rtt_sample
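The errata concerned the order of those updates: if an implementation updates smoothed_rtt before computing rttvar_sample, the sample is measured against an average that already includes 1/8th of the new measurement, shrinking rttvar. A small sketch of both orders, with made-up sample values for illustration:

```python
def update_correct(smoothed_rtt, rttvar, rtt_sample):
    # RFC 6298 order: rttvar is updated before smoothed_rtt.
    rttvar_sample = abs(smoothed_rtt - rtt_sample)
    rttvar = 3/4 * rttvar + 1/4 * rttvar_sample
    smoothed_rtt = 7/8 * smoothed_rtt + 1/8 * rtt_sample
    return smoothed_rtt, rttvar

def update_buggy(smoothed_rtt, rttvar, rtt_sample):
    # Updating smoothed_rtt first shrinks rttvar_sample by 1/8th.
    smoothed_rtt = 7/8 * smoothed_rtt + 1/8 * rtt_sample
    rttvar_sample = abs(smoothed_rtt - rtt_sample)
    rttvar = 3/4 * rttvar + 1/4 * rttvar_sample
    return smoothed_rtt, rttvar

# Same starting state (smoothed_rtt=100, rttvar=10), same 140 ms sample:
print(update_correct(100.0, 10.0, 140.0))  # rttvar_sample = |100 - 140| = 40
print(update_buggy(100.0, 10.0, 140.0))    # rttvar_sample = |105 - 140| = 35
```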
These variables are then used to compute the Retransmission Time Out (in TCP) or the Probe Time Out (in QUIC), using formulas such as:
PTO = smoothed_rtt + max(4*rttvar, timer_granularity) + max_ack_delay
The formulas have two big advantages: they can be computed in a few instructions, which mattered a lot in 1988, and they provide a better evaluation of retransmission delays than the pre-1988 formulas studied by Lixia Zhang in Why TCP Timers don’t work well. But they make a set of hidden hypotheses:
the computation of the smoothed_rtt with an exponential filter of coefficient 1/8 assumes that only the last 8 measurements really matter,
the computation of the rttvar pretty much assumes that successive measurements are statistically independent,
the computation of RTO or PTO assumes that rttvar is an approximation of the standard deviation of the RTT, and that the distribution of RTT can be described by some kind of regular bell curve.
In practice, these hypotheses are not verified. Look for example at the graph above, showing the evolution of the RTT in a typical QUIC connection. The RTT measurements do not actually follow a random distribution governed by a bell curve. There is a big correlation between successive samples, which causes the rttvar to converge to a very small value.
We also see that the
smoothed_rtt variable closely tracks the measurements. In this
example, the connection was running at about 75Mbps, the RTT was about 140ms, and
there were usually more than 100 packets in flight. The client implementation was sending
frequent acknowledgements, which means that the exponential filter was averaging the RTT
measurements over a fraction of the RTT.
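The effect is easy to reproduce. Feed the classic filter a slowly drifting, highly correlated series of samples and rttvar collapses, even though the RTT swings by 100 ms over the run. This is a synthetic illustration, not a Picoquic trace:

```python
def classic_update(smoothed_rtt, rttvar, rtt_sample):
    # The classic RFC 6298 / RFC 9002 per-sample update.
    rttvar = 3/4 * rttvar + 1/4 * abs(smoothed_rtt - rtt_sample)
    smoothed_rtt = 7/8 * smoothed_rtt + 1/8 * rtt_sample
    return smoothed_rtt, rttvar

# A slow ramp from 140 ms to 239 ms: successive samples differ by
# only 1 ms, so each rttvar_sample stays small even though the
# overall swing is about 100 ms.
smoothed_rtt, rttvar = 140.0, 70.0
for i in range(100):
    smoothed_rtt, rttvar = classic_update(smoothed_rtt, rttvar, 140.0 + i)

# smoothed_rtt closely tracks the last samples; rttvar has collapsed
# to a few milliseconds, a tiny fraction of the actual RTT swing.
print(round(smoothed_rtt, 1), round(rttvar, 1))
```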
We also see that the RTT measurements vary mostly with the sending rate of the server. We see a succession of phases:
the very first measurements during the initial handshake are somewhat higher than average, probably because they incorporate the computing delays caused by the cryptographic processes,
the RTT measurements then grow quickly during the “slow start” ramp-up, until the server acquires an estimate of the connection bandwidth,
in the next period, the server follows the BBR algorithm, which periodically probes for a higher bandwidth, with each probe resulting in a slight increase of the RTT.
If the server was using Reno, the third phase would exhibit a “saw tooth” pattern caused by the “congestion avoidance” algorithm. If it was using Cubic, we would see the periodic spikes of the RTT at the end of each Cubic “epoch”. But just like with BBR, we would see a high correlation between successive samples. In all these cases, the combination of a short term smoothed RTT and a minimal rttvar is a poor predictor of the next transmission delay.
In the example, there is obviously no competing connection sharing the same path. This is a pretty common occurrence, but there will be connections in which competition happens. Suppose for example that a new connection starts with identical characteristics. The new connection will experience high RTTs during its startup phase, and the old connection will also see these higher RTTs. The retransmission timers will not account for such variability, which can easily lead to spurious retransmissions.
I think the timer computation should be fixed to provide robust estimates of the
retransmission delays even if acknowledgements are too frequent and RTT samples are
correlated. We should also find a way to estimate the maximum plausible delay without
making undue hypotheses about the statistical distribution of the delay samples. We
should probably replace the current
rttvar by a combination of
variables, such as:
a short term smoothed_rtt, designed to detect short term variations of the RTT while smoothing out random variations (see for example the discussion of Hystart and wireless links elsewhere in this blog),
a longer term version of the smoothed RTT, designed to provide an estimate of the average RTT and its variation over several round trips, independently of the rate of acknowledgements,
an estimate of the maximum RTT over the recent past, from which we could derive the retransmission delays.
For this last point, we have to define the period over which the maximum RTT shall be evaluated. Maintaining a maximum since the beginning of the connection is plausible but probably leads to over-estimates for connections of long duration. Given that the RTT variations are mostly driven by the congestion control algorithm, the number of RTTs over which to compute statistics should probably depend on the algorithm. For BBR, this means covering a complete cycle, including the probing epochs. For Reno, that probably means covering at least a couple of the “saw teeth”. For Cubic, probably a couple of epochs. A simple solution could be to compute the max delay over a rolling series of recent epochs, and use that as the RTO or PTO.
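A rolling maximum over the last few congestion epochs can be kept with a short ring buffer. A minimal sketch of the idea; the choice of 4 epochs and the epoch-end trigger are illustrative assumptions, not a proposal for a specific value:

```python
from collections import deque

class RollingMaxRtt:
    """Track the max RTT over the last few congestion-control epochs."""
    def __init__(self, nb_epochs=4):
        self.past_epochs = deque(maxlen=nb_epochs)
        self.current_max = 0

    def on_rtt_sample(self, rtt):
        self.current_max = max(self.current_max, rtt)

    def on_epoch_end(self):
        # Called when the congestion controller completes an epoch,
        # e.g. a BBR probing cycle or a Reno saw tooth.
        self.past_epochs.append(self.current_max)
        self.current_max = 0

    def max_rtt(self):
        return max([self.current_max, *self.past_epochs])

tracker = RollingMaxRtt()
for epoch_samples in [[100, 250], [110, 120], [105, 115], [100, 130], [95, 105]]:
    for rtt in epoch_samples:
        tracker.on_rtt_sample(rtt)
    tracker.on_epoch_end()

# The 250 ms outlier from the first epoch has aged out:
# only the last 4 epoch maxima are retained.
print(tracker.max_rtt())
```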
Luckily, in most cases, the poor quality of the PTO estimate does not affect QUIC performance too much: most packet losses are detected because further packets were acknowledged first, per section 6.1 of RFC 9002. But having a correct value of the PTO still matters in two cases: if the last packet in an exchange is lost, or if connectivity on a path is lost.
This needs experimenting and collecting results. But first it means that, 35 years after 1988, we should recognize that QUIC timers do not work well.
About two weeks ago, I was told by developers of “Media over QUIC” that there was an issue when running over Wi-Fi. After a few seconds, there would be some kind of event, triggering the congestion control implemented in Picoquic to reduce the bandwidth, and resulting in pretty bad performance. It seemed due to issues with the Wi-Fi driver on the Mac, as I wrote in a toot on Mastodon. Now that I am less busy with other projects, I have the time to measure the issue in detail.
The figure above shows the evolution of the round trip time (RTT) between two computers in my office: an iMac running macOS Ventura 13.3.1, and a Dell laptop running Windows 11. The measurements were taken with a simple program that generated UDP packets every 20ms on the iMac, sent them over Wi-Fi to the laptop, and received an echo from the laptop. The program logged the time at which the packet was sent, the time at which the laptop sent the echo, and the time at which the echo was received. The RTT is of course measured as the difference between the time the packet was sent and the time the echo was received.
The RTT versus Time graph shows that most RTT samples are rather short, a few milliseconds — the median RTT is 4.04 milliseconds, and 95% of samples are echoed in less than 8ms. Out of 30,000 packets sent in 10 minutes, 38 were lost, about 0.12%. Some packets take a bit longer, with the 99th percentile at 50.7ms, which is somewhat concerning. But the obvious issues are the 18 spikes on the graph: 18 separate events during which the RTT exceeded 100ms, including 12 events with an RTT above 200ms.
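The statistics quoted above come from straightforward processing of the log. A sketch of how such numbers can be computed from a list of RTT samples; the sample list here is synthetic, and the percentile method is a simple nearest-rank approximation:

```python
def rtt_stats(samples, spike_threshold=100.0):
    """Median, p95, p99 and count of samples above a spike threshold."""
    s = sorted(samples)
    def pct(p):
        # Nearest-rank percentile, good enough for log analysis.
        return s[min(len(s) - 1, int(p * len(s)))]
    spikes = sum(1 for r in samples if r > spike_threshold)
    return pct(0.50), pct(0.95), pct(0.99), spikes

# Synthetic data: mostly ~4 ms echoes, a few slow ones, two spikes.
samples = [4.0] * 94 + [8.0, 20.0, 50.0, 80.0, 150.0, 250.0]
median, p95, p99, spikes = rtt_stats(samples)
print(median, p95, p99, spikes)
```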
The close-up graph shows a detailed view of a single spike. 14 packets were affected. The first one was lost, the second one was echoed after 250ms, and we see the RTT of the next 12 packets decreasing linearly from 250 ms to 4 ms. Looking at the raw data shows that these 13 packets were received just microseconds apart. Everything happens as if Wi-Fi transmission had been suspended for 250 ms, with packets queued during the suspension and delivered quickly when transmission resumed.
The previous graph looked at a “simple” spike happening 23 seconds after the start of the measurements. Simple events appear as narrow spikes in the “time line” graph. Some events are more complex. They appear on the graph as a combination of adjacent lines.
The next graph shows a close up of a series of spikes happening at short intervals. There are 14 such spikes, spread over a 3 second interval. Each spike has the same structure as the single spike described above: the network transmission appears to stop for an interval, and then packets are delivered. In one case, two spikes overlap. Spikes may have different durations, between 50 ms and 280 ms.
The RTT is the sum of two one-way delays: from the Mac to the PC, and back. The previous analysis concludes that the spikes happen when transmission stops, but that could be transmission from the Mac or from the PC. The one way delay graph shows that it actually happens in both directions. Out of the 18 spikes in the RTT timeline graph, 11 happen because transmission stopped on the Mac, 3 because it stopped on the PC, and 4 because it stopped on both. It seems that the PC and the Mac have similar Wi-Fi drivers, both creating occasional spikes, but that this happens almost twice as often on the Mac.
At this stage, we don’t know exactly what causes the Wi-Fi drivers to stop transmission. There are two plausible ideas: wireless drivers sometimes stop in order to save energy; or, wireless drivers sometimes stop operation on one frequency band in order to scan the other bands and locate alternative Wi-Fi routers. Of the two, the scanning hypothesis is the most likely. It would explain the “series of spikes” pattern, with the Wi-Fi radio briefly returning to the nominal frequency band between scans of multiple bands.
My next task will be to see how the QUIC stack in Picoquic can be adapted to mitigate the effects of this Wi-Fi behavior, for example by returning quickly to nominal conditions after the end of a spike. But even the best mitigation won’t change the fact that shutting down radios for a quarter of a second does nothing good for end to end latency. VoIP over Wi-Fi is not going to sound very good. That issue is for our colleagues at Apple and Microsoft to fix!
I just implemented in Picoquic the new ACK processing algorithm [proposed for QUIC multipath](https://github.com/quicwg/multipath/pull/217), which processes ACKs independently of the path over which they arrive. It looked good, but there was an interesting regression: the tests that simulated transmission over satellite links were failing. The previous version showed a file transfer concluding in less than 7 seconds, but with the new version it took about 10 seconds. That was strange, since the only change was the computation of the round trip time, and the logs showed that both versions computed the same value. To solve that, I had to take a look at traces captured in log files. The traces of the new execution looked reasonable, as shown in this graph:
We see the transmission accelerating steadily, just as expected from an implementation of the slow-start algorithm. The curve is very smooth. The congestion window doubles once per RTT, until it becomes large enough to saturate the simulated link, after about 7.5 seconds. And then there is a tail of transmission, including retransmission of packets sent at the end of the slow start period, for a total duration of almost 10 seconds. But the previous version was actually completing the transfer much faster, as seen in this other graph:
Spot the difference? The old curve was not as smooth. We see a series of segments at progressively higher speed, often starting with a vertical line that indicates many packets sent in quick succession. These packet trains are received by the other end at close to line speed, and the arrival of ACKs reflects that speed. Picoquic uses that information to compute an estimate of the path capacity. This allows the congestion window to grow much faster than with the regular slow start algorithm, letting the whole transmission complete in less than 7 seconds. But why did this not happen in the new variant?
If you compare the very beginning of the two curves, you will notice the small vertical lines at the beginning of each new roundtrip period in the old curve. They are missing in the new curve. It is hard to pinpoint the exact cause, but some detail changed, and then the whole behavior changed. That reminded me of the old story about a butterfly flapping its wings on a Pacific island, and next thing you know there is a typhoon approaching Hawaii. There was no butterfly here, just probably a tiny change in the timing and sequence of packets, but then the connection fell into a pattern where pacing enforced a form of ACK clocking, and the code never had a chance to properly estimate the bandwidth of the path.
I fixed it by forcing a pacing pause if the bandwidth estimation fails during slow start. The transmission only restarts after enough pacing credits have been accumulated for sending a long enough train. With that, the tests do complete in less than 7 seconds. But I am glad that the tests exposed the issue, which was indeed a bug. The butterfly flapping its wings and causing a typhoon is a metaphor for chaotic systems, in which tiny changes can have unforeseen consequences. The code behavior exposed here was chaotic, and that’s not good. Code should be predictable, and behavior should never be left to chance!
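The arithmetic behind that pacing pause is simple. This is a simplified model of the idea, not Picoquic's actual code; the train size and rate are made-up numbers:

```python
def pause_before_train(pacing_rate_bps, train_bytes):
    """How long to pause so that a train of train_bytes can be sent
    back-to-back once transmission resumes: the pause accumulates
    exactly the pacing credits needed for the full train."""
    return train_bytes * 8 / pacing_rate_bps

# To release a 10-packet train of 1440-byte packets at 10 Mbps,
# pause about 11.5 ms, then send the train at line speed so the
# receiver's ACK timing reflects the path capacity.
pause = pause_before_train(10_000_000, 10 * 1440)
print(round(pause * 1000, 1), "ms")
```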
A friend, Marc Blanchet, asked me last December whether it would be possible to use QUIC in space. Sure, the delays would be longer, but in theory it should be possible to scale the various time-related constants in the protocol, and then everything else should work. I waited to have some free time, and then I took the challenge, running a couple of simulations to see how Picoquic would behave on space links, such as between the Earth and Mars. I had already tested Picoquic on links with a 10 second round trip time (RTT), so there was hope.
First, I tried a simulation with a one minute one-way delay. A bit short of Mars, but a good first step. Of course, the first trial did not work, because Picoquic was programmed with a “handshake completion timer” of 30 seconds, and the Picoquic server was enforcing a maximum idle timer of 2 minutes. There was already an API to set the idle timer, so I used it to set a value of at least 3 times the maximum supported RTT. Then, I updated the code to keep the handshake going until the larger of the 30 second default timer and the value of the idle timer. And, success, the handshake did work in the simulation. However, it was very noisy.
At the beginning of the connection, client and server do not know the RTT. The QUIC spec says to repeat the Initial packet if a response does not arrive within a timer, starting with a short initial timer value (Picoquic uses 250ms), and doubling that value after every repeat. That’s a good exploration strategy, but Picoquic capped the timer at 1 second, so that there would be enough trials on average to succeed in the face of 30% packet loss — which meant repeating the Initial packet more than 120 times in our case. The fix was to make that cap a fraction of the idle timer value, which limits it to about a dozen transmissions in our test. Still big, but acceptable.
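The effect of the cap is easy to quantify with a small model: count Initial retransmissions before the first response can arrive, one full RTT after the first send. The 30-second value for the higher cap is an illustrative assumption standing in for "a fraction of the idle timer":

```python
def count_initial_repeats(rtt_seconds, initial_timer=0.25, cap=1.0):
    """Count Initial retransmissions fired before the response
    arrives, assuming the response to the first send takes one RTT."""
    elapsed, timer, repeats = 0.0, initial_timer, 0
    while elapsed + timer < rtt_seconds:
        elapsed += timer
        repeats += 1
        timer = min(timer * 2, cap)  # exponential backoff, capped
    return repeats

# A 60 s one-way delay means a 120 s RTT.
print(count_initial_repeats(120.0, cap=1.0))   # old 1 s cap: 121 repeats
print(count_initial_repeats(120.0, cap=30.0))  # cap tied to the idle timer
```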
After the handshake things get better, because both ends at that point have measured the RTT at least once. Most timer values used in the data transmission phase are proportional to this RTT, and they naturally adapt. The usual concern with long delay links is the duration of the slow start phase, during which the sender gradually increases the sending rate until the path bandwidth is assessed. The sending rate starts at a low value and is doubled every RTT, but for a 10 Mbps link that might require 5 or 6 RTTs. In our case, that would be 12 minutes before reaching full efficiency, which would not be good. But Picoquic already knew how to cope with that, because it was already tested on satellite links.
Picoquic uses “chirping” to rapidly discover the path capacity. During the first RTT, Picoquic sends a small train of packets, measures the time between first and last acknowledgement for that train, and gets a gross estimate of the link data rate. It then uses that estimate to accelerate the start-up algorithm (Picoquic uses Hystart), by propping up the sending rate. That works quite well for our long distance links, and we reach reasonable usage in 3 RTTs instead of 5. It could work even better if Picoquic used the full estimate provided by chirping, or maybe one derived from a previous connection, but estimates could be wrong, and we limit potential issues by only using half their value.
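The core of the chirping estimate can be sketched in a few lines: time the acknowledgements for a packet train and derive a rate. This is a simplified model; Picoquic's actual logic also has to deal with pacing, losses and ACK delays, and the train size and timings below are made up:

```python
def estimate_rate_bps(ack_times, bytes_acked):
    """Estimate path rate from the ACK timing of a packet train.

    ack_times: arrival times (seconds) of the first and last ACKs.
    bytes_acked: bytes newly acknowledged after the first ACK.
    """
    elapsed = ack_times[-1] - ack_times[0]
    return bytes_acked * 8 / elapsed

# ACKs for a 10-packet train of 1440-byte packets, with the last 9
# packets acknowledged 10.368 ms after the first: a 10 Mbps path.
rate = estimate_rate_bps([0.0, 0.010368], 9 * 1440)

# Only half the estimate is used, to limit the damage if it is wrong.
print(round(rate / 1e6, 1), "Mbps measured,", round(rate / 2e6, 1), "Mbps used")
```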
Chirping takes care of congestion control, at least during startup, but we also have to consider flow control. If the client asks to “get this 100MB file” but the flow control allows only 1MB, the transmission on a very long delay link is going to take a very long time. But if the client says something like “get this 100MB file and, by the way, here are an extra 100MB of flow control credits”, the transmission will happen much faster. This is what we do in the tests, but it will have to be somehow automated in practical deployments.
Once we have solved congestion control and flow control, we need to worry about timers. In QUIC, most timers are proportional to the RTT, but a few are not. The idle timer is preset before the measurement, as discussed above. The BBR algorithm specifies a “probe RTT” interval of 10 seconds, which would not be good, but Picoquic was already programmed to use the max of that and 3 RTT. The main issue in the simulation was the “retire connection ID (CID)” interval.
Picoquic is programmed to switch to a new CID when resuming transmission after a long silence. This is a privacy feature, because long silences often trigger a NAT rebinding. Changing the CID makes it harder for on path observers to correlate the newly observed packets to the previous connection. However, the “long silence” was defined as 5 seconds, which is way too short in our case. We had to change that and define it as the larger of 5 seconds and 3 times the RTT.
With these changes, our “60 seconds delay” experiment was successful. That was a happy result, but Marc pointed out that 60 seconds is not that long. It takes more than 3 minutes to send a signal from Earth to Mars when Mars is at the closest distance, and 22 minutes when Mars is at the furthest. Sending signals to Jupiter takes 32 minutes to almost an hour, and to Saturn more than an hour. What if we repeated the experiment by simulating a 20 minute delay? Would things explode?
In theory, the code was ready for this 20 minute trial, but in practice it did in fact explode. Picoquic measures time in microseconds. 20 minutes is 1,200,000,000 microseconds. Multiply by 4 and you get a number that does not fit on 32 bits! The tests quickly surfaced these issues, and they had to be fixed. But after those fixes the transmissions worked as expected.
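The arithmetic is easy to check; here is the overflow, with C-style 32-bit truncation simulated in Python:

```python
TWENTY_MINUTES_US = 20 * 60 * 1_000_000   # 1,200,000,000 microseconds

product = 4 * TWENTY_MINUTES_US           # 4,800,000,000: needs 33 bits
truncated = product & 0xFFFFFFFF          # what a uint32_t would keep

print(product)    # 4800000000
print(truncated)  # 505032704: the timer silently loses over 71 minutes
```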
I don’t know whether Picoquic will in fact be used in spaceships, but I found the exercise quite interesting. It reinforces my conviction that “if it is not tested, it does not work”. A bunch of little issues were found, which overall make the code more robust. And, well, one can always dream that QUIC will one day be used for transmissions between Earth and Mars.