07 Feb 2023
A friend, Marc Blanchet, asked me last December whether it would be possible to use QUIC in space. Sure, the delays would be longer, but in theory it should be possible to scale the various time-related constants in the protocol, and then everything else should work. I waited to have some free time, and then I took the challenge, running a couple of simulations to see how Picoquic would behave on space links, such as between the Earth and Mars. I had already tested Picoquic on links with a 10 second round trip time (RTT), so there was hope.
First, I tried a simulation with a one minute one-way delay. A bit short of Mars, but a good first step. Of course, the first trial did not work, because Picoquic was programmed with a “handshake completion timer” of 30 seconds, and the Picoquic server was enforcing a maximum idle timer of 2 minutes. There was already an API to set the idle timer, so I used it to set a value of at least 3 times the maximum supported RTT. Then, I updated the code to keep the handshake going until the larger of the 30 second default timer and the value of the idle timer. And, success, the handshake did work in the simulation. However, it was very noisy.
At the beginning of the connection, client and server do not know the RTT. The QUIC spec says to repeat the Initial packet if a response does not arrive within a timer, starting with a short initial timer value (Picoquic uses 250ms), and doubling that value after every repeat. That’s a good exploration, but Picoquic capped the timer at 1 second, so that there would be enough trials on average to succeed in the face of 30% packet loss. With a 2 minute RTT, that meant repeating the Initial packet more than 120 times in our case. The fix was to make that cap a fraction of the idle timer value, which limits the repeats to about a dozen transmissions in our test. Still big, but acceptable.
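To make the arithmetic concrete, here is a minimal sketch of the backoff loop. The function name is hypothetical and this is not Picoquic's code. The 250ms initial timer, the 1 second cap and the 2 minute RTT come from the text above; the 1/16 fraction of the idle timer is purely an assumption chosen for illustration.

```python
def count_initial_transmissions(rtt_s: float, first_timer_s: float, cap_s: float) -> int:
    """Count Initial packets sent before the first response can arrive (one RTT)."""
    elapsed = 0.0
    timer = first_timer_s
    count = 0
    while elapsed < rtt_s:
        count += 1                     # send (or repeat) the Initial packet
        elapsed += timer               # wait for the timer to expire
        timer = min(2 * timer, cap_s)  # double the timer, up to the cap
    return count


RTT_S = 120.0               # one minute each way
IDLE_TIMEOUT_S = 3 * RTT_S  # idle timer set to 3 times the RTT

# Old behavior: timer capped at 1 second.
old = count_initial_transmissions(RTT_S, 0.25, cap_s=1.0)
# New behavior: cap is a fraction of the idle timer (1/16 is an assumed value).
new = count_initial_transmissions(RTT_S, 0.25, cap_s=IDLE_TIMEOUT_S / 16)
print(old, new)
```

Under these assumptions the count drops from 122 transmissions to 11, which matches the "more than 120" and "about a dozen" orders of magnitude.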
After the handshake things get better, because both ends at that point have measured the RTT at least once. Most timer values used in the data transmission phase are proportional to this RTT, and they naturally adapt. The usual concern with long delay links is the duration of the slow start phase, during which the sender gradually increases the sending rate until the path bandwidth is assessed. The sending rate starts at a low value and is doubled every RTT, but for a 10 Mbps link that might require 5 or 6 RTT. In our case, with a 2 minute RTT, that would be 12 minutes before reaching full efficiency, which would not be good. But Picoquic already knew how to cope with that, because it was already tested on satellite links.
Picoquic uses “chirping” to rapidly discover the path capacity. During the first RTT, Picoquic sends a small train of packets, measures the time between first and last acknowledgement for that train, and gets a rough estimate of the link data rate. It then uses that estimate to accelerate the start-up algorithm (Picoquic uses Hystart), by propping up the sending rate. That works quite well for our long distance links, and we reach reasonable usage in 3 RTT instead of 5. It could work even better if Picoquic used the full estimate provided by chirping, or maybe derived from a previous connection, but estimates could be wrong and we limit potential issues by only using half their value.
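The idea can be sketched in a few lines. This is only an illustration of the principle described above, not Picoquic's implementation: the function names, the packet size and the acknowledgement times are all made up for the example.

```python
from typing import Sequence


def estimate_rate_bps(ack_times_s: Sequence[float], packet_bytes: int) -> float:
    """Rough data rate from the spread between first and last ACK of a packet train."""
    if len(ack_times_s) < 2:
        raise ValueError("need at least two acknowledgements")
    spread_s = ack_times_s[-1] - ack_times_s[0]
    if spread_s <= 0:
        raise ValueError("acknowledgements must be spread over time")
    # The first packet only marks the start of the interval, so count the others.
    bits = (len(ack_times_s) - 1) * packet_bytes * 8
    return bits / spread_s


def startup_rate_bps(estimate_bps: float) -> float:
    """Use only half of the estimate, in case the measurement is wrong."""
    return estimate_bps / 2


# Hypothetical train: 11 packets of 1250 bytes, ACKs arriving 1 ms apart.
acks = [i / 1000 for i in range(11)]
estimate = estimate_rate_bps(acks, packet_bytes=1250)
print(round(estimate), round(startup_rate_bps(estimate)))
```

In this made-up example the train suggests a link of roughly 10 Mbps, and the start-up algorithm would be seeded with about half of that.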
Chirping takes care of congestion control, at least during startup, but we also have to consider flow control. If the client asks to “Get this 100MB file” but the flow control allows only 1MB, the transmission on a very long delay link is going to take a very long time. But if the client says something like “get this 100MB file and, by the way, here are an extra 100MB of flow control credits”, the transmission will happen much faster. This is what we do in the tests, but it will have to be somehow automated in practical deployments.
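A back-of-the-envelope model shows the size of the difference. It is deliberately simplified: it ignores the propagation delay of the request itself and the start-up phase, and it assumes that a sender limited by flow control moves at most one window of credits per RTT. The function name is hypothetical, and the 10 Mbps link rate is borrowed from the slow start discussion above.

```python
def transfer_time_s(file_bytes: float, credit_bytes: float, rtt_s: float, link_bps: float) -> float:
    """Approximate transfer time under a flow control limit (simplified model)."""
    link_bytes_per_s = link_bps / 8
    if file_bytes <= credit_bytes:
        # Enough credits for the whole file: the sender is never blocked.
        return file_bytes / link_bytes_per_s
    # Otherwise throughput is bounded by one window of credits per RTT.
    return file_bytes / min(link_bytes_per_s, credit_bytes / rtt_s)


FILE_BYTES = 100e6   # the 100MB file
RTT_S = 120.0        # one minute each way
LINK_BPS = 10e6      # assumed 10 Mbps link

slow = transfer_time_s(FILE_BYTES, 1e6, RTT_S, LINK_BPS)    # only 1MB of credits
fast = transfer_time_s(FILE_BYTES, 100e6, RTT_S, LINK_BPS)  # 100MB of credits up front
print(round(slow / 60), "minutes versus", round(fast), "seconds")
```

With these assumptions the 1MB limit stretches the transfer to about 200 minutes, while granting the credits up front brings it down to about 80 seconds of pure transmission time.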
Once we have solved congestion control and flow control, we need to worry about timers. In QUIC, most timers are proportional to the RTT, but a few are not. The idle timer is preset before the measurement, as discussed above. The BBR algorithm specifies a “probe RTT” interval of 10 seconds, which would not be good, but Picoquic was already programmed to use the max of that and 3 RTT. The main issue in the simulation was the “retire connection ID (CID)” interval.
Picoquic is programmed to switch to a new CID if resuming transmission after a long silence. This is a privacy feature, because long silences often trigger a NAT rebinding. Changing the CID makes it harder for on-path observers to correlate the newly observed packets to the previous connection. However, the “long silence” was defined as 5 seconds, which is way too short in our case. We had to change that and define it as the larger of 5 seconds and 3 times the RTT.
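Both fixes follow the same pattern: keep the fixed value as a floor, and let the timer grow with the RTT. A minimal sketch, with hypothetical names that are not Picoquic's, and with times in microseconds:

```python
PROBE_RTT_INTERVAL_US = 10_000_000  # BBR "probe RTT" interval: 10 seconds
CID_SILENCE_US = 5_000_000          # "long silence" before switching CID: 5 seconds


def scaled_timer_us(floor_us: int, rtt_us: int, rtt_multiple: int = 3) -> int:
    """Return the larger of a fixed floor and a multiple of the measured RTT."""
    return max(floor_us, rtt_multiple * rtt_us)


terrestrial_rtt_us = 50_000    # 50 ms, an assumed typical Internet RTT
space_rtt_us = 120_000_000     # 2 minutes, as in the 60 second delay experiment

for rtt_us in (terrestrial_rtt_us, space_rtt_us):
    print(
        scaled_timer_us(PROBE_RTT_INTERVAL_US, rtt_us) // 1_000_000,
        scaled_timer_us(CID_SILENCE_US, rtt_us) // 1_000_000,
    )
```

On an ordinary path the fixed values of 10 and 5 seconds still apply; on the simulated space link both timers stretch to 360 seconds.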
With these changes, our “60 seconds delay” experiment was successful. That was a happy result, but Marc pointed out that 60 seconds is not that long. It takes more than 3 minutes to send a signal from Earth to Mars when Mars is at the closest distance, and 22 minutes when Mars is at the furthest. Sending signals to Jupiter takes 32 minutes to almost an hour, and to Saturn more than an hour. What if we repeated the experiment by simulating a 20 minute delay? Would things explode?
In theory, the code was ready for this 20 minute trial, but in practice it did in fact explode. Picoquic measures time in microseconds. 20 minutes is 1,200,000,000 microseconds. Multiply by 4 and you get 4,800,000,000, a number that does not fit in 32 bits! The tests quickly surfaced these issues, and they had to be fixed. But after those fixes the transmissions worked as expected.
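The overflow is easy to reproduce. Python integers do not overflow, so the sketch below emulates an unsigned 32-bit variable by masking. How exactly the value was truncated inside Picoquic is not stated here, so the wrap-around shown is only one plausible failure mode.

```python
US_PER_SECOND = 1_000_000
UINT32_MAX = 2**32 - 1  # 4,294,967,295


def as_uint32(value: int) -> int:
    """Emulate storing a value in an unsigned 32-bit variable (wraps modulo 2**32)."""
    return value & UINT32_MAX


delay_us = 20 * 60 * US_PER_SECOND  # 1,200,000,000: still fits in 32 bits
scaled_us = 4 * delay_us            # 4,800,000,000: does not fit
wrapped_us = as_uint32(scaled_us)   # what a 32-bit variable would hold

print(delay_us <= UINT32_MAX, scaled_us <= UINT32_MAX, wrapped_us)
```

The wrapped value is 505,032,704 microseconds, so a timer meant to last 80 minutes would fire after about 8 and a half minutes.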
I don’t know whether Picoquic will in fact be used in spaceships, but I found the exercise quite interesting. It reinforces my conviction that “if it is not tested, it does not work”. A bunch of little issues were found, which overall make the code more robust. And, well, one can always dream that QUIC will one day be used for transmissions between Earth and Mars.