19 Jan 2024
Marten Seemann made a great contribution to QUIC interoperability by setting up the
QUIC interop runner. The site runs a series of interoperability
tests between participating QUIC implementations (17 of them as I write this) and
reports the results in a large matrix. It is a nice complement to the internal tests
of the implementations, and it was flagging an interesting issue:
the test L1 was failing between the ngtcp2 client and the picoquic server.
The test codenamed L1 verifies that implementations can successfully establish connections
in the presence of high packet loss. The test consists of 50 successive connection attempts,
each followed by the download of a short 1 KB document. The connections are run over a network
simulation programmed to drop 30% of packets. The test succeeds if all connections succeed
and all 50 documents are retrieved.
In the “ngtcp2 to picoquic” tests, all documents were properly downloaded, but the analysis of traffic showed 51 connection attempts instead of the expected 50, and thus the test was marked failing. It took me a while to parse the various logs and understand why this was happening, but it turned out to be a timeout issue. One of the 50 tests ran like this:
Nobody is really at fault here: ngtcp2 behaves exactly as the standard mandates, and it is perfectly legal for the Picoquic server to drop contexts after a period of inactivity. In fact, servers should do just that in case of DoS attacks. But explaining to the testers that “we are failing your test because it is too picky” is kind of hard. There was a simpler fix: just configure Picoquic to use longer timers, 180 seconds instead of 30. With that, the context is still present when the finally successful repeat packet arrives. Picoquic creates just one connection, and everybody is happy.
But still, Picoquic was using a short handshake timer for a reason: if connections are failing,
it makes sense to clean them up quickly. The L1 test between Picoquic client and server
was passing despite the short timers, because Picoquic’s loss recovery process is more
aggressive than what the standard specifies. The standard specifies a conservative strategy
that uses “exponential backoff”, doubling the value of the timer after each failure,
for the following timeline:
Time (ms, standard) | Number | Timeout (ms) |
---|---|---|
0 | 1 | 300 |
300 | 2 | 600 |
900 | 3 | 1200 |
2100 | 4 | 2400 |
4500 | 5 | 4800 |
9300 | 6 | 9600 |
18900 | 7 | 19200 |
38100 | 8 | 38400 |
76500 | 9 | 76800 |
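The doubling rule is easy to sketch in a few lines of Python. This is only an illustration, assuming the 300 ms initial timeout used in the table; the function name `backoff_timeline` is made up for this post and does not come from any QUIC implementation. Each send time is simply the sum of all the timeouts that came before it.

```python
def backoff_timeline(initial_timeout_ms=300, transmissions=9):
    """Return (send_time_ms, number, timeout_ms) rows for plain exponential backoff.

    Hypothetical helper: the timer doubles after every failed transmission.
    """
    rows = []
    send_time = 0
    timeout = initial_timeout_ms
    for number in range(1, transmissions + 1):
        rows.append((send_time, number, timeout))
        send_time += timeout  # the next packet goes out when this timer expires
        timeout *= 2          # exponential backoff: double after each failure
    return rows


for send_time, number, timeout in backoff_timeline():
    print(f"{send_time} | {number} | {timeout} |")
```

With this rule the n-th transmission leaves at 300 × (2^(n-1) − 1) ms, so the later repeats are spaced tens of seconds apart.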
Picoquic deviates from that strategy, as discussed in Suspending the Exponential Backoff. The timeline is much more aggressive:
Time (ms, picoquic) | Number | Timeout (ms) | Notes |
---|---|---|---|
0 | 1 | 250 | |
250 | 2 | 250 | Not doubling on first PTO |
500 | 3 | 500 | |
1000 | 4 | 1000 | |
2000 | 5 | 1875 | Cap timer to 1/16th of 30s timer |
3875 | 6 | 1875 | |
5750 | 7 | 1875 | |
7625 | 8 | 1875 | |
9500 | 9 | 1875 | |
After configuring the handshake timer to 180 seconds, the Picoquic sequence is still more aggressive than the standard, but the difference is smaller:
Time (ms, picoquic) | Number | Timeout (ms) | Notes |
---|---|---|---|
0 | 1 | 250 | |
250 | 2 | 250 | Not doubling on first PTO |
500 | 3 | 500 | |
1000 | 4 | 1000 | |
2000 | 5 | 2000 | |
4000 | 6 | 4000 | |
8000 | 7 | 8000 | |
16000 | 8 | 11250 | Cap timer to 1/16th of 180s timer |
27250 | 9 | 11250 | |
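The two Picoquic timelines follow from the same three rules noted in the tables: start at 250 ms, do not double on the first PTO, then double up to a cap of 1/16th of the handshake timer. Here is a minimal Python sketch of those rules. It is not Picoquic's actual code; the function name `capped_backoff_timeline` and its parameters are assumptions made for illustration only.

```python
def capped_backoff_timeline(handshake_timeout_ms, initial_timeout_ms=250,
                            transmissions=9):
    """Return (send_time_ms, number, timeout_ms) rows for a capped backoff.

    Hypothetical sketch of the rules described in the tables, not Picoquic code.
    """
    cap = handshake_timeout_ms // 16  # 1875 ms for 30 s, 11250 ms for 180 s
    rows = []
    send_time = 0
    timeout = initial_timeout_ms
    for number in range(1, transmissions + 1):
        rows.append((send_time, number, timeout))
        send_time += timeout
        if number >= 2:
            # Doubling starts only after the second transmission times out,
            # and the timer never grows beyond the cap.
            timeout = min(timeout * 2, cap)
    return rows


for handshake_timeout_ms in (30_000, 180_000):
    print(f"handshake timer: {handshake_timeout_ms} ms")
    for send_time, number, timeout in capped_backoff_timeline(handshake_timeout_ms):
        print(f"{send_time} | {number} | {timeout} |")
```

The only difference between the two runs is the cap, which is why the sequences are identical for the first four transmissions and diverge afterwards.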
In our test, it seems that not being much more aggressive than the peer did result in the behavior that the testers expected. In real life, I think that the intuitions developed in the previous blog post still hold. It is just that for the test, we have to please the protocol police…
If you want to start or join a discussion on this post, the simplest way is to send a toot on the Fediverse/Mastodon to @huitema@social.secret-wg.org.