Christian Huitema's blog


QUIC timeouts and Handshake Interop

19 Jan 2024

Marten Seemann made a great contribution to QUIC interoperability by setting up the QUIC interop runner. The site runs a series of interoperability tests between participating QUIC implementations (17 of them as I write this) and reports the results in a large matrix. It is a nice complement to the internal tests of the implementations, and it flagged an interesting issue: the test L1 was failing between the ngtcp2 client and the picoquic server.

The test codenamed L1 verifies that implementations can successfully establish connections in the presence of high packet loss. The test consists of 50 successive connection attempts, each followed by the download of a short 1KB document. The connections are run over a network simulation programmed to drop 30% of packets. The test succeeds if all connections succeed and all 50 documents are retrieved.

In the “ngtcp2 to picoquic” tests, all documents were properly downloaded, but the analysis of traffic showed 51 connection attempts instead of the expected 50, and thus the test was marked failing. It took me a while to parse the various logs and understand why this was happening, but it turned out to be a timeout issue. One of the 50 tests ran like this:

1. NGTCP2 client sends an “Initial” packet to start the connection.
2. Picoquic server receives the packet, creates a connection context, and sends a response.
3. The simulator drops that response.
4. After loss detection timers expire, NGTCP2 repeats the Initial message.
5. In fact, NGTCP2 tries to repeat the Initial packet multiple times, but each time, the simulator drops the message.
6. After a “handshake timer” set by default to 30 seconds, Picoquic deletes the context that it created on reception of the first message.
7. NGTCP2 repeats the Initial packet one more time, and this time it is delivered.
8. Picoquic does not find any existing context for that incoming packet, so it creates a new connection.
9. The second connection succeeds.

Nobody is really at fault here: NGTCP2 behaves exactly as the standard mandates, and it is perfectly legal for the Picoquic server to drop contexts after a period without activity. In fact, servers should do just that in case of DoS attacks. But explaining to the testers that “we are failing your test because it is too picky” is kind of hard. There was a simpler fix: just configure Picoquic to use a longer handshake timer, 180 seconds instead of 30. With that, the context is still present when the finally successful repeat packet arrives. Picoquic creates just one connection, and everybody is happy.
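To make the race concrete, here is a minimal Python sketch of the server side. It is a toy model, not Picoquic code: the function name `count_server_contexts` is hypothetical, the arrival times are purely illustrative, and it assumes that a context is deleted a fixed delay after its creation whenever the handshake has not completed.

```python
def count_server_contexts(arrival_times_ms, handshake_timeout_ms):
    """Count the contexts a simplified server creates for one client's Initial packets.

    Assumptions (not taken from Picoquic): a context is deleted
    handshake_timeout_ms after it was created, and the handshake does not
    complete before the last arrival in the list.
    """
    contexts = 0
    created_at = None  # creation time of the context currently alive, if any
    for t in sorted(arrival_times_ms):
        # No context yet, or the previous one has already been deleted:
        # the server has to create a new one for this Initial packet.
        if created_at is None or t - created_at >= handshake_timeout_ms:
            contexts += 1
            created_at = t
    return contexts


# Illustrative arrivals: the first Initial gets through at t=0, every repeat
# is dropped until one is finally delivered about 38 seconds later.
arrivals_ms = [0, 38_000]

print(count_server_contexts(arrivals_ms, 30_000))   # prints 2: context expired, new one created
print(count_server_contexts(arrivals_ms, 180_000))  # prints 1: original context is still there
```

With the 30 second timer the late repeat lands on a server that has already forgotten the client, which is what produces the extra connection attempt in the traffic analysis.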

But still, Picoquic was using a short handshake timer for a reason: if connections are failing, it makes sense to clean them up quickly. The L1 test between Picoquic client and server was passing despite the short timers, because Picoquic’s loss recovery process is more aggressive than what the standard specifies. The standard specifies a conservative strategy that uses “exponential backoff”, doubling the value of the timer after each failure, for the following timeline:

Time (standard)	Number	Timeout(ms)
0	1	300
300	2	600
900	3	1200
2100	4	2400
4500	5	4800
9300	6	9600
18900	7	19200
38100	8	38400
76500	9	76800
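The doubling rule is easy to reproduce. The following is a minimal sketch, assuming the 300 ms initial timeout used in the table; the function name `backoff_timeline` is made up for the illustration and does not come from any QUIC implementation.

```python
def backoff_timeline(initial_timeout_ms=300, attempts=9):
    """Return (send time, transmission number, timeout) rows for pure exponential backoff."""
    rows = []
    now = 0
    timeout = initial_timeout_ms
    for number in range(1, attempts + 1):
        rows.append((now, number, timeout))
        now += timeout   # the next transmission happens when this timer expires
        timeout *= 2     # and the timer doubles after each failure
    return rows


for time_ms, number, timeout_ms in backoff_timeline():
    print(f"{time_ms}\t{number}\t{timeout_ms}")
```

Because the timeouts form a geometric series, transmission number n is sent at initial_timeout × (2^(n−1) − 1) milliseconds after the first one.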

Picoquic deviates from that strategy, as discussed in Suspending the Exponential Backoff. The timeline is much more aggressive:

Time (picoquic)	Number	Timeout(ms)
0	1	250
250	2	250	Not doubling on first PTO
500	3	500
1000	4	1000
2000	5	1875	Cap timer to 1/16th of 30s timer
3875	6	1875
5750	7	1875
7625	8	1875
9500	9	1875
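The same kind of sketch works for this schedule. The rules below are inferred from the annotations in the table (250 ms initial timeout, no doubling on the first PTO, timer capped at 1/16th of the handshake timer); they are not copied from the Picoquic source, and `picoquic_like_timeline` is a hypothetical name.

```python
def picoquic_like_timeline(handshake_timeout_ms, initial_timeout_ms=250, attempts=9):
    """Return (send time, transmission number, timeout) rows for the capped schedule."""
    cap = handshake_timeout_ms // 16   # timer never exceeds 1/16th of the handshake timer
    rows = []
    now = 0
    timeout = initial_timeout_ms
    for number in range(1, attempts + 1):
        effective = min(timeout, cap)
        rows.append((now, number, effective))
        now += effective
        if number >= 2:
            timeout *= 2               # doubling only starts after the first PTO
    return rows


for time_ms, number, timeout_ms in picoquic_like_timeline(30_000):
    print(f"{time_ms}\t{number}\t{timeout_ms}")
```

Calling the function with `180_000` instead of `30_000` raises the cap from 1875 ms to 11250 ms, so the doubling continues for longer before the timer levels off.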

After configuring the handshake timer to 180 seconds, the Picoquic sequence is still more aggressive than the standard, but the difference is smaller:

Time (picoquic)	Number	Timeout(ms)
0	1	250
250	2	250	Not doubling on first PTO
500	3	500
1000	4	1000
2000	5	2000
4000	6	4000
8000	7	8000
16000	8	11250	Cap timer to 1/16th of 180s timer
27250	9	11250

In our test, not being much more aggressive than the peer did produce the behavior that the testers expected. In real life, I think that the intuitions developed in the previous blog post still hold. It is just that for the test, we have to please the protocol police…

Comments

If you want to start or join a discussion on this post, the simplest way is to send a toot on the Fediverse/Mastodon to @huitema@social.secret-wg.org.