Christian Huitema's blog

Another Picoquic bug, not limiting the size of TLS messages

05 Sep 2024

I just fixed another bad bug in Picoquic. The bug, found by Asakura Mizu, relates to unbounded buffering of TLS stream data.

The issue dates to the decision back in 2017 to embed TLS in QUIC, and to send the TLS data in a series of “crypto streams”. Initially, the crypto stream was just a data stream with the number 0, but after discussion we modeled the TLS stream as a set of 3 “crypto streams”, one for each of the three stages of the TLS handshake: initial mode, handshake mode, and application mode.

The crypto streams are a calque of the data stream, with two key differences: they use their own “CRYPTO_HS” frames, which are similar to but distinct from the stream data frames, and there is no flow control. Instead of flow control, Section 7.5 of [RFC9000](https://www.rfc-editor.org/rfc/rfc9000.html) suggests limiting “data received in out-of-order CRYPTO frames”. The idea is that data received “in order” can be fed directly to the TLS controller, and thus does not need to be buffered, so the limit only applies to “out of order” data. This is a bit more lenient than simply limiting the overall size of TLS messages, but not by much: if a TLS message is split into several packets and the first one is lost, the amount of out of order data will be more or less the size of the message.
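To make the rule concrete, here is a minimal sketch of such a check in C. The function name and parameters are hypothetical and this is not the actual Picoquic code. It also takes a conservative shortcut: instead of counting the bytes actually buffered, it bounds the span between the last byte delivered in order to TLS and the end of the incoming frame.

```c
#include <stdint.h>

/* Hypothetical sketch, not the Picoquic implementation.
 * delivered_offset: all crypto bytes below this offset were already
 *                   passed in order to the TLS stack.
 * limit:            local cap on out-of-order data, e.g. 4096.
 * offset, length:   fields of the incoming CRYPTO frame. Both are QUIC
 *                   varints below 2^62, so the sum cannot overflow.
 * Returns 1 if accepting the frame would exceed the cap (the caller would
 * then close the connection with CRYPTO_BUFFER_EXCEEDED), 0 otherwise. */
static int crypto_frame_exceeds_limit(uint64_t delivered_offset, uint64_t limit,
                                      uint64_t offset, uint64_t length)
{
    uint64_t end = offset + length;

    if (offset <= delivered_offset) {
        /* In-order data (or a duplicate): it can be fed directly to TLS,
         * so it does not count against the buffer. */
        return 0;
    }
    /* Out-of-order data: everything up to 'end' may have to be held
     * until the hole starting at delivered_offset is filled. */
    return (end - delivered_offset > limit) ? 1 : 0;
}
```

With a 4096-byte cap and nothing delivered yet, a 1200-byte frame at offset 3600 is rejected, which is exactly the case where the first packet of a large flight is lost and later packets arrive first.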

Somehow, I missed the memo and did not implement this “crypto flow control” in Picoquic. Thanks to Asakura Mizu for reminding me! The fix was easy: add a check on the amount of out of order data in the code, then add a test in the test suite to verify it. To write that test I had to use error injection, which should have been simple but needed a fix. Of course I made a mistake in writing this fix, but the fuzz test in the test suite caught the error. The usual.

And then, I found out that the 4096-byte limit suggested in Section 7.5 of [RFC9000](https://www.rfc-editor.org/rfc/rfc9000.html) caused one existing test to fail. That test verifies that the code can abide by the amplification limit of 3 specified in RFC 9000 even if the server’s first flight is very large: 8K in my test, requiring 8 packets. The test anticipates the large messages that may be needed if we handle the large keys of Post Quantum algorithms. The test also includes injection of packet losses, and thus almost 8K of “out of order crypto data”, which showed that the 4KB limit was too low. I bumped it to 16KB in the code, and that works now.

I can’t help thinking that this “maximum amount of out of order TLS data” should be exposed in the QUIC protocol. The current state is that each host picks its own limit, which means senders do not know exactly how much data they can send. It leads to a game of handshake roulette: send more data than the peer’s limit and it will work very well most of the time, but cause a connection failure if the wrong packet is lost. A sender that wants to be safe should treat the limit as a flow control window, initially set to the lowest common denominator (4096 bytes) and opening progressively as packets are acknowledged. That will work, but be very slow. Maybe we should have a transport parameter announcing the local value of the window. It would not help the very first packets, but it would help everything else. In particular, if the largest TLS message is the server’s first flight, knowing the client’s flow control limit would help send that flight quickly and safely.
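The “safe sender” behavior can be sketched as a small window computation. This is only an illustration with hypothetical names, assuming the sender tracks the offset up to which all crypto data has been acknowledged without gaps; it is not something Picoquic or RFC 9000 defines.

```c
#include <stdint.h>

/* Hypothetical sketch of a sender treating the peer's out-of-order
 * limit as a flow control window.
 * acked_contiguous: every crypto byte below this offset was acknowledged.
 * next_offset:      first crypto byte not yet sent.
 * pending_end:      end offset of the TLS data waiting to be sent.
 * window:           assumed peer limit, 4096 unless known to be larger.
 * Returns how many bytes may be sent now without risking a
 * CRYPTO_BUFFER_EXCEEDED error at the peer. */
static uint64_t crypto_sendable_bytes(uint64_t acked_contiguous, uint64_t next_offset,
                                      uint64_t pending_end, uint64_t window)
{
    /* The peer has delivered at least acked_contiguous bytes to TLS, so
     * data below acked_contiguous + window always fits in its buffer. */
    uint64_t limit = acked_contiguous + window;
    uint64_t end = (pending_end < limit) ? pending_end : limit;

    return (end > next_offset) ? end - next_offset : 0;
}
```

For an 8192-byte flight and the default 4096-byte window, the sender can emit 4096 bytes, then must wait for acknowledgements before sending more. If the peer had announced a 16384-byte window, the whole flight could go out at once, which is the case a transport parameter would enable.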

Comments

If you want to start or join a discussion on this post, the simplest way is to send a toot on the Fediverse/Mastodon to @huitema@social.secret-wg.org.