Christian Huitema's blog

Cloudy sky, waves on the sea, the sun is
shining

Having fun and surprises with IPv6

03 Mar 2018

(Corrected March 4, 2018)

I am participating in the standardization of the QUIC protocol. That's why I am writing a prototype implementation of the new transport protocol: Picoquic. And the development involves regular testing against other prototypes, the result of which are shown in this test result matrix. This is work in progress on a complex protocol development, so when we test we certainly expect to find bugs and glitches, either because the spec is not yet fully clear, or because somewhat did not read it correctly. But this morning, I woke up to find an interesting message from Patrick McManus, who is also developing an implementation of QUIC at Mozilla. Something is weird, he said. The first data message that your server sends, with sequence number N, always arrives before the final handshake message, with sequence number N-1. That inversion appears to happen systematically.

We both wondered for some time what kind of bug this was, until finally we managed to get a packet capture of the data:

And then, we were able to build a theory. The exchange was over IPv6. Upon receiving a connection request from Patrick's implementation, it was sending back a handshake packet. In QUIC, the handshake packet responds to the connection request with a TLS "server hello" message and associated extensions, which set the encryption keys for the connection. Immediately after that, my implementation was sending its first data packet, which happens to be an MTU probe. This is a test message with the largest plausible length. If it is accepted, the connection can use message of that length. If it is not, it will have to use shorter packet.

It turns out that the messages that the implementation was sending were indeed a bit long, and they triggered an interesting behavior. My test server runs in a virtual machine in AWS. AWS exposes an Ethernet interface to virtual machines~~, but applications cannot use the full Ethernet packet length, 1536 bytes. The AWS machinery probably use some form of tunneling, which reduces the available size~~. The test message that I was sending was 1518 bytes, ~~but that was still too long~~ which is larger than the 1500 bytes MTU. ~~Some router on the path, probably AWS,~~ The IPv6 fragmentation code in the Linux kernel splits it in two: a large initial fragment, 1496 byte long, and a small second fragment 78 bytes long. You could think that fragmentation is no big deal, since fragments would just be reassembled at the destination, but you would be wrong.

Some routers on the path try to be helpful. They have learned from past experience that short packets often carry important data, and so they try to route them faster than long data packets. And here is what happens in our case:

The server prepares and send a Handshake packet, 590 bytes long.
The server then prepares the MTU probe, 1518 bytes long.
The MTU probe is split into fragment 1, 1496 bytes, and fragment 2, 78 bytes.
The handshake and the long fragment are routed on the normal path, but the small fragment is routed at a higher priority level.
The Linux driver at the destination receives the small fragment first. It queues everything behind that until it receives the long fragment.
The Linux driver passes the reassembled packet to the application, which cannot do anything with it because the encryption keys can only be obtained from the handshake packet.
The Linux driver then passes the handshake packet to the application.

And it took the two of us the best part of the day to explore possible failure modes in our code before we understood that. Which confirms an old opinion. When routers try to be smart and helpful, they end up being dumb and harmful. Please just send the packets in the order you get them!

More explanations, after 1 day of comments and further inquiries

It seems clear now that the fragmentation happened at the source, in the Linux kernel. This leaves one remaining issue, the out of order delivery.

There are in fact two separate out of order delivery issues. One is having the second fragment arrive between the first one, and the second is having the MTU probe arrive before the previously sent Handshake packet. The inversion between the segments may be due to code in the Linux kernel that believes that sending the last segment first speeds up reassembly at the receiver. The inversion between the fragmented MTU probe and the Handshake packet has two plausible causes:

Some router on the path may be forwarding the small fragment at a higher priority level.
Some routers may be implementing equal cost multipath, and then placing fragmented packets and regular packets into separate hash buckets

The summary for developers, and for QUIC in particular, is that we should really avoid triggering IPv6 fragmentation. It can lead to packet losses when NATs and firewalls cannot find the UDP payload type and the port numbers in the fragments. And it can also lead to out of order delivery as we just saw. And for my own code, the lesson is simple. I really need to set up the IPv6 Don't Fragment option when sending MTU probes, per section 11.2 of RFC 3542.