Reliable transmission

Computers are unreliable; networks made of computers are extra unreliable. On a large-scale network like the Internet, failure is a normal part of operation and must be accommodated. In a packet network, this means retransmission: if the client receives packets 1 and 3 but doesn't receive packet 2, it needs to ask the server to re-send the missing packet.
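
To make that bookkeeping concrete, here's a deliberately simplified sketch in Python. It models packets as small integers rather than TCP's byte-based sequence numbers, so it illustrates the idea of spotting a gap, not how TCP itself is implemented.

```python
# Simplified model: packets are numbered 1, 2, 3, ... rather than using
# TCP's byte-based sequence numbers.

def missing_packets(received_numbers):
    """Return the packet numbers that need to be re-requested."""
    highest = max(received_numbers)
    return [n for n in range(1, highest) if n not in received_numbers]

# The client received packets 1 and 3, but not 2.
print(missing_packets({1, 3}))  # [2]
```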

When receiving thousands of packets per second, as in our 88.5 MB video download, mistakes are almost guaranteed. To demonstrate that, let's return to my Wireshark capture of the download. For thousands of packets, everything goes normally: each packet specifies a "next sequence number", and the packet that arrives next carries that number.

Suddenly, something goes wrong. The 6,269th packet has a "next sequence number" of 7,208,745, but that packet never comes. Instead, a packet with sequence number 7,211,609 arrives. This is an out-of-order packet: something is missing.
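As a rough illustration of that mismatch, here are the numbers from the capture plugged into the check a receiver effectively performs (treating sequence numbers as plain integers and ignoring TCP's wraparound handling):

```python
expected_next_seq = 7_208_745   # announced by the 6,269th packet
arriving_seq = 7_211_609        # the packet that actually arrived next

if arriving_seq != expected_next_seq:
    print(f"out of order: expected {expected_next_seq:,}, "
          f"got {arriving_seq:,} "
          f"({arriving_seq - expected_next_seq:,} bytes missing)")
```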

We can't tell exactly what went wrong here. Maybe one of the intermediate routers on the Internet was overloaded. Maybe my local router was overloaded. Maybe someone turned a microwave on, introducing electromagnetic interference and slowing my wireless connection. In any case, the packet was lost and the only indication is the unexpected packet.

TCP has no special "I lost a packet!" message. Instead, ACKs are cleverly reused to indicate loss. Any out-of-order packet causes the receiver to re-ACK the last "good" packet – the last one in the correct order. In effect, the receiver is saying "I received packet 5, which I'm ACKing. I also received something after that, but I know it wasn't packet 6 because it didn't match the next sequence number in packet 5."
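Here's a simplified sketch of that receiver-side rule. The segment sizes below are invented for illustration, and selective acknowledgment (SACK) is ignored; the point is just that an out-of-order segment doesn't move the expected sequence number, so the same ACK number goes out again.

```python
def next_ack(expected_seq, segment_seq, segment_len):
    """Return the new expected sequence number, which is also the ACK to send."""
    if segment_seq == expected_seq:
        # In-order segment: the edge of received data advances.
        return expected_seq + segment_len
    # Out-of-order segment: the expected number doesn't move, so the
    # receiver sends the same ACK number again: a duplicate ACK.
    return expected_seq

expected = 1_000
for seq, length in [(1_000, 100), (1_200, 100), (1_300, 100)]:  # 1,100 is missing
    expected = next_ack(expected, seq, length)
    print(f"ACK {expected}")
# ACK 1100
# ACK 1100  (duplicate)
# ACK 1100  (duplicate)
```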

If two packets simply got switched in transit, this will result in a single extra ACK and everything will continue normally after the out-of-order packet is received. But if the packet was truly lost, unexpected packets will continue to arrive and the receiver will continue to send duplicate ACKs of the last good packet. This can result in hundreds of duplicate ACKs.

When the sender sees three duplicate ACKs in a row, it assumes that the packet after the one being repeatedly ACKed was lost and retransmits it. This is called TCP fast retransmit because it's faster than the older, timeout-based approach. It's interesting to note that the protocol itself doesn't have any explicit way to say "please retransmit this immediately!" Instead, multiple ACKs arising naturally from the protocol serve as the trigger.
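A sketch of the sender-side trigger might look like the following. `FastRetransmitState` is an illustrative name, not part of any real TCP stack, and the congestion-control changes that accompany a real fast retransmit are omitted.

```python
DUP_ACK_THRESHOLD = 3

class FastRetransmitState:
    def __init__(self):
        self.last_ack = None
        self.dup_count = 0

    def on_ack(self, ack_number):
        """Return the sequence number to retransmit, or None."""
        if ack_number == self.last_ack:
            self.dup_count += 1
            if self.dup_count == DUP_ACK_THRESHOLD:
                # Third duplicate ACK: assume the segment starting at
                # ack_number was lost and resend it immediately.
                return ack_number
        else:
            self.last_ack = ack_number
            self.dup_count = 0
        return None

sender = FastRetransmitState()
# One normal ACK followed by three duplicates, using the expected
# sequence number from the capture above.
for ack in [7_208_745, 7_208_745, 7_208_745, 7_208_745]:
    lost_segment = sender.on_ack(ack)
    if lost_segment is not None:
        print(f"fast retransmit: resend segment starting at {lost_segment:,}")
```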

(An interesting thought experiment: what happens if some of the duplicate ACKs are lost, never reaching the sender?)

Retransmission is common even in networks working normally. In a capture of our 88.5 MB video download, I saw this:

  • The congestion window quickly increases to about a megabyte due to continuing successful transmission.
  • A few thousand packets show up in order; everything is normal.
  • One packet comes out of order.
  • Data continues pouring in at megabytes per second, but the packet is still missing.
  • My machine sends dozens of duplicate ACKs of the last known-good packet, but the kernel also stores the pending out-of-order packets for later reassembly.
  • The server receives the duplicate ACKs and resends the missing packet.
  • My client ACKs both the previously-missing packet and the later ones that were already received due to out-of-order transmission. This is done by simply ACKing the most recent packet, which implicitly ACKs all earlier ones as well (sketched after this list).
  • The transfer continues, but with a reduced congestion window due to the lost packet.
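
Those last steps, buffering out-of-order segments and then covering everything with one cumulative ACK, can be sketched like this. The byte offsets are invented for illustration, and a real reassembly queue tracks byte ranges rather than a simple dict of segment lengths.

```python
def deliver(expected_seq, buffered, segment_seq, segment_len):
    """Buffer the segment, then advance past any now-contiguous data.

    The returned value is the ACK number to send: it acknowledges the
    missing segment and everything buffered after it in one go.
    """
    buffered[segment_seq] = segment_len
    while expected_seq in buffered:
        expected_seq += buffered.pop(expected_seq)
    return expected_seq

buffered = {}
expected = 1_000

# Segments at 1,100 and 1,200 arrive while the one at 1,000 is missing...
for seq, length in [(1_100, 100), (1_200, 100)]:
    expected = deliver(expected, buffered, seq, length)
    print(f"ACK {expected}")        # still ACK 1000 both times

# ...then the retransmitted segment at 1,000 finally shows up.
expected = deliver(expected, buffered, 1_000, 100)
print(f"ACK {expected}")            # ACK 1300: covers all three segments
```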

This is normal; it's happened in every capture of the full download that I've done. TCP is so successful at its job that we don't even think of networks as being unreliable in our daily use, even though they fail routinely under normal conditions.

This is one section of The Programmer's Compendium's article on Network Protocols, which contains more details and context.