IP fragmentation

1. IP fragmentation

IP fragmentation is an Internet Protocol (IP) process that breaks packets into smaller pieces (fragments), so that the resulting pieces can pass through a link with a smaller maximum transmission unit (MTU) than the original packet size. The fragments are reassembled by the receiving host.

RFC 791 describes the procedure for IP fragmentation, and for the transmission and reassembly of IP packets.^[1] RFC 815 describes a simplified reassembly algorithm.^[2] Fragmentation and reassembly rely on several fields of the IP header: the Identification field (together with the source and destination internet addresses and the protocol ID), the Fragment Offset field, and the Don't Fragment and More Fragments flags.^[1]^:24^[2]^:9

If a receiving host receives a fragmented IP packet, it has to reassemble the packet and pass it to the higher protocol layer. Reassembly is intended to happen in the receiving host, but in practice it may be done by an intermediate router. For example, network address translation (NAT) may need to reassemble fragments in order to translate data streams.^[3]

How big an IP packet can be is an old question, and the IPv4 RFCs answer it pretty clearly. The idea was to split the problem into two separate concerns:

What is the maximum packet size that can be handled by operating systems on both ends?
What is the maximum permitted datagram size that can be safely pushed through the physical connections between the hosts?

When a packet is too big for a physical link, an intermediate router might chop it into multiple smaller datagrams in order to make it fit. This process is called "forward" IP fragmentation and the smaller datagrams are called IP fragments^[1].
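
The arithmetic of forward fragmentation can be sketched as follows. This is a minimal illustration rather than a router implementation: the function name `fragment_sizes` is hypothetical, the header is assumed to be the option-free 20-byte IPv4 header, and the one rule it relies on from RFC 791 is that fragment offsets are counted in 8-byte units, so every fragment except the last must carry a multiple of 8 payload bytes.

```python
def fragment_sizes(total_length, mtu, header_len=20):
    """Split an IPv4 datagram (total_length includes the header) for a link MTU.

    Returns a list of (offset_in_8_byte_units, payload_bytes, more_fragments).
    Assumes a fixed header of header_len bytes on every fragment (no IP options).
    """
    payload = total_length - header_len
    if total_length <= mtu:
        return [(0, payload, False)]  # fits as-is, no fragmentation needed
    # Largest payload per fragment, rounded down to a multiple of 8 bytes.
    max_chunk = (mtu - header_len) // 8 * 8
    if max_chunk <= 0:
        raise ValueError("MTU too small to carry any payload")
    fragments = []
    sent = 0
    while sent < payload:
        chunk = min(max_chunk, payload - sent)
        more_fragments = sent + chunk < payload
        fragments.append((sent // 8, chunk, more_fragments))
        sent += chunk
    return fragments


# A 4000-byte datagram crossing a link with a 1500-byte MTU:
print(fragment_sizes(4000, 1500))
# [(0, 1480, True), (185, 1480, True), (370, 1020, False)]
```

Each fragment repeats the IP header, so the three fragments above occupy 4040 bytes on the wire instead of the original 4000.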


The IPv4 specification defines the minimal requirements. From RFC 791:

"Every internet destination must be able to receive a datagram of 576 octets either in one piece or in fragments to be reassembled. [...] Every internet module must be able to forward a datagram of 68 octets without further fragmentation. [...]"

The first value, the permitted reassembled packet size, is typically not problematic. IPv4 defines the minimum as 576 bytes, but popular operating systems can cope with very big packets, typically up to the 65,535-byte maximum allowed by the 16-bit Total Length field.

The second one is more troublesome. All physical connections have inherent datagram size limits, depending on the specific medium they use. For example, Frame Relay can send datagrams between 46 and 4,470 bytes, ATM uses fixed 53-byte cells, and classical Ethernet carries payloads of between 46 and 1,500 bytes (frames of 64 to 1,518 bytes).

The spec defines the minimal requirement: each physical link must be able to transmit datagrams of at least 68 bytes. For IPv6 that minimal value has been raised to 1,280 bytes (see RFC 2460).

On the other hand, the maximum datagram size that can be transmitted without fragmentation is not defined by any specification and varies by link type. This value is called the MTU (Maximum Transmission Unit)^[2].

The MTU defines a maximum datagram size on a local physical link. The internet is built from non-homogeneous networks, and on the path between two hosts there might be links with smaller MTU values. The maximum packet size that can be transmitted without fragmentation between two remote hosts is called the Path MTU. It equals the smallest MTU of any link on the path, and can potentially be different for every connection.

2. Impact of fragmentation on network forwarding

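When a network has multiple parallel paths, technologies such as link aggregation (LAG) and Cisco Express Forwarding (CEF) split traffic across the paths according to a hash algorithm. One goal of the algorithm is to ensure that all packets of the same flow are sent out the same path, to minimize unnecessary packet reordering. Fragments complicate this: only the first fragment carries the transport-layer header, so a hash that includes port numbers may send the later fragments of the same packet down a different path.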

IP fragmentation can cause excessive retransmissions when fragments encounter packet loss, because reliable protocols such as TCP must retransmit all of the fragments in order to recover from the loss of a single fragment.^[5] Thus, senders typically use one of two approaches to decide the size of IP packets to send over the network. The first is for the sending host to send an IP packet of size equal to the MTU of the first hop of the source-destination pair. The second is to run the path MTU discovery algorithm^[6] to determine the path MTU between two IP hosts, so that IP fragmentation can be avoided.

Fragmentation also reduces reliability and causes several related problems:

To successfully reassemble a packet, all fragments must be delivered. No fragment can become corrupt or get lost in flight. There simply is no way to notify the other party about missing fragments!
The last fragment will almost never have the optimal size. For large transfers this means a significant part of the traffic will be composed of suboptimal short datagrams, a waste of precious router resources.
Before reassembly, a host must hold partial, fragmented datagrams in memory. This opens an opportunity for memory exhaustion attacks.
Subsequent fragments lack the higher-layer header. The TCP or UDP header is only present in the first fragment. This makes it impossible for firewalls to filter fragmented datagrams based on criteria like source or destination ports.
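
The reliability and memory concerns listed above can be made concrete with a small reassembly buffer. This is a simplified sketch, not the RFC 815 algorithm: the `Reassembler` class is hypothetical, offsets are given in bytes rather than 8-byte units, overlapping fragments are not handled, and there is no reassembly timeout (which is exactly why an incomplete datagram stays in memory here).

```python
class Reassembler:
    """Collect fragments per datagram key and return the payload once complete."""

    def __init__(self):
        # key -> {"parts": {byte_offset: data}, "total": payload length or None}
        self.buffers = {}

    def add(self, key, offset, data, more_fragments):
        buf = self.buffers.setdefault(key, {"parts": {}, "total": None})
        buf["parts"][offset] = data
        if not more_fragments:
            # Only the last fragment reveals the total payload length.
            buf["total"] = offset + len(data)
        if buf["total"] is None:
            return None
        payload = b""
        for off in sorted(buf["parts"]):
            if off != len(payload):
                return None  # a hole remains: keep waiting and keep holding memory
            payload += buf["parts"][off]
        if len(payload) != buf["total"]:
            return None
        del self.buffers[key]  # complete: release the buffer
        return payload


# The key mirrors the fields used for reassembly: source, destination, protocol, ID.
key = ("192.0.2.1", "198.51.100.7", 17, 4242)
r = Reassembler()
r.add(key, 8, b"B" * 8, True)    # middle fragment arrives first
r.add(key, 16, b"C" * 4, False)  # last fragment arrives; first is still missing
print(len(r.buffers))            # 1: partial datagram is still held in memory
print(r.add(key, 0, b"A" * 8, True))
```

If the first fragment never arrived, the entry would stay in `buffers` indefinitely in this sketch; real stacks bound this with a timer and memory limits.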

A more elaborate description of IP fragmentation problems can be found in articles by Geoff Huston.

3. Avoiding fragmentation: Don't Fragment and ICMP "Packet too big"


A solution to these problems was included in the IPv4 protocol. A sender can set the DF (Don't Fragment) flag in the IP header, asking intermediate routers never to perform fragmentation of a packet. Instead a router with a link having a smaller MTU will send an ICMP message "backward" and inform the sender to reduce the MTU for this connection.

TCP implementations typically set the DF flag. The network stack looks carefully for incoming "Packet too big"^[3] ICMP messages and keeps track of the "path MTU" characteristic for every connection^[4]. This technique is called "path MTU discovery", and it is most commonly used for TCP, although it can also be applied to other IP-based protocols. Being able to deliver the ICMP "Packet too big" messages is critical in keeping the TCP stack working optimally.
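
The feedback loop of path MTU discovery can be illustrated with a small simulation. It is a sketch under simplifying assumptions: `discover_path_mtu` is a hypothetical helper, the path is modelled as a plain list of link MTUs (first hop first), every ICMP message is assumed to arrive, and each message is assumed to report the MTU of the link that rejected the packet.

```python
def discover_path_mtu(link_mtus):
    """Simulate DF-based path MTU discovery over a list of link MTUs.

    Returns (path_mtu, icmp_messages_received).
    """
    estimate = link_mtus[0]  # start from the first-hop MTU
    icmp_messages = 0
    while True:
        for mtu in link_mtus:
            if estimate > mtu:
                # The router in front of this link drops the DF packet and
                # reports the smaller MTU back to the sender.
                estimate = mtu
                icmp_messages += 1
                break
        else:
            # The packet crossed every link: the estimate is the path MTU.
            return estimate, icmp_messages


print(discover_path_mtu([1500, 1492, 1400, 1500]))  # (1400, 2)
```

The result always equals the smallest link MTU on the path. If any of the ICMP messages were filtered, the loop in a real stack would never learn the smaller value, which is the failure mode discussed in the rest of this document.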

4. How the internet actually works

In a perfect world, internet connected devices would cooperate and correctly handle fragment datagrams and the associated ICMP packets. In reality though, IP fragments and ICMP packets are very often filtered out.

This is because the modern internet is much more complex than anticipated when RFC 791 was published in 1981. Today, basically nobody is plugged directly into the public internet.

Customer devices connect through home routers which do NAT (Network Address Translation) and usually enforce firewall rules. Increasingly often there is more than one NAT installation on the packet path (e.g. carrier-grade NAT). Then, the packets hit the ISP infrastructure where there are ISP "middle boxes". They perform all manner of weird things on the traffic: enforce plan caps, throttle connections, perform logging, hijack DNS requests, implement government-mandated web site bans, force transparent caching or arguably "optimize" the traffic in some other magical way. The middle boxes are used especially by mobile telcos.

Similarly, there are often multiple layers between a server and the public internet. Service providers sometimes use Anycast BGP routing. That is, they announce the same IP ranges from multiple physical locations around the world. Within a datacenter, on the other hand, it's increasingly popular to use ECMP (Equal-Cost Multi-Path) routing for load balancing.

Each of these layers between a client and server can cause a Path MTU problem. Allow me to illustrate this with four scenarios.

1. Client -> Server DF+ / ICMP

In the first scenario, a client uploads some data to the server using TCP so the DF flag is set on all of the packets. If the client fails to predict an appropriate MTU, an intermediate router will drop the big packets and send an ICMP “Packet too big” notification back to the client. These ICMP packets might get dropped by misconfigured customer NAT devices or ISP middle boxes.

A bigger issue is with certain mobile ISPs with weird middle boxes. These often completely ignore ICMP and perform very aggressive connection rewriting. For example Orange Polska not only ignores inbound "Packet too big" ICMP messages, but also rewrites the connection state and clamps the MSS to a non-negotiable 1344 bytes.

2. Client -> Server DF- / fragmentation

In the next scenario, a client uploads some data with a protocol other than TCP, which has the DF flag cleared. For example, this might be a user playing a game using UDP, or having a voice call. The big outbound packets might get fragmented at some point in the path.

There are multiple reasons why servers might mishandle fragments, but one popular problem is the use of ECMP load balancing. Due to the ECMP hashing, the first fragment, which contains the protocol header, is likely to be load-balanced to a different server than the rest of the fragments, preventing reassembly.
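
The hashing problem can be sketched in a few lines. Everything here is illustrative: `ecmp_key` and `pick_backend` are hypothetical helpers, SHA-256 stands in for whatever hash a real router uses, and packets are modelled as dictionaries. The point is only that the first fragment and the later fragments present different inputs to a hash that includes ports.

```python
import hashlib


def ecmp_key(packet, use_ports=True):
    """Build the tuple that a (hypothetical) ECMP hash is computed over."""
    key = (packet["src"], packet["dst"], packet["proto"])
    # Only the first fragment carries the TCP/UDP header, so only it has ports.
    if use_ports and "sport" in packet:
        key += (packet["sport"], packet["dport"])
    return key


def pick_backend(key, n_backends):
    """Map a key to a backend index with a stable hash (arbitrary choice of SHA-256)."""
    digest = hashlib.sha256(repr(key).encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_backends


first = {"src": "192.0.2.10", "dst": "198.51.100.5", "proto": 17,
         "sport": 40000, "dport": 53}
later = {"src": "192.0.2.10", "dst": "198.51.100.5", "proto": 17}  # no UDP header

# With ports in the hash, the two fragments are hashed over different keys,
# so nothing guarantees they reach the same backend.
print(pick_backend(ecmp_key(first), 8), pick_backend(ecmp_key(later), 8))

# Hashing on addresses and protocol only keeps all fragments together.
print(pick_backend(ecmp_key(first, use_ports=False), 8),
      pick_backend(ecmp_key(later, use_ports=False), 8))
```

Dropping ports from the hash keeps fragments together but spreads traffic less evenly, since all flows between the same two addresses then share one path.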

Furthermore, server and router misconfiguration is a significant issue. According to RFC 7872, between 30% and 55% of servers drop IPv6 datagrams containing a fragmentation header.

3. Server -> Client DF+ / ICMP

The next scenario is about a client downloading some data over TCP. When the server fails to predict the correct MTU, it should receive an ICMP “Packet too big” message. Easy, right?

Sadly, it's not, again due to ECMP routing. The ICMP message will most likely get delivered to the wrong server - the 5-tuple hash of ICMP packet will not match the 5-tuple hash of the problematic connection. We wrote about this in the past, and developed a simple userspace daemon to solve it. It works by broadcasting the inbound ICMP “Packet too big” notification to all the ECMP servers, hoping that the one with the problematic connection will see it.

Additionally due to Anycast routing, the ICMP might be delivered to the wrong datacenter altogether! Internet routing is often asymmetric and the best path from an intermediate router might direct the ICMP packets to the wrong place.

Missing ICMP “Packet too big” notifications can result in connections stalling and timing out. This is often called a PMTU blackhole. To help in this pessimistic case, Linux implements a workaround: MTU Probing (RFC 4821). MTU Probing tries to automatically identify packets dropped due to the wrong MTU, and uses heuristics to tune it. This feature is controlled via a sysctl:

$ echo 1 > /proc/sys/net/ipv4/tcp_mtu_probing

But MTU probing is not without its own issues. First, it tends to miscategorize congestion-related packet loss as MTU issues. Long running connections tend to end up with a reduced MTU. Secondly, Linux does not implement MTU Probing for IPv6.

4. Server -> Client DF- / fragmentation

Finally, there is a situation where the server sends big packets using a non-TCP protocol with the DF bit clear. In this scenario, the big packets will get fragmented on the path to the client. This situation is best illustrated with big DNS responses. Here are two DNS requests that will generate large responses and be delivered to the client as multiple IP fragments:

$ dig +notcp +dnssec DNSKEY org @199.19.56.1
$ dig +notcp +dnssec DNSKEY org @2001:500:f::1

These requests might fail due to the already mentioned misconfigured home routers, broken NAT, broken ISP installations, or overly restrictive firewall settings.

According to Boer and Bosma around 6% of IPv4 and 10% of IPv6 hosts block inbound fragment datagrams.

We have described the problems with detecting Path MTU values on the internet. ICMP and fragmented datagrams are often blocked on both sides of the connection. Clients can encounter misconfigured firewalls and NAT devices, or use ISPs that aggressively intercept connections. Clients also often use VPNs or IPv6 tunnels which, when misconfigured, can cause path MTU issues.

Servers, on the other hand, increasingly rely on Anycast or ECMP. Both of these, as well as router and firewall misconfiguration, often cause ICMP and fragmented datagrams to be dropped.

^[1] In IPv6 the "forward" fragmentation works slightly differently than in IPv4. Intermediate routers are prohibited from fragmenting packets, but the source can still do it. This is often confusing: a host might be asked to fragment a packet that it transmitted in the past. This makes little sense for stateless protocols like DNS.
^[2] On a side note, there also exists a "minimum transmission unit"! In commonly used Ethernet framing, each transmitted datagram must have at least 64 bytes on Layer 2. This translates to 22 bytes on UDP and 10 bytes on TCP layer. Multiple implementations used to leak uninitialized memory on shorter packets!
^[3] Strictly speaking, in IPv4 the ICMP packet is named "Destination Unreachable, Fragmentation Needed and Don't Fragment was Set". But I find the IPv6 ICMP error description "Packet too big" much clearer.
^[4] As a hint, TCP stacks also include a maximum allowed "MSS" value in SYN packets (the MSS is basically the MTU value reduced by the size of the IP and TCP headers). This lets each host learn the MTU of the other host's link. Note that this says nothing about the MTU of the dozens of internet links between the two hosts!

To err on the safe side, a better MTU is 1492, which accommodates DSL and PPPoE connections (PPPoE adds 8 bytes of overhead to a 1,500-byte Ethernet payload).

5. Solution

While the IETF draft "IP Fragmentation Considered Fragile" identifies issues associated with IP fragmentation, it does not recommend deprecation. Some applications (see Section 6 of the draft) require IP fragmentation. Furthermore, fragmentation is expected to work in limited domains where security and interoperability issues can be addressed.

Rather than deprecating IP fragmentation, the draft recommends that upper-layer protocols address the problem of fragmentation at their layer, reducing their reliance on IP fragmentation to the greatest degree possible.

5.1 Alternatives to IP Fragmentation

5.1.1 Transport Layer Solutions

The Transmission Control Protocol (TCP) can be operated in a mode that does not require IP fragmentation.

Applications submit a stream of data to TCP. TCP divides that stream of data into segments, with no segment exceeding the TCP Maximum Segment Size (MSS). Each segment is encapsulated in a TCP header and submitted to the underlying IP module. The underlying IP module prepends an IP header and forwards the resulting packet.

If the TCP MSS is sufficiently small, the underlying IP module never produces a packet whose length is greater than the actual PMTU. Therefore, IP fragmentation is not required.
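
As a worked example of "sufficiently small", the MSS that keeps a segment within a given MTU can be computed directly. The sketch assumes option-free headers (20 bytes for IPv4, 40 bytes for IPv6, 20 bytes for TCP), and the helper name `mss_for_mtu` is made up for illustration. IP or TCP options would reduce the usable payload further.

```python
IPV4_HEADER = 20  # bytes, without options
IPV6_HEADER = 40  # bytes, fixed header without extension headers
TCP_HEADER = 20   # bytes, without options


def mss_for_mtu(mtu, ipv6=False):
    """Largest TCP payload that fits in one packet of the given MTU."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    return mtu - ip_header - TCP_HEADER


print(mss_for_mtu(1500))             # 1460: classical Ethernet, IPv4
print(mss_for_mtu(1500, ipv6=True))  # 1440: classical Ethernet, IPv6
print(mss_for_mtu(1280, ipv6=True))  # 1220: IPv6 minimum link MTU
print(mss_for_mtu(576))              # 536: IPv4 minimum reassembly size
```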

TCP offers the following mechanisms for MSS management:

Manual configuration
Path MTU Discovery (PMTUD)
Packetization Layer Path MTU Discovery (PLPMTUD)

Manual configuration is always applicable. If the MSS is configured to a sufficiently low value, the IP layer will never produce a packet whose length is greater than the protocol minimum link MTU. However, manual configuration prevents TCP from taking advantage of larger link MTUs.

Upper-layer protocols can implement PMTUD in order to discover and take advantage of larger path MTUs. However, as mentioned in Section 2.1 of the draft, PMTUD relies upon the network to deliver ICMP Packet Too Big (PTB) messages. Therefore, PMTUD is applicable only in environments where the risk of ICMP PTB loss is acceptable.

By contrast, PLPMTUD does not rely upon the network's ability to deliver ICMP PTB messages. However, in many loss-based TCP congestion control algorithms, the dropping of a packet may cause the algorithm to shrink the congestion window, or even to restart the entire slow-start process. For high-capacity, long round-trip time, large-volume TCP streams, the deliberate probing with large packets and the consequent packet drops may impose too harsh a penalty on total TCP throughput for it to be a viable approach. [RFC4821] defines PLPMTUD procedures for TCP.

While TCP will never cause the underlying IP module to emit a packet that is larger than the PMTU estimate, it can cause the underlying IP module to emit a packet that is larger than the actual PMTU. If this occurs, the packet is dropped, the PMTU estimate is updated, the segment is divided into smaller segments and each smaller segment is submitted to the underlying IP module.
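
The probing idea behind PLPMTUD can be sketched as a search that needs no ICMP feedback at all. This is only an illustration: RFC 4821 does not prescribe this exact search strategy, the function `plpmtud_search` and the callback `probe_ok` are hypothetical, and a real implementation must also distinguish probe loss from congestion loss, which this sketch ignores.

```python
def plpmtud_search(probe_ok, low, high):
    """Binary-search the largest packet size that gets through.

    probe_ok(size) -> True if a probe of that size was acknowledged.
    low  : a size already known to work
    high : the largest size worth trying (for example the first-hop MTU)
    Returns (discovered_size, probes_sent).
    """
    probes = 0
    while low < high:
        candidate = (low + high + 1) // 2  # round up so the loop always progresses
        probes += 1
        if probe_ok(candidate):
            low = candidate        # probe was acknowledged: raise the floor
        else:
            high = candidate - 1   # probe was lost: lower the ceiling
    return low, probes


# Simulated path whose real MTU is 1400 bytes; lost probes stand in for silent drops.
size, probes = plpmtud_search(lambda s: s <= 1400, low=1280, high=1500)
print(size)  # 1400
```

Every failed probe in this loop is a deliberately lost packet, which is the throughput cost the preceding paragraphs warn about.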

The Datagram Congestion Control Protocol (DCCP) and the Stream Control Transmission Protocol (SCTP) can also be operated in a mode that does not require IP fragmentation. They both accept data from an application and divide that data into segments, with no segment exceeding a maximum size. Both DCCP and SCTP offer manual configuration, PMTUD and PLPMTUD as mechanisms for managing that maximum size. [I-D.ietf-tsvwg-datagram-plpmtud] proposes PLPMTUD procedures for DCCP and SCTP.

Currently, the User Datagram Protocol (UDP) lacks a fragmentation mechanism of its own and relies on IP fragmentation. However, [I-D.ietf-tsvwg-udp-options] proposes a fragmentation mechanism for UDP.

5.1.2 Application Layer Solutions

[RFC8085] recognizes that IP fragmentation reduces the reliability of Internet communication. It also recognizes that UDP lacks a fragmentation mechanism of its own and relies on IP fragmentation. Therefore, [RFC8085] offers the following advice regarding applications that run over UDP:

"An application SHOULD NOT send UDP datagrams that result in IP packets that exceed the Maximum Transmission Unit (MTU) along the path to the destination. Consequently, an application SHOULD either use the path MTU information provided by the IP layer or implement Path MTU Discovery (PMTUD) itself to determine whether the path to a destination will support its desired message size without fragmentation."

RFC 8085 continues:

"Applications that do not follow the recommendation to do PMTU/PLPMTUD discovery SHOULD still avoid sending UDP datagrams that would result in IP packets that exceed the path MTU. Because the actual path MTU is unknown, such applications SHOULD fall back to sending messages that are shorter than the default effective MTU for sending (EMTU_S in [RFC1122]). For IPv4, EMTU_S is the smaller of 576 bytes and the first-hop MTU. For IPv6, EMTU_S is 1280 bytes. The effective PMTU for a directly connected destination (with no routers on the path) is the configured interface MTU, which could be less than the maximum link payload size. Transmission of minimum-sized UDP datagrams is inefficient over paths that support a larger PMTU, which is a second reason to implement PMTU discovery."

RFC 8085 assumes that for IPv4, an EMTU_S of 576 is sufficiently small, even though the IPv4 minimum link MTU is 68 bytes.

This advice applies equally to applications that run directly over IP.
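
The fallback sizes quoted from RFC 8085 translate into concrete UDP payload limits. The sketch below only restates that arithmetic; the helper names are hypothetical, and the header sizes assume no IPv4 options and no IPv6 extension headers.

```python
UDP_HEADER = 8  # bytes


def default_emtu_s(first_hop_mtu, ipv6=False):
    """Default effective MTU for sending when the path MTU is unknown."""
    if ipv6:
        return 1280
    return min(576, first_hop_mtu)


def max_udp_payload(mtu, ipv6=False):
    """Largest UDP payload that fits in one unfragmented packet of this MTU."""
    ip_header = 40 if ipv6 else 20
    return mtu - ip_header - UDP_HEADER


print(max_udp_payload(default_emtu_s(1500)))                        # 548
print(max_udp_payload(default_emtu_s(1500, ipv6=True), ipv6=True))  # 1232
```

An application that cannot run PMTUD or PLPMTUD could cap its messages at these payload sizes, at the cost of efficiency on paths that support larger packets.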

5.2 Recommendations

5.2.1 For Application Developers

Application developers SHOULD NOT develop new applications that rely on IP fragmentation.

Application-layer protocols that depend upon IPv6 fragmentation SHOULD be updated to break that dependency. This can be achieved by using a sufficiently small MTU (e.g. the protocol minimum link MTU), disabling fragmentation, and ensuring that the transport protocol in use adapts its segment size to that MTU. This would avoid the problem of PMTUD failure described in Section 4.6 of the draft. Another approach is to use PLPMTUD in a way suitable for the transport protocol in use (e.g. [I-D.ietf-tsvwg-datagram-plpmtud] for UDP).

5.2.2 For System Developers

Software libraries SHOULD include provision for PLPMTUD for each supported transport protocol.

5.2.3 For Middle Box Developers

Middle box developers SHOULD implement devices that support IP fragmentation. These boxes SHOULD NOT fail or cause failures when processing fragmented IP packets.

For example, in order to support IP fragmentation, a load balancer might execute the following procedure:

Receive a fragmented packet
Identify a next-hop using information drawn from the first fragment (i.e., the fragment containing offset 0)
Forward the first fragment and all subsequent fragments through the above-mentioned next-hop
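
The three steps above might be sketched as follows. This is an assumption-laden illustration, not a reference design: `FragmentAwareBalancer` is a hypothetical class, packets are dictionaries, CRC32 is an arbitrary stand-in for the balancer's hash, and the sketch neither expires state nor queues fragments that arrive before the first one.

```python
import zlib


class FragmentAwareBalancer:
    """Send every fragment of a datagram to the next hop chosen for its first fragment."""

    def __init__(self, next_hops):
        self.next_hops = list(next_hops)
        # (src, dst, proto, ident) -> next hop chosen from the first fragment
        self.flows = {}

    def route(self, pkt):
        frag_key = (pkt["src"], pkt["dst"], pkt["proto"], pkt["ident"])
        if pkt["offset"] == 0:
            # The first fragment carries the transport header, so ports are available.
            five_tuple = (pkt["src"], pkt["dst"], pkt["proto"],
                          pkt["sport"], pkt["dport"])
            index = zlib.crc32(repr(five_tuple).encode()) % len(self.next_hops)
            hop = self.next_hops[index]
            if pkt["more_fragments"]:
                self.flows[frag_key] = hop  # remember the choice for later fragments
            return hop
        # Later fragments reuse the remembered choice; None means the first
        # fragment has not been seen yet and the caller has to hold the packet.
        return self.flows.get(frag_key)


lb = FragmentAwareBalancer(["server-a", "server-b", "server-c"])
base = {"src": "192.0.2.10", "dst": "198.51.100.5", "proto": 17, "ident": 7}
first = dict(base, offset=0, more_fragments=True, sport=40000, dport=53)
second = dict(base, offset=185, more_fragments=False)
print(lb.route(first) == lb.route(second))  # True
```

A production design would also need a timeout for entries in `flows`, since the table otherwise grows with every fragmented datagram.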

5.2.4 For Network Operators

As per RFC 4890, network operators MUST NOT filter ICMPv6 PTB messages unless they are known to be forged or otherwise illegitimate. As stated in Section 4.6 of the draft, filtering ICMPv6 PTB packets causes PMTUD to fail. Operators MUST ensure proper PMTUD operation in their network, including making sure that the network generates PTB packets when it drops packets that exceed the outgoing interface MTU.

Many upper-layer protocols rely on PMTUD.

6. References

IP fragmentation

https://en.wikipedia.org/wiki/IP_fragmentation#Impact_of_fragmentation_on_network_forwarding

Broken packets: IP fragmentation is flawed

https://blog.cloudflare.com/ip-fragmentation-is-broken/

IP Fragmentation Considered Fragile

https://tools.ietf.org/id/draft-ietf-intarea-frag-fragile-02.html