Re: "IP PMTU discovery"

You should copy the network developers' mailing list, netdev@vger, on this.

On 11/29/06, Brian Candler <B.Candler@xxxxxxxxx> wrote:
Hello,

I'd like to raise a point for discussion, and start by giving a short C
program to demonstrate it. It just sends a UDP datagram to 1.2.3.4 and exits.

-------- 8< -----------------------------------
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    int s;
    char buf[] = "abc";
    int buflen = 3;
    struct in_addr t;
    struct sockaddr_in to = {0};    /* zero the unused padding fields */

    if ((s = socket (PF_INET, SOCK_DGRAM, 0)) < 0) {
        perror("socket");
        exit(1);
    }

    to.sin_family = AF_INET;
    t.s_addr = htonl(0x01020304);
    memcpy(&to.sin_addr, &t.s_addr, 4);
    to.sin_port = htons(9999);

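    /* Report a local send failure instead of silently ignoring it. */
    if (sendto (s, buf, buflen, 0,
                (struct sockaddr *) &to, sizeof (to)) < 0) {
        perror("sendto");
        exit(1);
    }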

    return 0;
}
-------- 8< -----------------------------------

When run under FreeBSD, it sends a UDP packet with DF=0. But when run under
Linux, it sends a UDP packet with DF=1.

My issue is that this causes many UDP-based applications to fail under Linux
if they want to send packets bigger than the smallest MTU along the path. I
was bitten by this once when working on L2TP, and have been bitten by it
again when working with SIP. So what's the reasoning behind it?

I'm aware that many operating systems, including FreeBSD and Linux,
implement something called "TCP Path MTU Discovery". This is a well-defined
and documented service (RFC 1191). When an application asks for TCP PMTU
discovery, it's asking the kernel to adjust the TCP MSS automatically so
that fragmentation is avoided. This gives a useful improvement in
efficiency, works most of the time, and is hence usually enabled by default.
Most importantly, it is *completely transparent to the application*.

Linux appears to implement something more amorphous called "IP Path MTU
Discovery". It is enabled by default:

$ cat /proc/sys/net/ipv4/ip_no_pmtu_disc
0

As far as I can tell, what it does is set the DF (Don't Fragment) bit on
every IP datagram. In other words, if the application asked for this, it
would be saying: "I'd rather this datagram was never delivered at all than
it had to be fragmented and reassembled".
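
As a rough sketch, a program can ask the kernel which mode a given socket is
in. The function name pmtudisc_mode is made up for illustration; the
IP_MTU_DISCOVER option and the IP_PMTUDISC_* values are Linux-specific, so
the code assumes they may be absent elsewhere.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Hypothetical helper: return the socket's IP_MTU_DISCOVER mode
 * (for example IP_PMTUDISC_WANT or IP_PMTUDISC_DONT), or -1 on error
 * or on systems that do not define the option. */
int pmtudisc_mode(int s)
{
#ifdef IP_MTU_DISCOVER
    int val = 0;
    socklen_t len = sizeof(val);

    if (getsockopt(s, IPPROTO_IP, IP_MTU_DISCOVER, &val, &len) < 0)
        return -1;
    return val;
#else
    (void) s;           /* option not available on this platform */
    return -1;
#endif
}
```

Calling it on a freshly created UDP socket shows the default that the sysctl
above selects for new sockets.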

This seems to be a very strange "service" for the kernel to provide to the
application, and it's very definitely not transparent.

If the packet is too large for the interface it's leaving from, the send()
call will fail. However, if it hits a smaller MTU downstream, I don't think
the application will know unless it checks for ICMP "Fragmentation Needed"
messages.
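
One thing an application can do on Linux, sketched here under assumptions, is
ask the kernel for the path MTU it currently knows about. The ip(7) manual
page documents an IP_MTU socket option that is valid only with getsockopt()
and only on a connected socket. The function name known_path_mtu is invented
for this example, and the sketch does not show when the kernel updates the
value after an ICMP error arrives.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Hypothetical helper: return the kernel's currently known path MTU for a
 * connected socket, or -1 on error (including an unconnected socket) or
 * where IP_MTU is not defined. */
int known_path_mtu(int s)
{
#ifdef IP_MTU
    int mtu = 0;
    socklen_t len = sizeof(mtu);

    if (getsockopt(s, IPPROTO_IP, IP_MTU, &mtu, &len) < 0)
        return -1;
    return mtu;
#else
    (void) s;           /* option not available on this platform */
    return -1;
#endif
}
```

A UDP sender would have to connect() the socket to its peer first, then
compare the returned value with the size of the datagram it wants to send.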

Even if the application detects the problem, it may or may not be able to do
anything about it. Some protocols may have their own way of splitting
messages between multiple datagrams. Some applications may be able to switch
transports entirely: a DNS resolver could switch from using UDP to TCP, for
instance. A SIP client in theory could do this, but in practice many SIP
peers don't support TCP. However, something like an L2TP relay has
absolutely nothing it can do about it.

Now, it seems to me that the Linux kernel is taking an unnecessarily
pessimistic view on the evilness of fragmentation.

As I understand it, the philosophy of IP is that it is a network protocol
which hides from the user the details of the underlying media which carry
the datagrams, and allows these different media to interconnect. Using IP
addresses instead of native layer 2 addressing is one way of providing this
abstraction; the fragmentation mechanism is another. If an application wants
to send a datagram of up to 64KB, it is entitled to do so without regard to
the capabilities of the underlying network(s) it may traverse.

Fragmentation comes at a cost of course. There's some work involved at the
point where fragmentation takes place (often the sender, in the case of
someone sending SIP or L2TP packets over 1500 bytes); there's a cost in
buffering and reassembly at the recipient; and the total data size
increases. However, I don't think any of these are unreasonable. The
alternative that Linux offers is simply to blackhole the packet entirely.
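
To put a number on the size increase, here is a small sketch of the
arithmetic. It assumes a 20-byte IPv4 header with no options, and the function
names are invented for illustration. Every fragment except the last must carry
a multiple of 8 bytes, and each extra fragment repeats the IP header.

```c
#include <stddef.h>

#define IPV4_HEADER 20  /* assumption: IPv4 header without options */

/* Number of IPv4 fragments needed to carry ip_payload bytes (UDP header
 * plus data) over a link with the given MTU. Returns 0 if the MTU is too
 * small to carry even one 8-byte fragment block. */
size_t ipv4_fragment_count(size_t ip_payload, size_t mtu)
{
    size_t last_cap, per_fragment;

    if (mtu < IPV4_HEADER + 8)
        return 0;
    last_cap = mtu - IPV4_HEADER;       /* last fragment: any length */
    per_fragment = (last_cap / 8) * 8;  /* others: multiple of 8 bytes */
    if (ip_payload <= last_cap)
        return 1;                       /* fits without fragmentation */
    return 1 + (ip_payload - last_cap + per_fragment - 1) / per_fragment;
}

/* Extra header bytes on the wire caused by fragmentation. */
size_t ipv4_fragment_overhead(size_t ip_payload, size_t mtu)
{
    size_t n = ipv4_fragment_count(ip_payload, mtu);
    return n > 1 ? (n - 1) * IPV4_HEADER : 0;
}
```

For example, 2000 bytes of UDP data make an IP payload of 2008 bytes once the
8-byte UDP header is added. With a 1500-byte MTU that is two fragments and 20
extra bytes on the wire.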

Now, for those applications which cannot tolerate having their packets
blackholed, there are a couple of workarounds that I'm aware of:

(1) You can disable "IP PMTU discovery" globally. However, that also
disables TCP PMTU discovery, which is a desirable service.

(2) You can modify the code explicitly to ask the kernel not to provide this
"value add service" on a particular socket:

--- testsock.c  2006-04-10 14:50:14.000000000 +0100
+++ testsock-linux.c    2006-11-29 13:52:18.000000000 +0000
@@ -18,6 +18,14 @@
         exit(1);
     }

+    {
+        int val = IP_PMTUDISC_DONT;
+        if (setsockopt(s, IPPROTO_IP, IP_MTU_DISCOVER, &val, sizeof(val)) < 0)
+        {
+            fprintf(stderr, "Failed to disable PMTU discovery\n");
+        }
+    }
+
     to.sin_family = AF_INET;
     t.s_addr = htonl(0x01020304);
     memcpy(&to.sin_addr, &t.s_addr, 4);

But that involves modifying every bit of UDP-sending code out there, and
it's also very Linux-specific so it needs to be wrapped in autoconf, or at
least #ifdef, for portable use.
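
A minimal sketch of such an #ifdef wrapper follows. The function name
allow_fragmentation is made up for this example. It assumes that a system
which does not define IP_MTU_DISCOVER also does not set DF on UDP by default,
as observed on FreeBSD above; that assumption would need checking on any
other platform.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Hypothetical helper: ask the kernel not to set DF on this socket's
 * datagrams. Returns 0 on success, or where the option does not exist
 * and nothing is done; returns -1 if setsockopt() fails. */
int allow_fragmentation(int s)
{
#if defined(IP_MTU_DISCOVER) && defined(IP_PMTUDISC_DONT)
    int val = IP_PMTUDISC_DONT;

    return setsockopt(s, IPPROTO_IP, IP_MTU_DISCOVER, &val, sizeof(val));
#else
    (void) s;           /* assumption: no per-socket DF default to undo */
    return 0;
#endif
}
```

The test program would call allow_fragmentation(s) right after socket()
succeeds, in place of the inline block in the patch.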

I have had to talk to manufacturers of Linux-embedded devices (such as SIP
ATAs) and explain to them what happens when a SIP packet goes above 1500
bytes, as it can when it offers six codecs and goes through a couple of SIP
proxies, and show them how they can fix their code using the above patch.

So I was just wondering:

- can someone explain the rationale for Linux's behaviour?

- is there perhaps a case for having separate knobs to disable
  "IP PMTU discovery" without disabling TCP PMTU discovery?

Regards,

Brian.
-
To unsubscribe from this list: send the line "unsubscribe linux-net" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
