Bug with ARP - request source address on wrong subnet

Richard Underwood <richard@aspectgroup.co.uk> · Fri, 15 Aug 2003 13:03:51 +0100



Hi,

	I have a problem with ARP on Linux 2.4.20 (RedHat 2.4.20-18.8 if it
matters) which I believe to be a bug. While I'm willing to upgrade the
kernel, it appears to be a generic problem.

	Our web servers are load-balanced by a Foundry ServerIron using DSR
(direct server return), which means the return path of the packets doesn't
go through the ServerIron. To make this work, the Linux servers have the
ServerIron's valid IP address on a loopback interface, and the ServerIron
routes packets to them instead of doing its usual address rewriting.

	The relevant interfaces look like this:

eth0      Link encap:Ethernet  HWaddr 00:04:75:CA:C4:EF  
          inet addr:10.10.10.14  Bcast:10.10.10.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1623551911 errors:0 dropped:0 overruns:1 frame:0
          TX packets:1575017402 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100 
          RX bytes:2905003530 (2770.4 Mb)  TX bytes:3337437145 (3182.8 Mb)
          Interrupt:10 Base address:0x8400 

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:355748 errors:0 dropped:0 overruns:0 frame:0
          TX packets:355748 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:237452671 (226.4 Mb)  TX bytes:237452671 (226.4 Mb)

lo:0      Link encap:Local Loopback  
          inet addr:212.xxx.yyy.9  Mask:255.255.255.255
          UP LOOPBACK RUNNING  MTU:16436  Metric:1

	The default gateway is 10.10.10.1.

	All this works very well - except we have problems with ARP. After
shutting down the web server for a while, the load balancer sees it come
back up, but the web server can't route packets outbound at all.

	Looking into it, the following demonstrates the problem:

# arp -d 10.10.10.1
# ping -I 212.xxx.yyy.9 eff.org
PING eff.org (209.237.229.14) from 212.xxx.yyy.9 : 56(84) bytes of data.
^C
# arp -a | grep 10.10.10.1
? (10.10.10.1) at <incomplete> on eth0

	On eth0, we see:

11:23:55.650514 0:4:75:ca:c4:ef Broadcast arp 42: arp who-has 10.10.10.1 tell 212.xxx.yyy.9
     0001 0800 0604 0001 0004 75ca c4ef d4xx
     yy09 0000 0000 0000 0a0a 0a01
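
For readers who want to map the hex dump onto the ARP fields, here is a
rough Python sketch that splits the 28-byte payload apart. The two redacted
octets (xx and yy) are replaced with 00 purely as placeholders so that the
bytes parse; they are not the real values. The names parse_arp and
format_mac are made up for this illustration.

```python
import ipaddress
import struct

# ARP payload from the capture above. ASSUMPTION: the redacted octets
# "xx" and "yy" are written as 00 here only so the bytes can be parsed.
ARP_HEX = (
    "0001 0800 0604 0001 0004 75ca c4ef d400 "
    "0009 0000 0000 0000 0a0a 0a01"
)


def format_mac(raw: bytes) -> str:
    """Render a 6-byte hardware address as colon-separated hex."""
    return ":".join(f"{b:02x}" for b in raw)


def parse_arp(payload: bytes) -> dict:
    """Split a 28-byte Ethernet/IPv4 ARP payload into its fields."""
    # Network byte order: hardware type, protocol type, hardware length,
    # protocol length, opcode, then sender and target address pairs.
    htype, ptype, hlen, plen, oper, sha, spa, tha, tpa = struct.unpack(
        "!HHBBH6s4s6s4s", payload
    )
    return {
        "htype": htype,
        "ptype": ptype,
        "hlen": hlen,
        "plen": plen,
        "oper": oper,  # 1 = request, 2 = reply
        "sender_mac": format_mac(sha),
        "sender_ip": str(ipaddress.IPv4Address(spa)),
        "target_mac": format_mac(tha),
        "target_ip": str(ipaddress.IPv4Address(tpa)),
    }


if __name__ == "__main__":
    for name, value in parse_arp(bytes.fromhex(ARP_HEX)).items():
        print(f"{name:>10}: {value}")
```

The sender protocol address (bytes 14 to 17 of the payload) is the field
that carries the 212.xxx.yyy.9 address, while the target protocol address
is 10.10.10.1.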

	The <incomplete> ARP entry remains, blocking all access via the
default gateway. If I omit the -I 212.xxx.yyy.9 option, the ARP request
originates from 10.10.10.14 instead and everything works fine.

	The problem only occurs after a period of inactivity, and only if the
first ARP request is due to traffic to the 212.xxx.yyy.9 address. Because
the incomplete ARP entry remains, traffic that would normally cause a valid
ARP request doesn't generate a new one, causing a complete loss of
connectivity.

	As I understand it, sending an ARP request whose sender address (the
address the reply goes to) isn't on the local subnet simply doesn't make
sense. Section A.3 of RFC 985 also suggests such packets should be dropped
by the next hop.
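
The on-subnet test I have in mind can be sketched in a few lines of Python.
This only illustrates the check, it is not how the kernel or any router
implements it. The function name arp_source_is_on_link is made up, and
212.0.0.9 is a placeholder standing in for the redacted 212.xxx.yyy.9
address.

```python
import ipaddress


def arp_source_is_on_link(sender_ip: str, iface_cidr: str) -> bool:
    """Return True if sender_ip lies inside the interface's subnet."""
    iface = ipaddress.ip_interface(iface_cidr)
    return ipaddress.ip_address(sender_ip) in iface.network


# eth0 as configured above: 10.10.10.14 with netmask 255.255.255.0.
ETH0 = "10.10.10.14/24"

if __name__ == "__main__":
    # Request sent with the eth0 address as sender: on the local subnet.
    print(arp_source_is_on_link("10.10.10.14", ETH0))
    # Request sent with the loopback alias as sender (placeholder address):
    # not on the local subnet, so the gateway may not answer it.
    print(arp_source_is_on_link("212.0.0.9", ETH0))
```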

	The temporary workaround is to add static ARP entries for the next
hop, which I will do. However, I believe this is a bug in the Linux
implementation of ARP and should be fixed.

	Thanks,

		Richard