RE: ARP cache pollution?

> -----Original Message-----
> From: Neil Horman [mailto:nhorman@xxxxxxxxxxxxx] 
> Sent: Thursday, April 05, 2007 4:44 AM
> To: Jeff Haran
> Cc: linux-net@xxxxxxxxxxxxxxx
> Subject: Re: ARP cache pollution?
> 
> On Mon, Apr 02, 2007 at 03:47:48PM -0700, Jeff Haran wrote:
> > Hi,
> > 
> > I have a PPC32 based embedded system running linux 2.6.14. It has 
> > multiple ethernet interfaces.
> > 
> > I find that if I configure two of those interfaces (eth0 and eth1)
> > with separate IPv4 addresses on the same subnet and connect them both
> > to that LAN, the ARP cache on the system seems to grow with ARP
> > entries to other hosts on that subnet. If I run the following script:
> > 
> > while [ 1 ]; do
> > sleep 1
> > arp -a | wc -l
> > done
> > 
> > the script will report an ever increasing number of lines in the
> > output of arp -a, until it gets to 100 or so at which point I suspect
> > the neighbor table garbage collection code kicks in and the reported
> > number of entries in the ARP cache drops to 2 very quickly. But as
> > soon as that happens, it starts growing again. This is happening in a
> > fairly big lab network so I have no reason to believe that the ARP
> > entries do not correspond to valid hosts on that LAN. When I telnet to
> > them I find they are usually Windows machines.
> > 
> > Sometimes, this is accompanied by:
> > 
> > "Neighbour table overflow."
> > 
> > messages. I suspect this is originating from the garbage collection
> > code where the number of entries in the ARP cache exceeds gc_thresh3
> > and thus attempts to allocate a neighbor structure fail. But
> > gc_thresh3 is 1024 and I have never seen arp -a display anything close
> > to that many entries. Here's an example of it while running the above
> > script:
> > 
> > 	95
> > 	95
> > 	96
> > Neighbour table overflow.
> > Neighbour table overflow.
> > 	97
> > 	97
> > 	98
> > 	99
> > 
> > This only occurs when I have the multiple interfaces configured on the
> > same subnet. If I ifconfig one of them down, the content of the ARP
> > cache remains very small (typically 2 though sometimes up to 5 or so).
> > 
> > So I suspect this has something to do with ARPs from the same network
> > being received on both interfaces. I note that if I look at the output
> > of arp -a, sometimes a given other host's IP address will be
> > associated with eth0 and sometimes with eth1.
> > 
> > But what I don't understand is why the kernel would be caching these
> > ARP entries in the first place. The "Neighbour table overflow."
> > messages are often accompanied by loss of ability to communicate with
> > the system. I'll run arp -a on it and it will show one incomplete ARP
> > entry to the system's default router.
> > 
> > Has anybody ever seen anything like this? 
> > 
> 
> This is happening for exactly the reason you are describing.
> ARP table entries use the tuple <ip,dev> as a key for matching
> when looking up ARP table entries. Since you are receiving ARP
> responses on both interfaces for the same subnet, you get two
> entries for each host on your system (why you have 100 entries
> as opposed to two when you only use 1 interface, I'm not sure).

That's the mysterious part. I suspect that the presence of multiple
interfaces on the same network is stimulating some weird behavior from
other hosts on the LAN that is causing this growth in ARP entries. Guess
I'll have to do some tcpdumping to figure that one out, but it's possible
it has nothing to do with the other hosts on the LAN (see below).
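
Before reaching for tcpdump, the <ip,dev> explanation can be checked
from the ARP table itself. The following is a rough sketch, not a tested
tool for this platform: it assumes the usual six-column /proc/net/arp
layout (IP address, HW type, Flags, HW address, Mask, Device), and the
function names are made up for illustration.

```sh
#!/bin/sh
# Sketch: summarise an ARP table in /proc/net/arp format.
# $1 is the file to read (defaults to /proc/net/arp), so the functions
# can also be run against a saved copy of the table.

# Print "<device> <count>" for each device that has entries.
arp_entries_per_dev() {
    awk 'NR > 1 { n[$6]++ } END { for (d in n) print d, n[d] }' \
        "${1:-/proc/net/arp}" | sort
}

# Print each IP address that appears against more than one device.
arp_ips_on_multiple_devs() {
    awk 'NR > 1 && !seen[$1, $6]++ { devs[$1]++ }
         END { for (ip in devs) if (devs[ip] > 1) print ip }' \
        "${1:-/proc/net/arp}" | sort
}
```

If most hosts show up once per interface, the per-device doubling would
account for a factor of two at most, which still leaves the growth from
a handful of entries to around 100 unexplained.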

> In any case, the kernel treats 
> each arp packet as a unique entry since they arrived on 
> separate interfaces.
> 
> > Is there some tweaking down in /proc/sys/net that can make this
> > stop?
> You have a few choices:
> 
> 1) Don't worry about it.  This is clearly a potential 
> performance problem, but until something really bad happens, 
> it would appear that you are resolving arp entries when you 
> need them, and so functionally you should be ok.

Except that sometimes when the target system starts generating these
messages, I can't communicate with it at all. If I log in through the
serial port and try pinging its default router, I get no response. If I
run arp -a at that point I see one and only one ARP entry, to the default
router, and it is incomplete. This is not easily reproduced and could be
caused by unrelated Ethernet link problems.

> 
> 2) Up the gc_thresh[1,2,3] values to support ~100 entries.  
> That should allow your arp table to grow to the size it needs 
> to be (one entry for each MAC on each interface) and quiesce 
> there.  It should remove your overflow warning messages.

These thresholds are set to 128, 512 and 1024, respectively. When I
first ran into this, I started eyeballing the code that generates the
message to figure out where it could be coming from. So I added the
lines marked with ">" below to net/core/neighbour.c:neigh_alloc():

static struct neighbour *neigh_alloc(struct neigh_table *tbl)
{
    struct neighbour *n = NULL;
    unsigned long now = jiffies;
    int entries;

    entries = atomic_inc_return(&tbl->entries) - 1;
    if (entries >= tbl->gc_thresh3 ||
        (entries >= tbl->gc_thresh2 &&
         time_after(now, tbl->last_flush + 5 * HZ))) {
        if (!neigh_forced_gc(tbl) &&
            entries >= tbl->gc_thresh3) {

>            if (net_ratelimit()) {
>                printk(KERN_WARNING "neigh_alloc(%s): "
>                    "entries %d gc_thresh3 %d gc_thresh2 %d\n",
>                    tbl->id, entries,
>                    tbl->gc_thresh3, tbl->gc_thresh2);
>            }
            goto out_entries;
        }
    }
    ...

This results in the following while running my test script:

     95
     95
printk: 1 messages suppressed.
neigh_alloc(arp_cache): entries 1024 gc_thresh3 1024 gc_thresh2 512
Neighbour table overflow.
neigh_alloc(arp_cache): entries 1024 gc_thresh3 1024 gc_thresh2 512
Neighbour table overflow.
neigh_alloc(arp_cache): entries 1024 gc_thresh3 1024 gc_thresh2 512
Neighbour table overflow.
neigh_alloc(arp_cache): entries 1024 gc_thresh3 1024 gc_thresh2 512
Neighbour table overflow.
neigh_alloc(arp_cache): entries 1024 gc_thresh3 1024 gc_thresh2 512
Neighbour table overflow.
     95
     96
     96

So the output of arp -a is showing 90 or so entries and then all of a
sudden tbl->entries has 1024 in it. Either I am getting stormed with
connections from over 900 hosts on the LAN within the same second, or
there is something funky with how tbl->entries is being maintained when
multiple interfaces are configured on the same net.
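
The same comparison might be possible without patching the kernel. The
sketch below assumes the kernel exposes /proc/net/stat/arp_cache (the
file read by lnstat), that its first column is the table-wide "entries"
counter printed in hex, and that /proc/net/arp has a single header line.
Whether all of that holds on this 2.6.14 PPC32 build is an assumption to
verify, and the function names are only illustrative.

```sh
#!/bin/sh
# Sketch: compare the kernel's neighbour counter with the visible table.
# Each function takes an optional file argument so it can be tried on
# saved copies of the /proc files.

# Print the kernel's entry counter (tbl->entries) in decimal.
kernel_arp_entries() {
    stat_file=${1:-/proc/net/stat/arp_cache}
    # Line 1 is the header; line 2 is the first per-CPU row. The first
    # column is assumed to be the hex "entries" value.
    hex=$(awk 'NR == 2 { print $1 }' "$stat_file")
    printf '%d\n' "0x$hex"
}

# Print the number of entries listed in a /proc/net/arp style file.
visible_arp_entries() {
    awk 'NR > 1 { n++ } END { print n + 0 }' "${1:-/proc/net/arp}"
}
```

Printing both values once a second next to the arp -a loop would show
whether the kernel counter really climbs toward 1024 while the visible
count stays in the 90s. The two numbers are not guaranteed to match even
on a healthy system, since /proc/net/arp may not list every neighbour
object, but a gap of 900 or so would point at the accounting rather than
at the LAN.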

> 3) Present one interface to the network.  Depending on your 
> need for multiple interfaces, it may be to your advantage to 
> use the bonding driver to bond both interfaces into one 
> logical interface.  This presents one interface to the 
> network, and results in getting only one arp response 
> presented to the kernel, and should solve your problem.

That's work for some future product release. Until then, these products
have two ethernet management interfaces so either I have to figure out
why this is happening and fix it or somehow restrict our customers from
configuring both interfaces on the same network.
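
If restricting the configuration ends up being the interim answer, the
check itself is small. This is a minimal sketch in plain sh, assuming
both interfaces use the same netmask; ip_to_int and same_subnet are
hypothetical helper names, not part of any existing management code.

```sh
#!/bin/sh
# Sketch: detect two IPv4 addresses that fall in the same subnet.

# Convert a dotted-quad address to a 32-bit integer.
# Octets are assumed to be plain decimal without leading zeros.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the address into four positional params
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Succeed (exit status 0) if address $1 and address $2 share a network
# under netmask $3.
same_subnet() {
    a=$(ip_to_int "$1")
    b=$(ip_to_int "$2")
    m=$(ip_to_int "$3")
    [ $(( a & m )) -eq $(( b & m )) ]
}
```

A configuration front end could call same_subnet with the addresses
requested for eth0 and eth1 and refuse the change when it succeeds.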

Thanks,

Jeff Haran
Brocade

> Regards
> Neil
> 