Re: neighbor table overflow

"Marco C. Coelho" <maillist1@xxxxxxxxxxxxx> · Fri, 07 Dec 2007 11:17:09 -0600

Ok, I hope this helps someone else out there when they google neighbor
table overflow solution linux kernel:

This is just an update to state that since gc_thresh1 was increased to
a number greater than the number of simultaneous connected PPPoE
clients on this box, it has not given me the neighbor table problem. 

So set gc_thresh1 greater than the number of local connections you get
with:

ip route | grep link | wc -l

So in /etc/sysctl.conf add something like:

# Added to stop "neighbor table overflow" messages in the kernel

net.ipv4.neigh.default.gc_thresh1=1024

net.ipv4.neigh.default.gc_thresh2=2048

net.ipv4.neigh.default.gc_thresh3=4096

# Added to increase IP conntrack number (was getting to max)

net.ipv4.ip_conntrack_max=99999 
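
For anyone who wants to script the sizing rule above, here is a rough sketch. It assumes gc_thresh1 should be the smallest power of two strictly greater than the link-route count (starting from the default of 128), with gc_thresh2 and gc_thresh3 at two and four times that, which matches the 1024/2048/4096 values used here. The doubling rule and the function names count_link_routes and suggest_gc_thresh are illustrative, not anything the kernel requires.

```shell
#!/bin/sh
# Count "scope link" routes in `ip route` output read from stdin.
# (Slightly stricter than a plain `grep link`.)
count_link_routes() {
    awk '/scope link/ { n++ } END { print n + 0 }'
}

# Print sysctl.conf lines for a given number of local connections.
# Assumption: gc_thresh1 is the smallest power of two strictly greater
# than the count, starting from 128; thresh2 and thresh3 double from there.
suggest_gc_thresh() {
    count=$1
    t1=128
    while [ "$t1" -le "$count" ]; do
        t1=$((t1 * 2))
    done
    echo "net.ipv4.neigh.default.gc_thresh1=$t1"
    echo "net.ipv4.neigh.default.gc_thresh2=$((t1 * 2))"
    echo "net.ipv4.neigh.default.gc_thresh3=$((t1 * 4))"
}
```

On a live box this could be run as: suggest_gc_thresh "$(ip route | count_link_routes)". Review the output before pasting it into /etc/sysctl.conf.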

Have a Merry Christmas!

Marco Coelho

Argon Technologies Inc.

www.argontech.net

Marco C. Coelho wrote:

Still beating the same bush!

I've done all the possible suggestions so far.  I still was getting a
neighbor table overflow.

Looking at the arp(7) man page, I see:

       gc_thresh1
              The minimum number of entries to keep in the ARP cache.
              The garbage collector will not run if there are fewer
              than this number of entries in the cache.  Defaults to 128.

       gc_thresh2
              The soft maximum number of entries to keep in the ARP
              cache.  The garbage collector will allow the number of
              entries to exceed this for 5 seconds before collection
              will be performed.  Defaults to 512.

       gc_thresh3
              The hard maximum number of entries to keep in the ARP
              cache.  The garbage collector will always run if there
              are more than this number of entries in the cache.
              Defaults to 1024.
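
To see where a box sits relative to the three thresholds described in that man page text, something like the following sketch may help. The function names read_gc_thresh and gc_band are made up for illustration. The default directory is the usual /proc/sys location for these sysctls, and the band descriptions only paraphrase the man page wording.

```shell
#!/bin/sh
# Print the three thresholds on one line, read from the given directory
# (defaults to the standard procfs location for the IPv4 neighbor defaults).
read_gc_thresh() {
    dir=${1:-/proc/sys/net/ipv4/neigh/default}
    echo "$(cat "$dir/gc_thresh1") $(cat "$dir/gc_thresh2") $(cat "$dir/gc_thresh3")"
}

# Classify an entry count against thresh1/thresh2/thresh3.
# Usage: gc_band ENTRIES T1 T2 T3
gc_band() {
    entries=$1; t1=$2; t2=$3; t3=$4
    if [ "$entries" -lt "$t1" ]; then
        echo "below gc_thresh1"      # garbage collector will not run
    elif [ "$entries" -le "$t2" ]; then
        echo "between gc_thresh1 and gc_thresh2"
    elif [ "$entries" -le "$t3" ]; then
        echo "above gc_thresh2"      # tolerated for 5 seconds before collection
    else
        echo "above gc_thresh3"      # garbage collector always runs
    fi
}
```

For example: gc_band 600 $(read_gc_thresh) classifies 600 entries against whatever the running kernel currently has configured.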

Since this box never gets less than 500 pppoe connections, this Sat I
changed

                 WAS     NOW

gc_thresh1       512    1024

gc_thresh2      2048    2048

gc_thresh3      4096    4096

What's strange is that when I do an 'arp -an' I only get three entries back
(IPs changed to protect the guilty).  Shouldn't this show the ARP
entries?

? (x.202.x.3) at 00:03:47:2D:8B:F9 [ether] on eth0

? (x.202.x.1) at 00:03:E3:88:EC:C2 [ether] on eth0

? (x.202.x.2) at 00:18:8B:76:EC:D8 [ether] on eth0

? (x.202.x.9) at 00:90:27:43:C2:CF [ether] on eth0
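
A quick way to tally output like the above per interface is sketched below. The function name arp_entries_per_iface is hypothetical, and it only assumes each `arp -an` line ends in "on <interface>", as in the listing. As a side note, `ip neigh show nud all` also lists entries in the NOARP state, which default listings can hide. Whether that explains the small count seen here is not established in this thread.

```shell
#!/bin/sh
# Count `arp -an` entries per interface, reading the arp output from stdin.
# Relies only on each entry line ending with "on <interface>".
arp_entries_per_iface() {
    awk '$(NF-1) == "on" { count[$NF]++ }
         END { for (i in count) print i, count[i] }' | sort
}
```

Usage on a live system would be: arp -an | arp_entries_per_iface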

ip route | grep link provides:

snip (lots of pppoe connects)

x.202.x.237 dev ppp53  proto kernel  scope link  src 10.20.1.1 

x.202.x.235 dev ppp339  proto kernel  scope link  src 10.20.1.1 

x.202.x.232 dev ppp185  proto kernel  scope link  src 10.20.1.1 

x.202.x.231 dev ppp313  proto kernel  scope link  src 10.20.1.1 

x.202.x.230 dev ppp67  proto kernel  scope link  src 10.20.1.1 

x.202.x.226 dev ppp74  proto kernel  scope link  src 10.20.1.1 

x.202.x.224 dev ppp150  proto kernel  scope link  src 10.20.1.1 

x.202.x.0/24 dev eth0  proto kernel  scope link  src x.202.224.8 

192.168.1.0/24 dev eth3  proto kernel  scope link  src 192.168.1.8 

I don't think we are doing anything too special with this box that we
would see a kernel issue no one else is seeing.  Can arp poisoning
cause this?

a dmesg after a clean reboot only gives:

Shorewall:all2all:REJECT:IN=ppp413 OUT= MAC= SRC=""
DST=10.20.1.1 LEN=60 TOS=0x00 PREC=0x00 TTL=254 ID=39752 PROTO=ICMP
TYPE=8 CODE=0 ID=25040 SEQ=0 

Shorewall:all2all:REJECT:IN=ppp160 OUT=eth3 SRC=""
DST=192.168.1.7 LEN=72 TOS=0x00 PREC=0x00 TTL=126 ID=48363 PROTO=UDP
SPT=427 DPT=427 LEN=52 

Shorewall:all2all:REJECT:IN=ppp160 OUT=eth3 SRC=""
DST=192.168.1.7 LEN=48 TOS=0x00 PREC=0x00 TTL=126 ID=48492 DF PROTO=TCP
SPT=36005 DPT=9220 WINDOW=16384 RES=0x00 SYN URGP=0 

Shorewall:all2all:REJECT:IN=ppp160 OUT=eth3 SRC=""
DST=192.168.1.7 LEN=48 TOS=0x00 PREC=0x00 TTL=126 ID=48493 DF PROTO=TCP
SPT=36005 DPT=9220 WINDOW=16384 RES=0x00 SYN URGP=0 

Shorewall:all2all:REJECT:IN=ppp160 OUT=eth3 SRC=""
DST=192.168.1.7 LEN=48 TOS=0x00 PREC=0x00 TTL=126 ID=48517 DF PROTO=TCP
SPT=36005 DPT=9220 WINDOW=16384 RES=0x00 SYN URGP=0 

Shorewall:all2all:REJECT:IN=ppp160 OUT=eth3 SRC=""
DST=192.168.1.7 LEN=48 TOS=0x00 PREC=0x00 TTL=126 ID=48518 DF PROTO=TCP
SPT=33969 DPT=16398 WINDOW=16384 RES=0x00 SYN URGP=0 

Shorewall:all2all:REJECT:IN=ppp160 OUT=eth3 SRC=""
DST=192.168.1.7 LEN=72 TOS=0x00 PREC=0x00 TTL=126 ID=48519 PROTO=UDP
SPT=427 DPT=427 LEN=52 

Shorewall:all2all:REJECT:IN=ppp160 OUT=eth3 SRC=""
DST=192.168.1.7 LEN=48 TOS=0x00 PREC=0x00 TTL=126 ID=48522 DF PROTO=TCP
SPT=33969 DPT=16398 WINDOW=16384 RES=0x00 SYN URGP=0 

Shorewall:all2all:REJECT:IN=ppp160 OUT=eth3 SRC=""
DST=192.168.1.7 LEN=48 TOS=0x00 PREC=0x00 TTL=126 ID=48526 DF PROTO=TCP
SPT=33969 DPT=16398 WINDOW=16384 RES=0x00 SYN URGP=0 

Shorewall:all2all:REJECT:IN=ppp160 OUT=eth3 SRC=""
DST=192.168.1.7 LEN=48 TOS=0x00 PREC=0x00 TTL=126 ID=48614 DF PROTO=TCP
SPT=35790 DPT=9220 WINDOW=16384 RES=0x00 SYN URGP=0 

Shorewall:all2all:REJECT:IN=ppp160 OUT=eth3 SRC=""
DST=192.168.1.7 LEN=48 TOS=0x00 PREC=0x00 TTL=126 ID=48630 DF PROTO=TCP
SPT=35790 DPT=9220 WINDOW=16384 RES=0x00 SYN URGP=0 

Shorewall:all2all:REJECT:IN=ppp160 OUT=eth3 SRC=""
DST=192.168.1.7 LEN=48 TOS=0x00 PREC=0x00 TTL=126 ID=48x6 DF PROTO=TCP
SPT=35790 DPT=9220 WINDOW=16384 RES=0x00 SYN URGP=0 

Shorewall:all2all:REJECT:IN=ppp160 OUT=eth3 SRC=""
DST=192.168.1.7 LEN=48 TOS=0x00 PREC=0x00 TTL=126 ID=48x8 DF PROTO=TCP
SPT=34718 DPT=16398 WINDOW=16384 RES=0x00 SYN URGP=0 

Shorewall:all2all:REJECT:IN=ppp160 OUT=eth3 SRC=""
DST=192.168.1.7 LEN=48 TOS=0x00 PREC=0x00 TTL=126 ID=48663 DF PROTO=TCP
SPT=34718 DPT=16398 WINDOW=16384 RES=0x00 SYN URGP=0 

Shorewall:all2all:REJECT:IN=ppp160 OUT=eth3 SRC=""
DST=192.168.1.7 LEN=48 TOS=0x00 PREC=0x00 TTL=126 ID=48679 DF PROTO=TCP
SPT=34718 DPT=16398 WINDOW=16384 RES=0x00 SYN URGP=0 

Shorewall:all2all:REJECT:IN=ppp160 OUT=eth3 SRC=""
DST=192.168.1.7 LEN=72 TOS=0x00 PREC=0x00 TTL=126 ID=48724 PROTO=UDP
SPT=427 DPT=427 LEN=52 

Kernel Version 2.6.18-8.1.6

Looking for any suggestions.

Marco

Andrei Kovacs wrote:

    On 10/25/07, Marco C. Coelho <maillist1@xxxxxxxxxxxxx> wrote:

       Looking into it further an ip route shows:

 x.0.0.0/8 via x.y.224.1 dev eth0  proto zebra  metric 20 equalize

 So the x.0.0.0 announce is coming into this box through OSPF  (zebra)

 The 169.254.0.0/16 is being automajically added through the sysconfig
network scripts.  I'm looking into why.

Add "NOZEROCONF=yes" in /etc/sysconfig/network and the 169.254.0.0/16
network won't be created anymore.

       In either case I still don't see why these entries would make the neighbor
table overflow.  Could it have been the previous fix to the hosts file?

 mc

 Alexandru Dragoi wrote:
 Marco C. Coelho wrote:

 the ip route with a grep for link returns:

snip** too long
x.y.x.198 dev ppp436 proto kernel scope link src 10.20.1.1
x.y.x.196 dev ppp421 proto kernel scope link src 10.20.1.1
x.y.x.197 dev ppp211 proto kernel scope link src 10.20.0.1
x.y.x.194 dev ppp13 proto kernel scope link src 10.20.1.1
x.y.x.192 dev ppp404 proto kernel scope link src 10.20.1.1
x.y.x.254 dev ppp194 proto kernel scope link src 10.20.1.1
x.y.x.253 dev ppp130 proto kernel scope link src 10.20.1.1
x.y.x.252 dev ppp243 proto kernel scope link src 10.20.1.1
x.y.x.249 dev ppp195 proto kernel scope link src 10.20.1.1
x.y.x.248 dev ppp254 proto kernel scope link src 10.20.1.1
x.y.x.247 dev ppp235 proto kernel scope link src 10.20.1.1
x.y.x.242 dev ppp78 proto kernel scope link src 10.20.1.1
x.y.x.240 dev ppp328 proto kernel scope link src 10.20.1.1
x.y.x.237 dev ppp44 proto kernel scope link src 10.20.1.1
x.y.x.236 dev ppp122 proto kernel scope link src 10.20.1.1
x.y.x.234 dev ppp316 proto kernel scope link src 10.20.1.1
x.y.x.232 dev ppp132 proto kernel scope link src 10.20.1.1
x.y.x.231 dev ppp104 proto kernel scope link src 10.20.0.1
x.y.x.226 dev ppp179 proto kernel scope link src 10.20.0.1
x.y.224.0/24 dev eth0 proto kernel scope link src x.y.224.8
192.168.1.0/24 dev eth3 proto kernel scope link src 192.168.1.8
169.254.0.0/16 dev eth3 scope link

 The one above must be deleted; many Red Hat-like distros add the
169.254.0.0/16 route automatically.
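
A small sketch for removing that route by hand: it reads `ip route` output and prints (does not run) an `ip route del` command for each 169.254.0.0/16 entry it finds. The function name zeroconf_route_del_cmds is made up for illustration, and a route deleted this way may come back when the network scripts next run.

```shell
#!/bin/sh
# Read `ip route` output on stdin and print a delete command for every
# 169.254.0.0/16 route bound directly to a device. Nothing is executed.
zeroconf_route_del_cmds() {
    awk '$1 == "169.254.0.0/16" && $2 == "dev" {
             print "ip route del 169.254.0.0/16 dev " $3
         }'
}
```

Usage: ip route | zeroconf_route_del_cmds, then run the printed commands as root once they look right.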

 All the pppoe terminations (pppd) are shown, as well as the last three
subnets. I'll have to see where the 169.254.0.0/16 is coming from?

mc

Alexandru Dragoi wrote:

 Marco C. Coelho wrote:

 This box is doing a lot. It terminates 1000 PPPoE connections,
provides traffic shaping using TC/HTB, authenticates all users via
Radius. It also runs OSPF routing for the internal network. Looking
at a simple route output I see all the PPP connections coming through
the box, and due to the OSPF I also see the rest of my network
announcements. The only strange things are:

1. The last man working on this box had mistakenly edited the hosts
file and added the machine name and complete domain name to the local
host 127.0.0.1 name. It should only be pointed to the eth0
interface. I have changed this.

2. The route output is making an announcement

 x.0.0.0 argontech.net 255.0.0.0 UG 20 0 0 eth0

 This doesn't look dangerous for your problem, I was only talking about
directly connected networks:

# ip route |grep link

 My public IP space is a /20 within that space, not the whole Class A.
I have not found which box is announcing this within my network yet.

Jeff Welling wrote:

 On 10/23/07 06:56, Alexandru Dragoi wrote:

 What about checking your routing table? you may have link routes
for massive subnets (like 85.0.0.0/8 or 140.20.0.0/16). Some
programs prefer to use "standard" netmask of classes A and B.
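
The check suggested here can be roughed out as below. A directly connected route covering a huge subnet means the kernel may try to resolve a neighbor for every address it talks to in that range. The sketch lists "scope link" routes whose prefix is shorter than a cutoff. The function name wide_link_routes and the default cutoff of /20 are arbitrary choices for illustration.

```shell
#!/bin/sh
# Read `ip route` output on stdin and print "PREFIX DEVICE" for every
# scope-link route whose prefix length is shorter than the cutoff.
# Usage: wide_link_routes [MIN_PREFIX_LEN]   (default 20, an arbitrary choice)
wide_link_routes() {
    min_len=${1:-20}
    awk -v min="$min_len" '
        /scope link/ && $1 ~ /\// && $2 == "dev" {
            split($1, parts, "/")
            if (parts[2] + 0 < min + 0) print $1, $3
        }'
}
```

Usage: ip route | wide_link_routes 20. Host routes for individual PPP peers have no prefix length and are skipped.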

 I'm betting that the OP has other things going on seeing as how
s/he mentioned PPPoE, which to my knowledge is a layer 2 protocol,
and thus not subject to typical routing scenarios. In essence the
OP could have thousands of PPPoE connections terminating on one
system with the ARP cache having to deal with where to send traffic
to which MAC address. There is not a lot of room for routing in such
a scenario.

 I agree with Peter's suggestion, arpd. I ran into the neighbor table
overflow problem recently, at the hands of our ISP. I was in the
process of recompiling the kernel and mucking with arpd (I couldn't
get it to run/start properly) when the problem disappeared as quickly
as it showed up. Lucky for me, this was some kind of ISP problem, I
was able to determine that much through `tcpdump -i X -n arp`.

My 'two cents' is that you try arpd, I did a bit of looking when I
came across that problem and it seemed to be the last ditch effort
when changing the gc threshold had no effect. Wasn't able to confirm
that it worked for sure though.

Cheers.
_______________________________________________
LARTC mailing list
LARTC@xxxxxxxxxxxxxxx
http://mailman.ds9a.nl/cgi-bin/mailman/listinfo/lartc

Linux Advanced Routing and Traffic Control