NAPI despises SMP. Any SMP box we run NAPI on has major packet loss under high load, so I find that the e1000 ITR works just as well and there is no reason for NAPI at this point. I will try your settings :)

Current settings (a quick way to apply these is sketched below, after the numbers):

net.ipv4.route.secret_interval = 600
net.ipv4.route.min_adv_mss = 256
net.ipv4.route.min_pmtu = 552
net.ipv4.route.mtu_expires = 600
net.ipv4.route.gc_elasticity = 4
net.ipv4.route.error_burst = 500
net.ipv4.route.error_cost = 100
net.ipv4.route.redirect_silence = 2048
net.ipv4.route.redirect_number = 9
net.ipv4.route.redirect_load = 2
net.ipv4.route.gc_interval = 600
net.ipv4.route.gc_timeout = 15
net.ipv4.route.gc_min_interval = 0
net.ipv4.route.max_size = 32768
net.ipv4.route.gc_thresh = 2000
net.ipv4.route.max_delay = 10
net.ipv4.route.min_delay = 5

Rtstat output under normal traffic:

size IN: hit tot mc no_rt bcast madst masrc OUT: hit tot mc GC: tot ignored goal_miss ovrf
2010 9014 14039 0 0 0 0 0 0 6 2 14038 0 49 0
2008 8675 13999 0 0 0 1 0 1 5 2 13992 0 56 0
2002 8529 16484 0 0 0 1 0 0 7 2 16483 0 43 0
2009 8549 15304 0 0 0 0 0 1 10 2 15303 0 55 0
2007 8491 16118 0 0 0 0 0 0 10 2 16117 0 50 0
2024 8219 18306 0 0 0 1 0 0 7 2 18309 0 14 0
2005 8586 15536 0 0 0 0 0 0 9 2 15536 0 42 0
2007 8804 15797 0 0 0 0 0 0 7 2 15796 0 42 0
2012 8535 16519 0 0 0 1 0 0 7 2 16518 0 28 0
2004 8348 15709 0 0 0 0 1 0 8 2 15707 0 42 0
...
2043 8600 18278 0 0 0 0 0 0 12 2 18285 0 15 0
2030 8631 17731 0 0 0 1 0 0 9 2 17737 0 7 0
2002 8489 14653 0 0 0 1 0 2 5 2 14650 0 35 0
2015 8147 15004 0 0 0 0 0 0 9 2 15003 0 57 0
2015 8352 17303 0 0 0 2 0 0 8 2 17308 0 7 0
2025 8451 16768 0 0 0 0 0 0 6 2 16768 0 35 0
2013 8531 16464 0 0 0 0 0 0 13 2 16476 0 7 0
2013 8117 15202 0 0 0 1 1 0 7 2 15198 0 35 0
size IN: hit tot mc no_rt bcast madst masrc OUT: hit tot mc GC: tot ignored goal_miss ovrf
2019 7913 15054 0 0 0 1 0 0 9 2 15057 0 42 0
2008 8258 16019 0 0 0 0 0 1 9 2 16020 0 43 0
2025 8211 17897 0 0 0 1 0 0 5 2 17902 0 0 0

CPU NORMAL:

CPU0 states: 36.0% user, 29.0% system, 0.0% nice, 33.0% idle
CPU1 states: 18.0% user, 61.0% system, 0.0% nice, 19.0% idle
CPU0 states: 21.0% user, 44.0% system, 0.0% nice, 35.0% idle
CPU1 states: 18.0% user, 47.0% system, 0.0% nice, 35.0% idle

3 root 10 -1 0 0 0 SW< 0.0 0.0 35:29 ksoftirqd_CPU0
4 root 10 -1 0 0 0 SW< 0.0 0.0 35:35 ksoftirqd_CPU1

Rtstat under light juno:

2315 7955 51691 0 0 0 1 1 1 5 1 51695 0 0 0
2336 6620 47387 0 0 0 1 0 1 5 1 47393 0 0 0
2371 5630 49726 0 0 0 0 0 1 12 2 49737 0 0 0
2372 5420 53458 0 0 0 1 0 0 2 1 53460 0 0 0
2369 4891 48983 0 0 0 0 0 1 5 2 48988 0 0 0
2389 4529 50525 0 0 0 0 1 1 8 1 50532 0 0 0
2334 4645 49092 0 0 1 1 0 0 1 1 49093 0 0 0
2358 5033 48971 0 0 0 1 0 1 6 2 48977 0 0 0
2366 4864 51411 0 0 0 2 0 1 8 1 51419 0 0 0
2370 5035 49444 0 0 0 0 0 0 4 2 49448 0 0 0
size IN: hit tot mc no_rt bcast madst masrc OUT: hit tot mc GC: tot ignored goal_miss ovrf
2391 5328 49098 0 0 0 1 0 3 12 3 49110 0 0 0
2363 5586 50687 0 0 0 2 0 0 7 1 50693 0 0 0
2361 4571 49243 0 0 0 0 0 0 2 1 49243 0 0 0
2356 5758 56664 0 0 1 1 0 1 5 1 56666 0 0 0
2375 5581 62098 0 0 0 2 0 0 8 2 62103 0 0 0
2393 3895 50762 0 0 0 1 0 0 5 0 50764 0 0 0
2335 4066 56659 0 0 0 1 0 0 10 2 56667 0 0 0
2315 3607 49990 0 0 0 1 0 0 4 1 49992 0 0 0
2339 4369 54149 0 0 0 1 0 0 7 1 54153 0 0 0

CPU under JUNO:

CPU0 states: 0.0% user, 99.3% system, 0.2% nice, 0.0% idle
CPU1 states: 0.2% user, 99.3% system, 0.1% nice, 0.0% idle

4 root 14 -1 0 0 0 SW< 21.0 0.0 35:33 ksoftirqd_CPU1
3 root 15 -1 0 0 0 SW< 20.1 0.0 35:27 ksoftirqd_CPU0

This is 10 Mbit of juno, or around 9.6 or so.
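For reference, settings like the ones listed at the top can be applied either with sysctl or by writing to /proc directly; a minimal sketch using a few of the values above (assuming sysctl(8) and the usual /proc/sys layout):

# apply a few of the route-cache values listed above (sketch)
sysctl -w net.ipv4.route.gc_thresh=2000
sysctl -w net.ipv4.route.max_size=32768
sysctl -w net.ipv4.route.gc_elasticity=4
sysctl -w net.ipv4.route.gc_min_interval=0

# equivalently, via /proc:
echo 2000 > /proc/sys/net/ipv4/route/gc_thresh
echo 0    > /proc/sys/net/ipv4/route/gc_min_interval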
RTS normal with 8000 thresh:

size IN: hit tot mc no_rt bcast madst masrc OUT: hit tot mc GC: tot ignored goal_miss ovrf
8003 11474 9076 0 0 0 2 0 0 4 2 9071 0 10 0
8010 11425 9205 0 0 0 0 0 0 7 2 9203 0 14 0
8006 11393 12516 0 0 0 1 0 4 5 0 12509 0 20 0
8005 12082 9188 0 0 0 2 0 0 5 2 9184 0 14 0
8004 11447 8893 0 0 0 0 0 0 8 2 8890 0 12 0
8004 12346 8898 0 0 0 1 0 2 5 2 8891 0 10 0
8003 11557 8944 0 0 0 2 0 1 7 1 8942 0 14 0
8004 12812 9890 0 0 0 0 0 1 5 1 9878 0 16 0
8004 12166 11363 0 0 0 1 0 2 3 2 11349 0 23 0
8012 11933 8881 0 0 0 2 0 0 6 2 8874 0 15 0
8003 11938 9024 0 0 0 0 0 1 5 1 9017 0 12 0
8003 12107 8682 0 0 0 1 0 2 3 2 8674 0 13 0
8008 11328 8945 0 0 0 1 0 2 6 1 8942 0 10 0

CPU:

CPU0 states: 0.0% user, 50.0% system, 0.0% nice, 49.0% idle
CPU1 states: 1.0% user, 57.0% system, 0.0% nice, 40.0% idle
CPU0 states: 0.0% user, 27.0% system, 0.0% nice, 72.0% idle
CPU1 states: 0.0% user, 41.0% system, 0.0% nice, 58.0% idle

3 root 12 -1 0 0 0 SW< 0.0 0.0 35:29 ksoftirqd_CPU0
4 root 9 -1 0 0 0 SW< 0.0 0.0 35:35 ksoftirqd_CPU1

I've mucked with tons of settings. I've even had the route cache up to over 600,000 entries and the CPU still had room left for more, so it can't possibly be the size of the cache; it has to be the constant creation and teardown of entries. I can't hit anywhere NEAR 100 kpps on this router with the amount of load on it.

The routing table:

ip ro ls | wc
    516    2598   21032

doesn't have too much in it. It's running BGP, but I'm not taking the full routes right now (we will later, though). There are some ip rules, and also some netfilter rules:

iptables-save | wc
   1154    7658   46126

Of course there aren't 1154 entries, since some of that is the chains and such (a way to count just the rules is sketched below), but there are still a lot of rules in netfilter. Everything seems to slow it down :/ especially the mangle table. If I add 1000 entries to the mangle table in netfilter it uses massive CPU; netfilter seems to be a hog.

Like I said, I've tested this with NO netfilter and nothing else on a test box except for the kernel, e1000, and ITR set to ~4000, and with all sorts of changes to the settings I still can't hit 100 kpps routing with juno-z.
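To count just the rules and leave out the chain declarations, table headers, and COMMIT lines, something like this should work (only a sketch; it relies on iptables-save printing rules as lines starting with "-A"):

iptables-save | grep -c '^-A'              # actual rules, all tables
iptables-save -t mangle | grep -c '^-A'    # rules in the mangle table only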
Paul
xerox@foonet.net
http://www.httpd.net

-----Original Message-----
From: Simon Kirby [mailto:sim@netnation.com]
Sent: Monday, June 09, 2003 6:19 PM
To: CIT/Paul
Cc: 'David S. Miller'; hadi@shell.cyberus.ca; fw@deneb.enyo.de; netdev@oss.sgi.com; linux-net@vger.kernel.org
Subject: Re: Route cache performance under stress

On Mon, Jun 09, 2003 at 03:38:30PM -0400, CIT/Paul wrote:

> gc_elasticity:1
> gc_interval:600
> gc_min_interval:1
> gc_thresh:60000
> gc_timeout:15
> max_delay:10
> max_size:512000

^^^ EEP, no! Even the default of 65536 is too big. No wonder you have no CPU left. This should never be bigger than 65536 (unless the hash is increased), but even then it should be set smaller and the GC interval should be fixed. With a table that large, it's going to be walking the buckets all of the time.

> I've tried other settings, secret-interval 1 which seems to 'flush'
> the cache every second or 60 seconds as I have it here..

That's only for permutating the hash table to avoid remote hash exploits. Ideally, you don't want anything clearing the route cache except for the regular garbage collection (where the gc_elasticity controls how much of it gets nuked).

> If I have secret interval set to 1 the GC never runs because the cache
> never gets > my gc_thresh.. I've also tried this with gc_thresh 2000
> and more aggressive settings (timeout 5, interval 10).. Also tried
> with max_size 16000, but juno pegs the route cache and I get massive
> amounts of dst_cache_overflow messages..

Try setting gc_min_interval to 0 and gc_elasticity to 4 (so that it doesn't entirely nuke it all the time, but so that it runs fairly often and prunes quite a bit). gc_min_interval:0 will actually make it clear as it allocates, if I remember correctly.

> This is 'normal' traffic on the router (using the rtstat program):
>
> ./rts -i 1
> size IN: hit tot mc no_rt bcast madst masrc OUT: hit tot mc GC: tot ignored goal_miss ovrf
> 59272 26954 1826 0 0 0 0 0 6 0 0 0 0 0 0

Yes, your route cache is way too large for the hash. Ours looks like this:

[sroot@r2:/root]# rtstat -i 1
 size IN: hit tot mc no_rt bcast madst masrc OUT: hit tot mc
870721946 16394 1013 8 4 4 0 0 38 12 0
870722937 16278 1007 8 0 10 0 0 32 6 0
870723935 16362 999 5 0 6 0 0 34 8 0
870725083 16483 1158 1 0 0 0 2 26 6 0
870726047 16634 974 0 0 4 0 0 42 0 0
870726168 14315 2338 13 10 8 0 0 34 44 2
870726168 14683 1383 0 8 2 0 0 30 12 2
870726864 16172 1155 0 6 2 0 0 28 4 0
870728079 17842 1234 0 0 0 0 0 28 12 0
870729106 17545 1036 2 0 2 0 0 30 6 0

...Hmm, the size is a bit off there. I'm not sure what that's all about. Did you have to hack on rtstat.c at all?

Alternative:

[sroot@r2:/root]# while (1)
[sroot@r2:(while)]# sleep 1
[sroot@r2:(while)]# ip -o route show cache | wc -l
[sroot@r2:(while)]# end
8064 8706 9299 9939 10277 10857 11426 11731 12328 12796 13096 13623
1139 2712 4233 561 2468 3948 5075 5459 6114 6768 7502 7815
8303 8969 9602 10090 10566 11194 11765 11987 12678 12920 13563 14136
14693 2336 3652 4814 5954 6449 6741 7412 8036

...Hmm, even that is growing a bit large. Pfft. I guess we were doing less traffic last time I checked this. :) Maybe you have a bit more traffic than us in normal operation and it's growing faster because of that. Still, with a gc_elasticity of 1 it should be clearing it out very quickly.

...Though I just tried that, and it's not. In fact, the gc_elasticity doesn't seem to be making much of a difference at all. The only thing that seems to really change it is if I set gc_min_interval to 0:

[sroot@r2:/proc/sys/net/ipv4/route]# echo 0 > gc_min_interval
[sroot@r2:/proc/sys/net/ipv4/route]# while ( 1 )
[sroot@r2:(while)]# sleep 1
[sroot@r2:(while)]# ip -o route show cache | wc -l
[sroot@r2:(while)]# end
9674 9547 9678 9525 9625 9544 9385 497 2579 3820 4083 4099
4068 4054 4089 4095 4137 4072 4071 4137 2141 3414 4044 2487
3759 4047 4085 4092 4156 4089 4008 475 2497 3729 4146 4085
4116

It seems to regulate it after it gets cleared the first time. If I set gc_elasticity to 1 it seems to bounce around a lot more -- 4 is much smoother. It didn't seem to make a difference with gc_min_interval set to 1, though... hmmm. We've been running normally with gc_min_interval set to 1, but it looks like the BGP table updates have kept the cache from growing too large.

> Check what happens when I load up juno..

Yeah... Juno's just going to hit it harder and show the problems with it having to walk through such large hash buckets. How big is your routing table on this box? Is it running BGP?

> slammed at 100% by the ksoftirqds. This is using e1000 with interrupts
> limited to ~4000/second (ITR), no NAPI.. NAPI messes it up big time
> and drops more packets than without :>

Hmm, that's weird. It works quite well here on a single CPU box with tg3 cards.
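If it helps, a bash equivalent of that loop with the suggested GC settings applied first (only a sketch; it assumes the usual /proc/sys layout and iproute2):

cd /proc/sys/net/ipv4/route
echo 0 > gc_min_interval     # let GC run as often as it needs to
echo 4 > gc_elasticity       # prune often, but not everything at once
while true; do
    ip -o route show cache | wc -l   # current number of cached routes
    sleep 1
done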
Simon