On Thursday 2010-12-16 at 15:04 +0100, Jesper Dangaard Brouer wrote:
> On Tue, 2010-12-14 at 17:24 +0100, Eric Dumazet wrote:
> > On Tuesday 2010-12-14 at 17:09 +0100, Jesper Dangaard Brouer wrote:
> > > On Tue, 2010-12-14 at 16:31 +0100, Eric Dumazet wrote:
> > > > On Tuesday 2010-12-14 at 15:46 +0100, Jesper Dangaard Brouer wrote:
> > > > > I'm experiencing RX packet drops during calls to iptables on my
> > > > > production servers.
> > > > >
> > > > > Further investigation showed that it is only the CPU executing the
> > > > > iptables command that experiences packet drops!? Thus, a quick fix
> > > > > was to force the iptables command to run on one of the idle CPUs
> > > > > (this can be achieved with the "taskset" command).
> > > > >
> > > > > I have a 2x Xeon 5550 CPU system, thus 16 CPUs (with HT enabled).
> > > > > We only use 8 CPUs due to a multiqueue limitation of 8 queues in
> > > > > the 1Gbit/s NICs (82576 chips). CPUs 0 to 7 are assigned for
> > > > > packet processing via smp_affinity.
> > > > >
> > > > > Can someone explain why the packet drops only occur on the CPU
> > > > > executing the iptables command?
> > > >
> > > > It blocks BH.
> > > >
> > > > Take a look at commits:
> > > >
> > > > 24b36f0193467fa727b85b4c004016a8dae999b9
> > > > netfilter: {ip,ip6,arp}_tables: dont block bottom half more than
> > > > necessary
> > > >
> > > > 001389b9581c13fe5fc357a0f89234f85af4215d
> > > > netfilter: {ip,ip6,arp}_tables: avoid lockdep false positive
> <... cut ...>
> >
> > > Looking closer at the two combined code changes, I see that the code
> > > path has been improved (a bit), as local BH is now only disabled
> > > inside the for_each_possible_cpu(cpu) loop. Before, local_bh was
> > > disabled for the whole function. Guess I need to reproduce this in
> > > my testlab.
>
> To do some further investigation into the unfortunate behavior of the
> iptables get_counters() function I started to use "ftrace". This is a
> really useful tool (thanks Steven Rostedt).
>
> # Select the tracer "function_graph"
> echo function_graph > /sys/kernel/debug/tracing/current_tracer
>
> # Limit the number of functions we look at:
> echo local_bh_\* > /sys/kernel/debug/tracing/set_ftrace_filter
> echo get_counters >> /sys/kernel/debug/tracing/set_ftrace_filter
>
> # Enable tracing while calling iptables
> cd /sys/kernel/debug/tracing
> echo 0 > trace
> echo 1 > tracing_enabled;
> taskset 1 iptables -vnL > /dev/null ;
> echo 0 > tracing_enabled
> cat trace | less
>
> The reduced output:
>
> # tracer: function_graph
> #
> # CPU  DURATION                  FUNCTION CALLS
> # |     |   |                     |   |   |   |
>  2)   2.772 us    |  local_bh_disable();
> ....
>  0)   0.228 us    |  local_bh_enable();
>  0)               |  get_counters() {
>  0)   0.232 us    |    local_bh_disable();
>  0)   7.919 us    |    local_bh_enable();
>  0) ! 109467.1 us |  }
>  0)   2.344 us    |  local_bh_disable();
>
> The output shows that we spend no less than 100 ms with local BH
> disabled. So, no wonder that this causes packet drops in the NIC
> (attached to this CPU).
>
> My iptables rule set in question is also very large; it contains:
>  Chains: 20929
>  Rules:  81239
>
> The vmalloc size is approx 19 MB (19,820,544 bytes) (see
> /proc/vmallocinfo). Looking through vmallocinfo I realized that even
> though I only have 16 CPUs, there are 32 allocated rulesets
> "xt_alloc_table_info" (for the filter table). Thus, I have approx
> 634 MB of iptables filter rules in the kernel, half of which is
> totally unused.
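For reference, after the two commits above the counter walk has roughly
this shape. This is only a simplified sketch in the spirit of
get_counters(), not the exact kernel source; the point is that BH is
now re-enabled between cpus instead of being held across the whole
walk, and that the loop (like the rule-blob allocation) visits every
*possible* cpu, which is why you see 32 copies:

#include <linux/bottom_half.h>          /* local_bh_disable()/enable() */
#include <linux/cpumask.h>              /* for_each_possible_cpu()     */
#include <linux/netfilter/x_tables.h>   /* xt_table_info, xt_counters  */

/* Sketch only. Before the two commits, local_bh_disable() was held
 * around the whole loop, i.e. across all 32 per-cpu copies at once. */
static void get_counters_sketch(const struct xt_table_info *t,
				struct xt_counters counters[])
{
	unsigned int cpu;

	for_each_possible_cpu(cpu) {	/* 32 iterations on your box */
		local_bh_disable();
		/* take this cpu's xt_info lock, walk the rule blob in
		 * t->entries[cpu] (t->size bytes) and add each rule's
		 * byte/packet counters into counters[] ... */
		local_bh_enable();	/* softirqs (NIC RX) can run here */
	}
}
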
Boot your machine with "maxcpus=16 possible_cpus=16", it will be much
better ;)

> Guess this is because we use "for_each_possible_cpu" instead of
> "for_each_online_cpu". (Feel free to fix this, or point me to some
> documentation of this CPU hotplug stuff... I see we are missing
> get_cpu() and put_cpu() in a lot of places.)

Are you really using cpu hotplug?

If not, the "maxcpus=16 possible_cpus=16" trick should be enough for
you.

> The GOOD NEWS is that moving the local BH disable section into the
> "for_each_possible_cpu" loop fixed the problem with packet drops
> during iptables calls.
>
> I wanted to profile the new code with ftrace, but I cannot get the
> measurement I want. Perhaps Steven or Acme can help?
>
> Now I want to measure the time used between the local_bh_disable()
> and local_bh_enable(), within the loop. I cannot figure out how to do
> that. The new trace looks almost the same as before, just with a lot
> of local_bh_* calls inside the get_counters() function call.
>
> My guess is that the time spent is: 100 ms / 32 = 3.125 ms.

Yes, approximately.

In order to accelerate it, you could possibly pre-fill the cpu cache
before the local_bh_disable() (just by reading the table), so that the
critical section is short, because the data is then mostly in your cpu
cache.

> Now I just need to calculate how large a NIC buffer I need to buffer
> 3.125 ms at 1 Gbit/s.
>
>  3.125 ms * 1 Gbit/s = 390625 bytes
>
> Can this be correct?
>
> How much buffer does each queue have in the 82576 NIC?
> (Hope Alexander Duyck can answer this one?)
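Your arithmetic is right: 3.125e-3 s * 1e9 bit/s / 8 = 390625 bytes
would need buffering while BH is off. A trivial userspace check; the
1538-byte on-wire frame size below is only an assumption, used to
express the result in frames:

/* Back-of-the-envelope check of the buffering needed during one stall.
 * Assumes a flat 10^9 bit/s line rate; 1538 bytes (1500 MTU plus
 * Ethernet header, FCS, preamble and IFG) is an assumed wire size. */
#include <stdio.h>

int main(void)
{
	const double line_rate_bps = 1e9;       /* 1 Gbit/s              */
	const double stall_sec     = 3.125e-3;  /* ~100 ms / 32 copies   */
	const double frame_bytes   = 1538.0;    /* assumed on-wire size  */

	double bytes  = line_rate_bps * stall_sec / 8.0;
	double frames = bytes / frame_bytes;

	printf("%.0f bytes (~%.0f full-size frames) per 3.125 ms stall\n",
	       bytes, frames);
	return 0;
}

Whether the 82576's per-queue packet buffer plus its RX descriptor ring
can absorb that is indeed a question for the igb people.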