On Tue, 2010-12-14 at 17:24 +0100, Eric Dumazet wrote:
> On Tuesday 14 December 2010 at 17:09 +0100, Jesper Dangaard Brouer wrote:
> > On Tue, 2010-12-14 at 16:31 +0100, Eric Dumazet wrote:
> > > On Tuesday 14 December 2010 at 15:46 +0100, Jesper Dangaard Brouer wrote:
> > > > I'm experiencing RX packet drops during calls to iptables on my
> > > > production servers.
> > > >
> > > > Further investigation showed that it is only the CPU executing the
> > > > iptables command that experiences packet drops!? Thus, a quick fix was
> > > > to force the iptables command to run on one of the idle CPUs (this can
> > > > be achieved with the "taskset" command).
> > > >
> > > > I have a 2x Xeon 5550 CPU system, thus 16 CPUs (with HT enabled). We
> > > > only use 8 CPUs due to a multiqueue limitation of 8 queues in the
> > > > 1 Gbit/s NICs (82576 chips). CPUs 0 to 7 are assigned for packet
> > > > processing via smp_affinity.
> > > >
> > > > Can someone explain why the packet drops only occur on the CPU
> > > > executing the iptables command?
> > >
> > > It blocks BH.
> > >
> > > Take a look at commits:
> > >
> > > 24b36f0193467fa727b85b4c004016a8dae999b9
> > > netfilter: {ip,ip6,arp}_tables: dont block bottom half more than
> > > necessary
> > >
> > > 001389b9581c13fe5fc357a0f89234f85af4215d
> > > netfilter: {ip,ip6,arp}_tables: avoid lockdep false positiv
<... cut ...>
> >
> > Looking closer at the two combined code changes, I see that the code path
> > has been improved (a bit), as local BH is now only disabled inside the
> > for_each_possible_cpu(cpu) loop. Before, local BH was disabled for the
> > whole function. Guess I need to reproduce this in my testlab.

To do some further investigation into the unfortunate behavior of the
iptables get_counters() function, I started to use "ftrace". This is a
really useful tool (thanks, Steven Rostedt).

 # Select the tracer "function_graph"
 echo function_graph > /sys/kernel/debug/tracing/current_tracer

 # Limit the functions we look at:
 echo local_bh_\* > /sys/kernel/debug/tracing/set_ftrace_filter
 echo get_counters >> /sys/kernel/debug/tracing/set_ftrace_filter

 # Enable tracing while calling iptables
 cd /sys/kernel/debug/tracing
 echo 0 > trace
 echo 1 > tracing_enabled; taskset 1 iptables -vnL > /dev/null; echo 0 > tracing_enabled
 cat trace | less

The reduced output:

 # tracer: function_graph
 #
 # CPU  DURATION                  FUNCTION CALLS
 # |     |   |                     |   |   |   |
  2)    2.772 us     |  local_bh_disable();
 ....
  0)    0.228 us     |  local_bh_enable();
  0)                 |  get_counters() {
  0)    0.232 us     |    local_bh_disable();
  0)    7.919 us     |    local_bh_enable();
  0) ! 109467.1 us   |  }
  0)    2.344 us     |  local_bh_disable();

The output shows that we spend no less than 100 ms with local BH
disabled. So, no wonder that this causes packet drops in the NIC
(attached to this CPU).

The iptables ruleset in question is also very large; it contains:

 Chains: 20929
 Rules:  81239

The vmalloc size is approx 19 MB (19,820,544 bytes) (see
/proc/vmallocinfo). Looking through vmallocinfo I realized that, even
though I only have 16 CPUs, there are 32 ruleset allocations from
"xt_alloc_table_info" (for the filter table). Thus, I have approx
634 MB of iptables filter rules in the kernel, half of which is totally
unused. Guess this is because we use "for_each_possible_cpu" instead of
"for_each_online_cpu". (Feel free to fix this, or point me to some
documentation of this CPU hotplug stuff... I see we are missing
get_cpu() and put_cpu() in a lot of places.)
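For reference, my reading of the code after those two commits is that
get_counters() now looks roughly like this (a simplified sketch, not the
verbatim kernel source; the helper names xt_entry_foreach(),
xt_info_wrlock(), SET_COUNTER() and ADD_COUNTER() are from my reading of
ip_tables.c and may not match the tree exactly):

static void
get_counters(const struct xt_table_info *t,
             struct xt_counters counters[])
{
        struct ipt_entry *iter;
        unsigned int cpu, i;
        unsigned int curcpu = get_cpu();

        /* Read the current CPU's counters first, with BH disabled,
         * so a softirq cannot run ipt_do_table() underneath us. */
        local_bh_disable();
        i = 0;
        xt_entry_foreach(iter, t->entries[curcpu], t->size) {
                SET_COUNTER(counters[i], iter->counters.bcnt,
                            iter->counters.pcnt);
                ++i;
        }
        local_bh_enable();

        /* For the other CPUs, BH is now disabled only around each
         * per-CPU walk instead of around the whole loop. */
        for_each_possible_cpu(cpu) {
                if (cpu == curcpu)
                        continue;
                i = 0;
                local_bh_disable();
                xt_info_wrlock(cpu);
                xt_entry_foreach(iter, t->entries[cpu], t->size) {
                        ADD_COUNTER(counters[i], iter->counters.bcnt,
                                    iter->counters.pcnt);
                        ++i;
                }
                xt_info_wrunlock(cpu);
                local_bh_enable();
        }
        put_cpu();
}

So the BH-off window is now one ruleset walk per CPU, instead of 32 walks
back to back.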
The GOOD NEWS is that moving the local BH disable section into the
"for_each_possible_cpu" loop fixed the problem with packet drops during
iptables calls.

I wanted to profile the new code with ftrace, but I cannot get the
measurement I want. Perhaps Steven or Acme can help? I want to measure
the time spent between local_bh_disable() and local_bh_enable() within
the loop, but I cannot figure out how to do that. The new trace looks
almost the same as before, just with a lot of local_bh_* calls inside
the get_counters() function call. My guess is that the time spent per
iteration is: 100 ms / 32 = 3.125 ms.

Now I just need to calculate how large a NIC buffer is needed to absorb
3.125 ms at 1 Gbit/s:

 3.125 ms * 1 Gbit/s = 0.003125 s * 125,000,000 bytes/s = 390,625 bytes

Can this be correct? How much buffer does each queue have in the 82576
NIC? (Hope Alexander Duyck can answer this one?)

-- 
Med venlig hilsen / Best regards
  Jesper Brouer
  ComX Networks A/S
  Linux Network Kernel Developer
  Cand. Scient Datalog / MSc.CS
  Author of http://adsl-optimizer.dk
  LinkedIn: http://www.linkedin.com/in/brouer
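PS: The only idea I have so far for getting the per-iteration number
directly is to instrument get_counters() itself with trace_printk(),
something like this (a completely untested sketch, reusing the names
from the loop sketched above):

        for_each_possible_cpu(cpu) {
                ktime_t t0, t1;

                if (cpu == curcpu)
                        continue;
                i = 0;
                local_bh_disable();
                t0 = ktime_get();
                xt_info_wrlock(cpu);
                xt_entry_foreach(iter, t->entries[cpu], t->size) {
                        ADD_COUNTER(counters[i], iter->counters.bcnt,
                                    iter->counters.pcnt);
                        ++i;
                }
                xt_info_wrunlock(cpu);
                t1 = ktime_get();
                local_bh_enable();
                /* One line per CPU in the ftrace ring buffer */
                trace_printk("cpu %u: BH disabled for %lld us\n",
                             cpu, (long long)ktime_us_delta(t1, t0));
        }

Each iteration should then show up as its own line in
/sys/kernel/debug/tracing/trace, next to the function_graph output.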