On Thu, Apr 16, 2009 at 03:53:15PM +0200, Patrick McHardy wrote:
> Eric Dumazet wrote:
>> Stephen Hemminger a écrit :
>>> This is an alternative version of ip/ip6/arp tables locking using
>>> per-cpu locks.  This avoids the overhead of synchronize_net() during
>>> update but still removes the expensive rwlock in earlier versions.
>>>
>>> The idea for this came from an earlier version done by Eric Dumazet.
>>> Locking is done per-cpu: the fast path locks on the current cpu
>>> and updates counters.  The slow case involves acquiring the locks on
>>> all cpu's.  This version uses RCU for the table->base reference
>>> but per-cpu-lock for counters.
>
>> This version is a regression over 2.6.2[0-9], because of two points:
>>
>> 1) Many more atomic ops, because of the additional
>>
>>> +	spin_lock(&__get_cpu_var(ip_tables_lock));
>>> 	ADD_COUNTER(e->counters, ntohs(ip->tot_len), 1);
>>> +	spin_unlock(&__get_cpu_var(ip_tables_lock));
>>
>> added on each counter update.
>>
>> On many setups, each packet coming in or out of the machine has
>> to update between 2 and 20 rule counters.  So to avoid *one* atomic op
>> of read_unlock(), this v4 version adds 2 to 20 atomic ops...
>
> I agree, this seems to be a step backwards.
>
>> I still do not see the problem between the previous version (2.6.2[0-8]),
>> which had a central rwlock that hurt performance on SMP because of
>> cache-line ping-pong, and the solution having one rwlock per cpu.
>> We wanted to reduce the cache-line ping-pong first.  This *is* the
>> real pain point, by an order of magnitude.
>
> Dave doesn't seem to like the rwlock approach.

Well, we don't really need an rwlock, especially given that we really don't
want two "readers" incrementing the same counter concurrently.

A safer approach would be to maintain a flag in the task structure tracking
which (if any) of the per-CPU locks you hold.  Also maintain a recursion-depth
counter.  If the flag says you don't already hold the lock, set it and acquire
the lock.  Either way, increment the recursion-depth counter:

	if (current->netfilter_lock_held != cur_cpu) {
		BUG_ON(current->netfilter_lock_held != CPU_NONE);
		spin_lock(per_cpu(..., cur_cpu));
		current->netfilter_lock_held = cur_cpu;
	}
	current->netfilter_lock_nesting++;

And reverse the process to unlock:

	if (--current->netfilter_lock_nesting == 0) {
		spin_unlock(per_cpu(..., cur_cpu));
		current->netfilter_lock_held = CPU_NONE;
	}
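To make that a bit more concrete, here is a rough, untested sketch of what the
lock/unlock helpers might look like.  The two task-structure fields are new
(they would have to be added to struct task_struct and initialized at fork
time), CPU_NONE and the ipt_counter_lock()/ipt_counter_unlock() names are made
up purely for illustration, and ip_tables_lock is the per-CPU lock from
Stephen's patch:

	/*
	 * Illustrative sketch only, not actual kernel code.
	 *
	 * Assumes two new fields in struct task_struct, set up at fork time:
	 *	int netfilter_lock_held;	(CPU whose lock we hold, or CPU_NONE)
	 *	int netfilter_lock_nesting;	(recursion depth, initially 0)
	 */
	#include <linux/kernel.h>
	#include <linux/percpu.h>
	#include <linux/sched.h>
	#include <linux/smp.h>
	#include <linux/spinlock.h>

	#define CPU_NONE	(-1)

	/* One lock per CPU, as in Stephen's patch; each is spin_lock_init()ed at boot. */
	static DEFINE_PER_CPU(spinlock_t, ip_tables_lock);

	static void ipt_counter_lock(void)
	{
		int cur_cpu = smp_processor_id();  /* caller runs with preemption off */

		if (current->netfilter_lock_held != cur_cpu) {
			/* We must not already hold some other CPU's counter lock. */
			BUG_ON(current->netfilter_lock_held != CPU_NONE);
			spin_lock(&per_cpu(ip_tables_lock, cur_cpu));
			current->netfilter_lock_held = cur_cpu;
		}
		current->netfilter_lock_nesting++;
	}

	static void ipt_counter_unlock(void)
	{
		/* Drop the lock only when the outermost critical section ends. */
		if (--current->netfilter_lock_nesting == 0) {
			spin_unlock(&per_cpu(ip_tables_lock,
					     current->netfilter_lock_held));
			current->netfilter_lock_held = CPU_NONE;
		}
	}

The recursion counter is what lets the same context take the "lock" again
without deadlocking on the non-recursive spinlock, for example when a REJECT
target generates a packet that traverses the tables again on the same CPU.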
> I don't see a way to
> do anything asynchronously like call_rcu() to improve this, so to
> bring up one of Stephen's suggestions again:
>
>>> > 	* use on_each_cpu() somehow to do grace period?
>
> We could use this to replace the counters, presuming it is
> indeed faster than waiting for an RCU grace period.

One way to accomplish this is to take Mathieu Desnoyers's user-level RCU
implementation and drop it into the kernel, replacing the POSIX signal
handling with on_each_cpu(), smp_call_function(), or whatever.

>> 2) Second problem: potential OOM.
>>
>> About freeing old rules with call_rcu() and/or schedule_work(): this is
>> going to OOM pretty fast on small appliances with basic firewall setups
>> loading rules one by one, as done by the original topic reporter.
>>
>> We had reports from guys using Linux with 4MB of available RAM (French
>> provider free.fr on their appliance box), and we had to use the
>> SLAB_DESTROY_BY_RCU trick on conntrack to avoid OOM for their setups.
>> We don't want to use call_rcu() and queue 100 or 200 vfree() calls.
>
> Agreed.

This is not a real problem; it can be handled by doing a synchronize_rcu()
every so often, as noted in a prior email elsewhere in this thread:

	call_rcu(...);
	if (++count > 50) {
		synchronize_rcu();
		count = 0;
	}

This choice of constant would reduce the grace-period pain to 2% of the full
effect, which should be acceptable, at least if I am remembering the original
problem report of 0.2 seconds growing to 6.0 seconds correctly -- this would
give you:

	(6.0 - 0.2)/50 + 0.2 = 0.316 seconds

I would argue that this extra hundred-odd milliseconds is an OK penalty for a
deprecated feature.  But of course the per-CPU lock approach should avoid even
that penalty, albeit at some per-packet cost.  However, my guess is that this
per-packet cost is not measurable at the system level.  And if the penalty of
a single uncontended lock -is- measurable, I will be very quick to offer my
congratulations, at least once I get my jaw off my keyboard.  ;-)

							Thanx, Paul
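P.S.  For completeness, here is a rough, illustrative sketch of the batched
free discussed above.  The struct and function names are invented for the
example (this is not actual netfilter code), and the caller -- the table
replacement path -- is assumed to already be serialized by the usual mutex,
so the pending counter needs no locking of its own:

	#include <linux/kernel.h>
	#include <linux/rcupdate.h>
	#include <linux/slab.h>

	struct old_rules {
		struct rcu_head rcu;
		/* ... the replaced rule blob would hang off here ... */
	};

	static unsigned int old_rules_pending;

	static void old_rules_free_rcu(struct rcu_head *head)
	{
		kfree(container_of(head, struct old_rules, rcu));
	}

	/* Queue an old rule blob for freeing after a grace period. */
	static void old_rules_queue_free(struct old_rules *old)
	{
		call_rcu(&old->rcu, old_rules_free_rcu);

		/*
		 * Every 50th replacement, block for a grace period so the
		 * callback backlog stays bounded at roughly 50 stale blobs.
		 * This caps memory use on small boxes while paying the
		 * grace-period latency only 2% of the time.
		 */
		if (++old_rules_pending > 50) {
			synchronize_rcu();
			old_rules_pending = 0;
		}
	}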