* Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:

> On Thu, 16 Apr 2009, Stephen Hemminger wrote:
> >
> > This version of x_tables (ip/ip6/arp) locking uses a per-cpu
> > rwlock that can be nested. It is sort of like the earlier brwlock
> > (fast reader, slow writer). The locking is isolated so future
> > improvements can concentrate on measuring/optimizing
> > xt_table_info_lock. I tried other versions based on recursive
> > spin locks and sequence counters, and for me the risk of
> > inventing our own locking primitives is not worth it at this time.
>
> This is still scary.
>
> Do we guarantee that read-locks nest in the presence of a waiting
> writer on another CPU? Now, I know we used to (ie readers always
> nested happily with readers even if there were pending writers),
> and then we broke it. I don't know that we ever unbroke it.
>
> IOW, at least at some point we deadlocked on this (due to trying
> to be fair, and not letting in readers while earlier writers were
> waiting):
>
> 	CPU#1			CPU#2
>
> 	read_lock
> 				write_lock
> 				.. spins with write bit set, waiting
> 				   for readers to go away ..
>
> 	recursive read_lock
> 	.. spins due to the write bit being set. BOOM: deadlock ..
>
> Now, I _think_ we avoid this, but somebody should double-check.

This is a narrow special-case where the spin-rwlock is safe and the
rwsem is unsafe.

But it should work for rwlocks - it always worked, and the networking
code has always relied on that, AFAIK.

Here's the x86 assembly code of the write-side slowpath:

ENTRY(__write_lock_failed)
	CFI_STARTPROC
	LOCK_PREFIX
	addl	$RW_LOCK_BIAS, (%rdi)
1:	rep nop
	cmpl	$RW_LOCK_BIAS, (%rdi)
	jne	1b
	LOCK_PREFIX
	subl	$RW_LOCK_BIAS, (%rdi)
	jnz	__write_lock_failed
	ret
	CFI_ENDPROC

The fastpath decreased the lock value by RW_LOCK_BIAS, and when we
enter this function we undo that effect by adding RW_LOCK_BIAS back.
Then we spin passively (without signalling our write intent) until
the count reaches RW_LOCK_BIAS again. Then we try to take the lock
once more, bringing it to zero (meaning no other readers or writers -
we got the lock); if another CPU beat us to it, we repeat the whole
sequence.

This is pretty much the most unfair strategy possible for writers -
but it is how rwlocks have always behaved, and they behave that way
mostly to allow recursive read-side use within networking.

This is also why the tasklist_lock was always so susceptible to
insane starvation symptoms on really large SMP systems, and why
write_lock_irq(&tasklist_lock) was always a dangerous operation to
do: it can spin for a _long_ time with irqs off.

It's not an optimal situation. Some patches are in the works to fix
the irqs-off artifact (on ia64 - no x86 patches yet, AFAICS) - but
that's just papering it over.

	Ingo
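
P.S.: for reference, the per-cpu scheme Stephen describes boils down
to roughly the following. This is an illustrative sketch of the idea
only, not the actual patch - the xt_info_* names and helper layout
here are assumptions, and it compiles only in-kernel:

	#include <linux/percpu.h>
	#include <linux/spinlock.h>

	/* one rwlock per cpu; each is rwlock_init()ed at init time
	 * (initialization omitted in this sketch) */
	static DEFINE_PER_CPU(rwlock_t, xt_info_locks);

	/* packet path: lock only this CPU's lock. read_lock may
	 * nest, so a recursive entry on the same CPU is fine. */
	static void xt_info_rdlock_bh(void)
	{
		local_bh_disable();
		read_lock(&__get_cpu_var(xt_info_locks));
	}

	static void xt_info_rdunlock_bh(void)
	{
		read_unlock(&__get_cpu_var(xt_info_locks));
		local_bh_enable();
	}

	/* table replace: take every CPU's lock for writing, which
	 * excludes all readers on all CPUs */
	static void xt_info_wrlock_all(void)
	{
		int cpu;

		for_each_possible_cpu(cpu)
			write_lock(&per_cpu(xt_info_locks, cpu));
	}

	static void xt_info_wrunlock_all(void)
	{
		int cpu;

		for_each_possible_cpu(cpu)
			write_unlock(&per_cpu(xt_info_locks, cpu));
	}

Linus' question is precisely about the read side of such a scheme:
read_lock must keep nesting on a CPU even while the (hypothetical)
xt_info_wrlock_all() is already spinning on that CPU's lock.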
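
To make the bias trick concrete, here is a minimal user-space C model
of the fastpath/slowpath pair, with C11 atomics standing in for the
LOCK-prefixed instructions. The model_* names are made up and this is
a sketch of the algorithm, not the kernel's implementation:

	#include <stdatomic.h>

	#define RW_LOCK_BIAS 0x01000000

	/* initialize with: model_rwlock l = { RW_LOCK_BIAS };
	 * i.e. no readers, no writer */
	typedef struct { atomic_int count; } model_rwlock;

	static void model_read_lock(model_rwlock *l)
	{
		for (;;) {
			/* fastpath: take one reader slot; the old
			 * value is positive iff no writer holds it */
			if (atomic_fetch_sub(&l->count, 1) > 0)
				return;
			/* slowpath: undo, then wait for the writer
			 * that HOLDS the lock to go away. A merely
			 * waiting writer has restored the bias (see
			 * __write_lock_failed above), so it cannot
			 * block us - this is what makes the recursive
			 * read_lock in Linus' scenario safe. */
			atomic_fetch_add(&l->count, 1);
			while (atomic_load(&l->count) < 1)
				; /* cpu_relax() */
		}
	}

	static void model_read_unlock(model_rwlock *l)
	{
		atomic_fetch_add(&l->count, 1);
	}

	static void model_write_lock(model_rwlock *l)
	{
		for (;;) {
			/* fastpath: claim the whole bias; the old
			 * value equals RW_LOCK_BIAS iff the lock was
			 * completely idle */
			if (atomic_fetch_sub(&l->count, RW_LOCK_BIAS)
			    == RW_LOCK_BIAS)
				return;
			/* this is __write_lock_failed: undo our write
			 * intent, spin passively until the lock is
			 * idle, then try again from the top */
			atomic_fetch_add(&l->count, RW_LOCK_BIAS);
			while (atomic_load(&l->count) != RW_LOCK_BIAS)
				; /* cpu_relax() */
		}
	}

	static void model_write_unlock(model_rwlock *l)
	{
		atomic_fetch_add(&l->count, RW_LOCK_BIAS);
	}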
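
And for contrast, a hypothetical "fair" write-side - one that keeps
its write intent published while it waits, roughly what an rwsem does
- reproduces Linus' deadlock exactly. Again just a sketch, assuming a
single writer for simplicity:

	/* DO NOT use - this demonstrates the deadlock, not a fix */
	static void fair_write_lock(model_rwlock *l)
	{
		/* publish write intent and leave it set */
		atomic_fetch_sub(&l->count, RW_LOCK_BIAS);
		/* wait for existing readers to drain; every new
		 * reader now sees count <= 0 and parks in its
		 * slowpath. If one of those parked readers already
		 * holds a read slot (the recursive read_lock on
		 * CPU#1), count never reaches zero: BOOM, deadlock. */
		while (atomic_load(&l->count) != 0)
			; /* cpu_relax() */
	}

This is why writer fairness and recursive read-locking don't mix: the
unfair spin-rwlock trades writer starvation for the ability to nest
read_lock safely.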