* Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:

> On Thu, 16 Apr 2009, Stephen Hemminger wrote:
> >
> > This version of x_tables (ip/ip6/arp) locking uses a per-cpu
> > rwlock that can be nested. It is sort of like the earlier brwlock
> > (fast reader, slow writer). The locking is isolated so future
> > improvements can concentrate on measuring/optimizing
> > xt_table_info_lock. I tried other versions based on recursive
> > spin locks and sequence counters, and for me the risk of
> > inventing our own locking primitives is not worth it at this time.
>
> This is still scary.
>
> Do we guarantee that read-locks nest in the presence of a waiting
> writer on another CPU? Now, I know we used to (ie readers always
> nested happily with readers even if there were pending writers),
> and then we broke it. I don't know that we ever unbroke it.
>
> IOW, at least at some point we deadlocked on this (due to trying
> to be fair, and not letting in readers while earlier writers were
> waiting):
>
> 	CPU#1			CPU#2
>
> 	read_lock
> 				write_lock
> 				.. spins with write bit set, waiting
> 				   for readers to go away ..
>
> 	recursive read_lock
> 	.. spins due to the write bit being set. BOOM: deadlock ..
>
> Now, I _think_ we avoid this, but somebody should double-check.

This is a narrow special-case where the spin-rwlock is safe and the
rwsem is unsafe.

But it should work for rwlocks - it always worked, and the networking
code has always relied on that, AFAIK.

Here's the x86 assembly code of the write-side slowpath:

ENTRY(__write_lock_failed)
	CFI_STARTPROC
	LOCK_PREFIX
	addl	$RW_LOCK_BIAS, (%rdi)
1:	rep nop
	cmpl	$RW_LOCK_BIAS, (%rdi)
	jne	1b
	LOCK_PREFIX
	subl	$RW_LOCK_BIAS, (%rdi)
	jnz	__write_lock_failed
	ret
	CFI_ENDPROC

The fastpath decreased the lock value by RW_LOCK_BIAS, and when we
enter this function we undo that effect by adding RW_LOCK_BIAS back.
Then we spin passively (without signalling our write intent) until
the count reaches RW_LOCK_BIAS again. Then we try to take the lock
once more, bringing it to zero (meaning no other readers or writers -
we got the lock); if another CPU beat us to it, we repeat the whole
sequence.

This is pretty much the most unfair strategy possible for writers -
but it is how rwlocks have always behaved, and they behave that way
mostly to allow recursive read-side use within networking.

This is also why the tasklist_lock was always so susceptible to
insane starvation symptoms on really large SMP systems, and why
write_lock_irq(&tasklist_lock) was always a dangerous operation to
do: it can spin for a _long_ time with irqs off.

It's not an optimal situation. Some patches are in the works to fix
the irqs-off artifact (on ia64 - no x86 patches yet, AFAICS) - but
that's just papering it over.

	Ingo
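
P.S.: for reference, the per-cpu scheme Stephen describes boils down
to roughly the following. This is an illustrative sketch of the idea
only, not the actual patch - the xt_info_* names and helper layout
here are assumptions, and it compiles only in-kernel:

	#include <linux/percpu.h>
	#include <linux/spinlock.h>

	/* one rwlock per cpu; each is rwlock_init()ed at init time
	 * (initialization omitted in this sketch) */
	static DEFINE_PER_CPU(rwlock_t, xt_info_locks);

	/* packet path: lock only this CPU's lock. read_lock may
	 * nest, so a recursive entry on the same CPU is fine. */
	static void xt_info_rdlock_bh(void)
	{
		local_bh_disable();
		read_lock(&__get_cpu_var(xt_info_locks));
	}

	static void xt_info_rdunlock_bh(void)
	{
		read_unlock(&__get_cpu_var(xt_info_locks));
		local_bh_enable();
	}

	/* table replace: take every CPU's lock for writing, which
	 * excludes all readers on all CPUs */
	static void xt_info_wrlock_all(void)
	{
		int cpu;

		for_each_possible_cpu(cpu)
			write_lock(&per_cpu(xt_info_locks, cpu));
	}

	static void xt_info_wrunlock_all(void)
	{
		int cpu;

		for_each_possible_cpu(cpu)
			write_unlock(&per_cpu(xt_info_locks, cpu));
	}

Linus' question is precisely about the read side of such a scheme:
read_lock must keep nesting on a CPU even while the (hypothetical)
xt_info_wrlock_all() is already spinning on that CPU's lock.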
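
To make the bias trick concrete, here is a minimal user-space C model
of the fastpath/slowpath pair, with C11 atomics standing in for the
LOCK-prefixed instructions. The model_* names are made up and this is
a sketch of the algorithm, not the kernel's implementation:

	#include <stdatomic.h>

	#define RW_LOCK_BIAS 0x01000000

	/* initialize with: model_rwlock l = { RW_LOCK_BIAS };
	 * i.e. no readers, no writer */
	typedef struct { atomic_int count; } model_rwlock;

	static void model_read_lock(model_rwlock *l)
	{
		for (;;) {
			/* fastpath: take one reader slot; the old
			 * value is positive iff no writer holds it */
			if (atomic_fetch_sub(&l->count, 1) > 0)
				return;
			/* slowpath: undo, then wait for the writer
			 * that HOLDS the lock to go away. A merely
			 * waiting writer has restored the bias (see
			 * __write_lock_failed above), so it cannot
			 * block us - this is what makes the recursive
			 * read_lock in Linus' scenario safe. */
			atomic_fetch_add(&l->count, 1);
			while (atomic_load(&l->count) < 1)
				; /* cpu_relax() */
		}
	}

	static void model_read_unlock(model_rwlock *l)
	{
		atomic_fetch_add(&l->count, 1);
	}

	static void model_write_lock(model_rwlock *l)
	{
		for (;;) {
			/* fastpath: claim the whole bias; the old
			 * value equals RW_LOCK_BIAS iff the lock was
			 * completely idle */
			if (atomic_fetch_sub(&l->count, RW_LOCK_BIAS)
			    == RW_LOCK_BIAS)
				return;
			/* this is __write_lock_failed: undo our write
			 * intent, spin passively until the lock is
			 * idle, then try again from the top */
			atomic_fetch_add(&l->count, RW_LOCK_BIAS);
			while (atomic_load(&l->count) != RW_LOCK_BIAS)
				; /* cpu_relax() */
		}
	}

	static void model_write_unlock(model_rwlock *l)
	{
		atomic_fetch_add(&l->count, RW_LOCK_BIAS);
	}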
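
And for contrast, a hypothetical "fair" write-side - one that keeps
its write intent published while it waits, roughly what an rwsem does
- reproduces Linus' deadlock exactly. Again just a sketch, assuming a
single writer for simplicity:

	/* DO NOT use - this demonstrates the deadlock, not a fix */
	static void fair_write_lock(model_rwlock *l)
	{
		/* publish write intent and leave it set */
		atomic_fetch_sub(&l->count, RW_LOCK_BIAS);
		/* wait for existing readers to drain; every new
		 * reader now sees count <= 0 and parks in its
		 * slowpath. If one of those parked readers already
		 * holds a read slot (the recursive read_lock on
		 * CPU#1), count never reaches zero: BOOM, deadlock. */
		while (atomic_load(&l->count) != 0)
			; /* cpu_relax() */
	}

This is why writer fairness and recursive read-locking don't mix: the
unfair spin-rwlock trades writer starvation for the ability to nest
read_lock safely.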