On Thu, Aug 29, 2013 at 12:25 PM, Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> Hmm. I can see it, but it turns out that for normal pathname walking,
> one of the main stumbling blocks is the RCU case of complete_walk(),
> which cannot be done with the lockless lockref model.

[... snip ...]

Ok, here's a patch for people to try out if they want to. It's tested, and works for me, but it is - like the two preparatory patches I already applied - not really based on Waiman's patches except from a conceptual standpoint.

For architecture people (ie Ben, if you want to try this on ppc64), the things that it needs from an architecture:

 - the raw_spinlock_t and the "unsigned long" need to fit in a u64.

   This is normally true anyway, but if your architecture code has a big spinlock, you can't use it.

 - the architecture needs to support "cmpxchg()" on that "u64" type (so x86-32 does *not* use this, although we could teach it to do so by teaching x86-32 to use "cmpxchg8b" for the 8-byte case).

   A 64-bit architecture needs to support cmpxchg() on a u64 anyway, so this is always true on 64-bit. It _can_ be true on 32-bit architectures too.

 - the architecture needs to implement a simple "arch_spin_value_unlocked()" macro, which takes a raw_spinlock_t value and says whether it is unlocked or not.

   This is a new macro/function. It's generally a one-liner. For the x86 ticket locks, for example, the test is simply "lock.tickets.head == lock.tickets.tail".

 - the architecture code needs to let the generic code know that it supports all this by doing a "select ARCH_USE_CMPXCHG_LOCKREF". Add it to your Kconfig file in the appropriate place.

You do *not* need to worry about LOCKDEP etc, you only need to worry about your own architecture details above.

That's it. If it does that, the lockref code will use the cmpxchg optimization when appropriate (ie spinlock debugging is not turned on etc etc).

For Waiman: your patch had that adaptive waiting thing, and "wait for unlocked" code, and I threw all that away. I didn't like it, and the only reason for it existing was that the spinlock could be taken in a hot path, which I felt was against the whole point of this "lockref" thing.

So I fixed the VFS layer instead. With dput() and friends using lockrefs, the only thing remaining in the hot RCU dentry lookup path was the nasty __d_rcu_to_refcount() thing in complete_walk(). I rewrote that to locklessly increment the refcount when it was nonzero, and take the lock if it was zero, and that all seems to work fine.

And once the only case that is relevant for the fast path is "d_lock is unlocked", all your games with waiting for the spinlock to be released become unnecessary, which makes everything much simpler. If the spinlock isn't unlocked, we always kick out to the fallback case (with real locking).

NOTE! My test case was very small and simple, so it may not hit other cases that take d_lock in a hot path. Anything that kicks us out of RCU mode (like a symlink, for example) will trigger "unlazy_walk()", and I didn't do the same thing there. So there are still details like that to sort out, but I very much think this whole "only an unlocked spinlock is a fastpath" is the right approach.

My simple testing shows that this has about the same best-case performance, and the 15% _raw_spin_lock load I was able to trigger is totally gone.
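To make the scheme above concrete, here is a minimal, self-contained userspace sketch of the idea - not the actual patch. It packs a lock word and a reference count into one u64 and bumps the count with a 64-bit compare-and-swap only while the lock word reads as unlocked. The toy exchange-based lock (0 == unlocked) stands in for the arch raw_spinlock_t, the "old.lock == 0" test plays the role of arch_spin_value_unlocked(), and the names lockref_get_fast()/lockref_get() are purely illustrative:

/*
 * Illustrative sketch only, assuming GCC/Clang __atomic builtins in place
 * of the kernel's cmpxchg().  Not the real lockref code.
 */
#include <stdint.h>

struct lockref {
	union {
		uint64_t lock_count;		/* lock + count viewed as one u64 */
		struct {
			uint32_t lock;		/* 0 == unlocked in this sketch */
			uint32_t count;
		};
	};
};

/* Fast path: lockless increment, attempted only while the lock looks unlocked. */
static int lockref_get_fast(struct lockref *lr)
{
	struct lockref old, new;

	old.lock_count = __atomic_load_n(&lr->lock_count, __ATOMIC_RELAXED);
	while (old.lock == 0) {			/* only an unlocked lock is a fast path */
		new = old;
		new.count++;
		if (__atomic_compare_exchange_n(&lr->lock_count, &old.lock_count,
						new.lock_count, 0,
						__ATOMIC_SEQ_CST, __ATOMIC_RELAXED))
			return 1;		/* cmpxchg won: count bumped locklessly */
		/* a failed CAS reloads old.lock_count, so the lock word is re-checked */
	}
	return 0;				/* lock is held: fall back to real locking */
}

/* Toy stand-ins for spin_lock()/spin_unlock() on the embedded lock word. */
static void lock_slow(struct lockref *lr)
{
	while (__atomic_exchange_n(&lr->lock, 1, __ATOMIC_ACQUIRE))
		;				/* spin until we own the lock word */
}

static void unlock_slow(struct lockref *lr)
{
	__atomic_store_n(&lr->lock, 0, __ATOMIC_RELEASE);
}

/* Caller-side pattern: try the lockless fast path, otherwise take the lock. */
static void lockref_get(struct lockref *lr)
{
	if (lockref_get_fast(lr))
		return;				/* lockless increment succeeded */

	lock_slow(lr);				/* contended or held: lock for real */
	lr->count++;
	unlock_slow(lr);
}

The shape of the caller in lockref_get() is the point: the fast path never waits on the lock at all; if the lock word is held, it kicks straight out to the locked fallback, which is the "only an unlocked spinlock is a fastpath" rule described above.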
That doesn't make things *faster* for me (because the cost of the cmpxchg is pretty much comparable to the cost of the spinlocks), but the big difference is the contended behavior, where we don't actually have to wait for the spinlock - we can just locklessly increment the counter.

I can't trigger the CPU-eating contention case on my single-socket system, which is why I'm sending out this patch for testing (despite it not having that unlazy_walk() thing etc).

Also note that this is one single patch, not split up. Again, that's because what I'm really hoping to get is just "does this fix the contention case on the 80-core monster machine that I don't have access to?"

Side note: the whole cmpxchg() loop is written to basically generate the perfect cmpxchg sequence on x86. The assembly results actually look pretty good. I can't take advantage of the eflags setting of the instruction, because gcc inline asms don't support that (even with "asm goto" - I'd need to have output values for that, and "asm goto" does not allow that). So there's one extra "cmpq" instruction, and gcc makes a mess of "add 1 to a structure member in the high bytes of a 64-bit structure", but it's actually fairly readable and short assembly code, which was *not* true of the original patches.

Waiman? Mind looking at this and testing?

           Linus
Attachment: patch.diff