On Thu, Aug 29, 2013 at 7:06 PM, Michael Neuling <mikey@xxxxxxxxxxx> wrote: > > Running on a POWER7 here with 32 threads (8 cores x 4 threads) I'm > getting some good improvements: That's *much* better than I get. But I literally just have a single socket with two cores (and HT, so four threads) in my test machine, so I really have a hard time getting any real contention. And the main advantage of the patch should be when you actually have CPU's spinning on that dentry d_lock. Also, on x86, there are no advantages to cmpxchg over a spinlock - they are both exactly one equally serializing instruction. If anything, cmpxchg is worse due to having a cache read before the write, and a few cycles slower anyway. So I actually expect the x86 code to slow down a tiny bit for the single-threaded case, although that should be hopefully unmeasurable. On POWER, you may have much less serialization for the cmpxchg. That may sadly be something we'll need to fix - the serialization between getting a lockref and checking sequence counts etc may need some extra work. So it may be that you are seeing unrealistically good numbers, and that we will need to add a memory barrier or two. On x86, due to the locked instruction semantics, that just isn't an issue. > The numbers move around about 10% from run to run. Please note that the whole "dentry hash chains may be better" for one run vs another, and that's something that will _persist_ between subsequent runs, so you may see "only 10% variability", but there may be a bigger picture variability that you're not noticing because you had to reboot in between. To be really comparable, you should really run the stupid benchmark after fairly equal boot up sequences. If the machine had been up for several days for one set of numbers, and freshly rebooted for the other, it can be a very unfair comparison. (I long ago had a nice "L1 dentry cache" patch that helped with the fact that the dentry chains *can* get long especially if you have tons of memory, and that helped with this kind of variability a lot - and improved performance too. It was slightly racy, though, which is why it never got merged). > powerpc patch below. I'm using arch_spin_is_locked() to implement > arch_spin_value_unlocked(). Your "slock" is of type "volatile unsigned int slock", so it may well cause those temporaries to be written to memory. It probably doesn't matter, but you may want to check that the result of "make lib/lockref.s" looks ok. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html