On Mon, Jan 07, 2013 at 03:35:24PM -0800, Andrew Morton wrote:
> On Mon, 7 Jan 2013 15:21:15 -0800
> Kent Overstreet <koverstreet@xxxxxxxxxx> wrote:
> 
> > On Thu, Jan 03, 2013 at 03:34:14PM -0800, Andrew Morton wrote:
> > > On Wed, 26 Dec 2012 18:00:04 -0800
> > > Kent Overstreet <koverstreet@xxxxxxxxxx> wrote:
> > > 
> > > > So, for sticking kiocb completions on the kioctx ringbuffer, we need a
> > > > lock - it unfortunately can't be lockless.
> > > > 
> > > > When the kioctx is shared between threads on different cpus and the rate
> > > > of completions is high, this lock sees quite a bit of contention - in
> > > > terms of cacheline contention it's the hottest thing in the aio
> > > > subsystem.
> > > > 
> > > > That means, with a regular spinlock, we're going to take a cache miss
> > > > to grab the lock, then another cache miss when we touch the data the
> > > > lock protects - if it's on the same cacheline as the lock, other cpus
> > > > spinning on the lock are going to be pulling it out from under us as
> > > > we're using it.
> > > > 
> > > > So, we use an old trick to get rid of this second forced cache miss -
> > > > make the data the lock protects be the lock itself, so we grab them both
> > > > at once.
> > > Boy I hope you got that right.
> > > 
> > > Did you consider using bit_spin_lock() on the upper bit of `tail'?
> > > We've done that in other places and we at least know that it works.
> > > And it has the optimisations for CONFIG_SMP=n, understands
> > > CONFIG_DEBUG_SPINLOCK, has arch-specific optimisations, etc.
> > 
> > I hadn't thought of that - I think it'd suffer from the same problem as
> > a regular spinlock, where you grab the lock, then go to grab your data
> > but a different CPU grabbed the cacheline you need...
> Either you didn't understand my suggestion or I didn't understand your
> patch :(

Safest bet to blame me :p

> I'm suggesting that we use the most significant bit *of the data* as
> that data's lock. Obviously, all uses of that data would then mask that
> bit out.

Yeah, I got that.

> That way, the data will be brought into CPU cache when the lock is
> acquired. And when other CPUs attempt to acquire the lock, they won't
> steal the cacheline.
> 
> This is assuming that an unsuccessful test_and_set_bit_lock() won't
> grab the cacheline, which is hopefully true but I don't know. If this
> turns out to be false then we could add a test_bit() loop to
> bit_spin_lock(), or perhaps rework bit_spin_lock() to not do the
> test_and_set_bit_lock() unless test_bit() has just returned 0.

I was assuming it would grab the cacheline, but I could easily be
mistaken about that. I know spinning unsuccessfully on a spinlock will
grab the cacheline, and I would _think_ both test_and_set_bit_lock() and
test_bit() would as well, but since I haven't benchmarked or profiled it
yet I could just be making stuff up. I'll try it out.

> > But the lock debugging would be nice. It'd probably work to make
> > something generic like bit_spinlock() that also returns some value - or,
> > the recent patches for making spinlocks back off will also help with
> > this problem. So maybe between that and batch completion this patch
> > could be dropped at some point.
> > 
> > So, yeah. The code's plenty tested and I went over the barriers, it
> > already had all the needed barriers due to the ringbuffer... and I've
> > done this sort of thing elsewhere too. But it certainly is a hack and I
> > wouldn't be sad to see it go.
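For concreteness, here's roughly what I understand the suggestion to
look like - a completely untested sketch, where the struct, TAIL_LOCK_BIT
and ring_add_event() are all made up for illustration (this isn't the
actual fs/aio.c code). The top bit of tail is the lock, waiters spin on
test_bit() so they don't keep stealing the cacheline with failed atomic
ops, and everything that reads tail masks the bit out:

#include <linux/bitops.h>
#include <asm/processor.h>	/* cpu_relax() */

#define TAIL_LOCK_BIT	(BITS_PER_LONG - 1)
#define TAIL_MASK	(~(1UL << TAIL_LOCK_BIT))

struct ring_sketch {
	unsigned long	tail;	/* ring tail; the top bit doubles as its lock */
};

static void ring_add_event(struct ring_sketch *ring)
{
	unsigned long tail;

	/*
	 * Spin on test_bit() while the lock is held, so contended
	 * waiters read the cacheline shared instead of stealing it
	 * with failed atomic RMWs (the test-and-test-and-set idea
	 * discussed above).
	 */
	while (test_and_set_bit_lock(TAIL_LOCK_BIT, &ring->tail)) {
		do {
			cpu_relax();
		} while (test_bit(TAIL_LOCK_BIT, &ring->tail));
	}

	/* Lock is held and the cacheline is already ours: mask off the lock bit */
	tail = ring->tail & TAIL_MASK;

	/* ... stick the completion at position tail, then bump it ... */
	tail++;

	/* Store the new tail with the lock bit still set, then release */
	ring->tail = (tail & TAIL_MASK) | (1UL << TAIL_LOCK_BIT);
	clear_bit_unlock(TAIL_LOCK_BIT, &ring->tail);
}

The part that matters is that the test_and_set_bit_lock() that takes the
lock is also the access that pulls tail's cacheline in, which is the
whole point of the trick.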
> 
> Yes, there are a lot of issues with adding a new locking primitive and
> in some ways they get worse when they're open-coded like this. If
> there's any way at all of using a standard lock instead of KentLocks
> then we should do this.

Yeah, that's certainly true. Honestly, the performance improvements from
this trick were noticeable, but nothing like the rest of the patch
series - I really wouldn't complain if we just dropped this.

Oh, and bit_spin_lock() doesn't have an _irqsave() variant, bleh.
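If it grew one, I'd imagine it looking something like this - just a
sketch built from the existing primitives, with invented names, since no
such variant exists in <linux/bit_spinlock.h>:

#include <linux/bit_spinlock.h>
#include <linux/irqflags.h>

/*
 * Invented names: disable local interrupts before taking the bit lock,
 * the same way spin_lock_irqsave() does, so the lock can safely be
 * shared between process context and irq (completion) context.
 */
#define bit_spin_lock_irqsave(bitnum, addr, flags)		\
do {								\
	local_irq_save(flags);					\
	bit_spin_lock(bitnum, addr);				\
} while (0)

#define bit_spin_unlock_irqrestore(bitnum, addr, flags)		\
do {								\
	bit_spin_unlock(bitnum, addr);				\
	local_irq_restore(flags);				\
} while (0)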