On Thu, May 22, 2014 at 08:45:29AM +0200, Peter Zijlstra wrote: > On Wed, May 21, 2014 at 02:50:00PM -0700, Andrew Morton wrote: > > On Wed, 21 May 2014 23:33:54 +0200 Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote: > > > Alternative solution is not to merge the patch ;) > > There is always that.. :-) > > > > Yeah, so we only clear that bit when at 'unlock' we find there are no > > > more pending waiters, so if the last unlock still had a waiter, we'll > > > leave the bit set. > > > > Confused. If the last unlock had a waiter, that waiter will get woken > > up so there are no waiters any more, so the last unlock clears the flag. > > > > um, how do we determine that there are no more waiters? By looking at > > the waitqueue. But that waitqueue is hashed, so it may contain waiters > > for other pages so we're screwed? But we could just go and wake up the > > other-page waiters anyway and still clear PG_waiters? > > > > um2, we're using exclusive waitqueues so we can't (or don't) wake all > > waiters, so we're screwed again? > > Ah, so leave it set. Then when we do an uncontended wakeup, that is a > wakeup where there are _no_ waiters left, we'll iterate the entire > hashed queue, looking for a matching page. > > We'll find none, and only then clear the bit. > Yes, sorry that was not clear. > > > (This process is proving to be a hard way of writing Mel's changelog btw). > > Agreed :/ > I've lost sight of what is obvious and what is not. The introduction now reads This patch introduces a new page flag for 64-bit capable machines, PG_waiters, to signal there are *potentially* processes waiting on PG_lock or PG_writeback. If there are no possible waiters then we avoid barriers, a waitqueue hash lookup and a failed wake_up in the unlock_page and end_page_writeback paths. There is no guarantee that waiters exist if PG_waiters is set as multiple pages can hash to the same waitqueue and we cannot accurately detect if a waking process is the last waiter without a reference count. When this happens, the bit is left set and the next unlock or writeback completion will lookup the waitqueue and clear the bit when there are no collisions. This adds a few branches to the fast path but avoids bouncing a dirty cache line between CPUs. 32-bit machines always take the slow path but the primary motivation for this patch is large machines so I do not think that is a concern. > > If I'm still on track here, what happens if we switch to wake-all so we > > can avoid the dangling flag? I doubt if there are many collisions on > > that hash table? > > Wake-all will be ugly and loose a herd of waiters, all racing to > acquire, all but one of whoem will loose the race. It also looses the > fairness, its currently a FIFO queue. Wake-all will allow starvation. > And the cost of the thundering herd of waiters may offset any benefit of reducing the number of calls to page_waitqueue and waker functions. > > If there *are* a lot of collisions, I bet it's because a great pile of > > threads are all waiting on the same page. If they're trying to lock > > that page then wake-all is bad. But if they're just waiting for IO > > completion (probable) then it's OK. > > Yeah, I'm not entirely sure on the rationale for adding PG_waiters to > writeback completion, and yes PG_writeback is a wake-all. tmpfs was the most obvious one. We were doing a useless lookup almost every time writeback completed for async streaming writers. I suspected it would also apply to normal filesystems if backed by fast storage. There is not much to gain by continuing to use __wake_up_bit in the writeback pathso when PG_waiters is available. Only the first waiters incurs the SetPageWaiters penalty. In the uncontended case, neither take locks (one approach checks waitqueue_active outside the lock, the other checks PageWaiters). Both approaches end up taking q->lock either in __wake_up_bit->__wake_up or __wake_up_page_bit but in some cases __wake_up_page_bit. -- Mel Gorman SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html