Re: [PATCH 19/19] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath

"Paul E. McKenney" <paulmck@xxxxxxxxxxxxxxxxxx> · Tue, 13 May 2014 08:27:19 -0700

On Tue, May 13, 2014 at 04:17:48PM +0200, Peter Zijlstra wrote:
> On Tue, May 13, 2014 at 01:53:13PM +0100, Mel Gorman wrote:
> > On Tue, May 13, 2014 at 10:45:50AM +0100, Mel Gorman wrote:
> > >  void unlock_page(struct page *page)
> > >  {
> > > +	wait_queue_head_t *wqh = clear_page_waiters(page);
> > > +
> > >  	VM_BUG_ON_PAGE(!PageLocked(page), page);
> > > +
> > > +	/*
> > > +	 * No additional barrier needed due to clear_bit_unlock barriering all updates
> > > +	 * before waking waiters
> > > +	 */
> > >  	clear_bit_unlock(PG_locked, &page->flags);
> > > -	smp_mb__after_clear_bit();
> > > -	wake_up_page(page, PG_locked);
> > 
> > This is wrong. The smp_mb__after_clear_bit() is still required to ensure
> > that the cleared bit is visible before the wakeup on all architectures.
> 
> wakeup implies a mb, and I just noticed that our Documentation is
> 'obsolete' and only mentions it implies a wmb.
> 
> Also, if you're going to use smp_mb__after_atomic() you can use
> clear_bit() and not use clear_bit_unlock().
> 
> 
> 
> ---
> Subject: doc: Update wakeup barrier documentation
> 
> As per commit e0acd0a68ec7 ("sched: fix the theoretical signal_wake_up()
> vs schedule() race") both wakeup and schedule now imply a full barrier.
> 
> Furthermore, the barrier is unconditional when calling try_to_wake_up()
> and has been for a fair while.
> 
> Cc: Oleg Nesterov <oleg@xxxxxxxxxx>
> Cc: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
> Cc: David Howells <dhowells@xxxxxxxxxx>
> Cc: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>
> Signed-off-by: Peter Zijlstra <peterz@xxxxxxxxxxxxx>

Some questions below.

							Thanx, Paul

> ---
>  Documentation/memory-barriers.txt | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> index 46412bded104..dae5158c2382 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -1881,9 +1881,9 @@ The whole sequence above is available in various canned forms, all of which
>  	event_indicated = 1;
>  	wake_up_process(event_daemon);
> 
> -A write memory barrier is implied by wake_up() and co. if and only if they wake
> -something up.  The barrier occurs before the task state is cleared, and so sits
> -between the STORE to indicate the event and the STORE to set TASK_RUNNING:
> +A full memory barrier is implied by wake_up() and co. The barrier occurs

Last I checked, the memory barrier was guaranteed only if a wakeup
actually occurred.  If there is a sleep-wakeup race, for example,
between wait_event_interruptible() and wake_up(), then it looks to me
that the following can happen:

o	Task A invokes wait_event_interruptible(), waiting for
	X==1.

o	Before Task A gets anywhere, Task B sets Y=1, does
	smp_mb(), then sets X=1.

o	Task B invokes wake_up(), which invokes __wake_up(), which
	acquires the wait_queue_head_t's lock and invokes
	__wake_up_common(), which sees nothing to wake up.

o	Task A tests the condition, finds X==1, and returns without
	locks, memory barriers, atomic instructions, or anything else
	that would guarantee ordering.

o	Task A then loads from Y.  Because there have been no memory
	barriers, it might well see Y==0.

So what am I missing here?

On the other hand, if a wake_up() really does happen, then
the fast-path out of wait_event_interruptible() is not taken,
and __wait_event_interruptible() is called instead.  This calls
___wait_event(), which eventually calls prepare_to_wait_event(), which
in turn calls set_current_state(), which calls set_mb(), which does a
full memory barrier.  And if that isn't good enough, there is the
call to schedule() itself.  ;-)

So if a wait actually sleeps, it does imply a full memory barrier
several times over.

On the wake_up() side, wake_up() calls __wake_up(), which as mentioned
earlier calls __wake_up_common() under a lock.  This invokes the
wake-up function stored by the sleeping task, for example,
autoremove_wake_function(), which calls default_wake_function(),
which invokes try_to_wake_up(), which does smp_mb__before_spinlock()
before acquiring the to-be-waked task's PI lock.

The definition of smp_mb__before_spinlock() is smp_wmb().  There is
also an smp_rmb() in try_to_wake_up(), which still does not get us
to a full memory barrier.  It also calls select_task_rq(), which
does not seem to guarantee any particular memory ordering (but
I could easily have missed something).  It also calls ttwu_queue(),
which invokes ttwu_do_activate() under the RQ lock.  I don't see a
full memory barrier in ttwu_do_activate(), but again could easily
have missed one.  Ditto for ttwu_stat().

All the locks nest, so other than the smp_wmb() and smp_rmb(), things
could bleed in.

> +before the task state is cleared, and so sits between the STORE to indicate
> +the event and the STORE to set TASK_RUNNING:

If I am in fact correct, and if we really want to advertise the read
memory barrier, I suggest the following replacement text:

	A read and a write memory barrier (-not- a full memory barrier)
	are implied by wake_up() and co. if and only if they wake
	something up.  The write barrier occurs before the task state is
	cleared, and so sits between the STORE to indicate the event and
	the STORE to set TASK_RUNNING, and the read barrier after that:

	CPU 1				CPU 2
	===============================	===============================
	set_current_state();		STORE event_indicated
	  set_mb();			wake_up();
	    STORE current->state	  <write barrier>
	    <general barrier>		  STORE current->state
	LOAD event_indicated		  <read barrier>

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html