On Wed, May 21, 2014 at 02:26:22PM -0700, Andrew Morton wrote:
> On Wed, 21 May 2014 13:15:01 +0100 Mel Gorman <mgorman@xxxxxxx> wrote:
> 
> > Andrew had suggested dropping v4 of the patch entirely as the numbers were
> > marginal and the complexity was high. However, even on a relatively small
> > machine running simple workloads, the overhead of page_waitqueue and the
> > wakeup functions is around 5% of system CPU time. That's quite high for
> > such basic operations, so I felt it was worth another shot. The performance
> > figures are better with this version than they were for v4 and overall the
> > patch should be more comprehensible.
> > 
> > Changelog since v4
> > o Remove dependency on io_schedule_timeout
> > o Push waiting logic down into waitqueue
> > 
> > This patch introduces a new page flag for 64-bit capable machines,
> > PG_waiters, to signal that there are processes waiting on PG_lock, and
> > uses it to avoid memory barriers and the waitqueue hash lookup in the
> > unlock_page fastpath.
> > 
> > This adds a few branches to the fast path but avoids bouncing a dirty
> > cache line between CPUs. 32-bit machines always take the slow path, but
> > the primary motivation for this patch is large machines so I do not think
> > that is a concern.
> > 
> > The test case used to evaluate this is a simple dd of a large file done
> > multiple times with the file deleted on each iteration. The size of
> > the file is 1/10th of physical memory to avoid dirty page balancing. In
> > the async case it is possible that the workload completes without even
> > hitting the disk and will have variable results, but it highlights the
> > impact of mark_page_accessed for async IO. The sync results are expected
> > to be more stable. The exception is tmpfs, where the normal case is for
> > the "IO" to not hit the disk.
> > 
> > The test machine was single socket and UMA to avoid any scheduling or
> > NUMA artifacts. Throughput and wall times are presented for sync IO; only
> > wall times are shown for async, as the granularity reported by dd and the
> > variability make it unsuitable for comparison. As the async results were
> > variable due to writeback timings, I'm only reporting the maximum figures.
> > The sync results were stable enough to make the mean and stddev
> > uninteresting.
> > 
> > The performance results are reported based on a run with no profiling.
> > Profile data is based on a separate run with oprofile running. The
> > kernels being compared are "accessed-v2", which is the patch series up
> > to this patch, whereas "lockpage-v2" includes this patch.
> >
> > ...
> >
> > --- a/include/linux/wait.h
> > +++ b/include/linux/wait.h
> > @@ -147,8 +147,13 @@ void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode, int nr, void *k
> >  void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr);
> >  void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr);
> >  void __wake_up_bit(wait_queue_head_t *, void *, int);
> > +void __wake_up_page_bit(wait_queue_head_t *, struct page *page, void *, int);
> 
> You're going to need to forward-declare struct page in wait.h.  The
> good thing about this is that fewer people will notice that we've gone
> and mentioned struct page in wait.h :(
> 

Will add the forward-declare.
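Concretely, I expect the wait.h side to end up looking something like the
following. This is a sketch only; the tail of the __wait_on_page_bit
prototype is filled in to mirror the wait.c definition further down, and
the final placement in the header may differ:

	struct page;	/* forward declaration, so wait.h needs no mm headers */

	void __wake_up_page_bit(wait_queue_head_t *, struct page *page,
				void *, int);
	int __wait_on_page_bit(wait_queue_head_t *, struct wait_bit_queue *,
				struct page *page,
				int (*)(void *), unsigned);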
> >  int __wait_on_bit(wait_queue_head_t *, struct wait_bit_queue *, int (*)(void *), unsigned);
> > +int __wait_on_page_bit(wait_queue_head_t *, struct wait_bit_queue *,
> >
> > ...
> >
> > --- a/kernel/sched/wait.c
> > +++ b/kernel/sched/wait.c
> > @@ -167,31 +167,39 @@ EXPORT_SYMBOL_GPL(__wake_up_sync);	/* For internal use only */
> >   * stops them from bleeding out - it would still allow subsequent
> >   * loads to move into the critical region).
> >   */
> > -void
> > -prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
> > +static inline void
> > +__prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait,
> > +			struct page *page, int state, bool exclusive)
> 
> Putting MM stuff into core waitqueue code is rather bad.  I really
> don't know how I'm going to explain this to my family.
> 

The alternative to updating wait.h and wait.c was open-coding the
waitqueue modifications in filemap.c, but that is just as ugly. The
waitqueue code is complex and there was motivation to keep it in one
place, even if we are special-casing struct page handling.

FWIW, I cannot explain anything I do in work to my family. It gets blank
looks no matter what.

> >  {
> >  	unsigned long flags;
> >  
> > -	wait->flags &= ~WQ_FLAG_EXCLUSIVE;
> >  	spin_lock_irqsave(&q->lock, flags);
> > -	if (list_empty(&wait->task_list))
> > -		__add_wait_queue(q, wait);
> > +	if (page && !PageWaiters(page))
> > +		SetPageWaiters(page);
> 
> And this isn't racy because we're assuming that all users of `page' are
> using the same waitqueue.  ie, assuming all callers use
> page_waitqueue()?  Subtle, unobvious, worth documenting.
> 

All users of the page will get the same waitqueue. page_waitqueue is
hashed on the pointer value.
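For reference, page_waitqueue() is roughly the following in the current
tree (quoting from memory, so treat it as a sketch). Every caller passing
the same struct page pointer hashes to the same per-zone wait_queue_head,
which is why setting and clearing PG_waiters under that queue's lock is
safe:

	static wait_queue_head_t *page_waitqueue(struct page *page)
	{
		const struct zone *zone = page_zone(page);

		/* same page pointer -> same hash -> same waitqueue */
		return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)];
	}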
> > +	if (list_empty(&wait->task_list)) {
> > +		if (exclusive) {
> > +			wait->flags |= WQ_FLAG_EXCLUSIVE;
> > +			__add_wait_queue_tail(q, wait);
> > +		} else {
> > +			wait->flags &= ~WQ_FLAG_EXCLUSIVE;
> > +			__add_wait_queue(q, wait);
> > +		}
> > +	}
> >  	set_current_state(state);
> >  	spin_unlock_irqrestore(&q->lock, flags);
> >  }
> >
> > ...
> >
> > @@ -228,7 +236,8 @@ EXPORT_SYMBOL(prepare_to_wait_event);
> >   * the wait descriptor from the given waitqueue if still
> >   * queued.
> >   */
> > -void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
> > +static inline void __finish_wait(wait_queue_head_t *q, wait_queue_t *wait,
> > +				struct page *page)
> 
> Thusly does kerneldoc bitrot.
> 

Now, I am become Rot, Destroyer of Kerneldoc.

Kerneldoc comment moved to the correct location.

> >  {
> >  	unsigned long flags;
> >  
> > @@ -249,9 +258,16 @@ void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
> >  	if (!list_empty_careful(&wait->task_list)) {
> >  		spin_lock_irqsave(&q->lock, flags);
> >  		list_del_init(&wait->task_list);
> > +		if (page && !waitqueue_active(q))
> > +			ClearPageWaiters(page);
> 
> And again, the assumption that all users of this page use the same
> waitqueue avoids the races?
> 

Yes. There is no guarantee that the bit will be cleared once the last
waiter on that specific page is gone if there are waitqueue collisions,
but detecting that accurately would require ref counts. Will stick in a
comment.

> >  		spin_unlock_irqrestore(&q->lock, flags);
> >  	}
> >  }
> > +
> > +void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
> > +{
> > +	return __finish_wait(q, wait, NULL);
> > +}
> >  EXPORT_SYMBOL(finish_wait);
> >
> >  /**
> > @@ -331,6 +347,22 @@ __wait_on_bit(wait_queue_head_t *wq, struct wait_bit_queue *q,
> >  	finish_wait(wq, &q->wait);
> >  	return ret;
> >  }
> > +
> > +int __sched
> > +__wait_on_page_bit(wait_queue_head_t *wq, struct wait_bit_queue *q,
> > +			struct page *page,
> > +			int (*action)(void *), unsigned mode)
> 
> Comment over __wait_on_bit needs updating.
> 
> > +{
> > +	int ret = 0;
> > +
> > +	do {
> > +		__prepare_to_wait(wq, &q->wait, page, mode, false);
> > +		if (test_bit(q->key.bit_nr, q->key.flags))
> > +			ret = (*action)(q->key.flags);
> > +	} while (test_bit(q->key.bit_nr, q->key.flags) && !ret);
> > +	__finish_wait(wq, &q->wait, page);
> > +	return ret;
> > +}
> 
> __wait_on_bit() can now become a wrapper which calls this with page==NULL?
> 

Yep. Early in development that would have looked ugly, but not in the
final version; I simply failed to fix it up.

> >  EXPORT_SYMBOL(__wait_on_bit);
> 
> This export is now misplaced.
> 
> >  int __sched out_of_line_wait_on_bit(void *word, int bit,
> > @@ -344,6 +376,27 @@ int __sched out_of_line_wait_on_bit(void *word, int bit,
> >  EXPORT_SYMBOL(out_of_line_wait_on_bit);
> >
> >  int __sched
> > +__wait_on_page_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
> > +			struct page *page,
> > +			int (*action)(void *), unsigned mode)
> > +{
> > +	do {
> > +		int ret;
> > +
> > +		__prepare_to_wait(wq, &q->wait, page, mode, true);
> > +		if (!test_bit(q->key.bit_nr, q->key.flags))
> > +			continue;
> > +		ret = action(q->key.flags);
> > +		if (!ret)
> > +			continue;
> > +		abort_exclusive_wait(wq, &q->wait, mode, &q->key);
> > +		return ret;
> > +	} while (test_and_set_bit(q->key.bit_nr, q->key.flags));
> > +	__finish_wait(wq, &q->wait, page);
> > +	return 0;
> > +}
> 
> You are in a maze of twisty little functions, all alike.  Perhaps some
> rudimentary documentation here?  Like what on earth does
> __wait_on_page_bit_lock() actually do?  And `mode'.
> 

Most of the useful commentary on what this does is in the kerneldoc
comment for wait_on_bit, including the meaning of "mode".

> > +int __sched
> >  __wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
> >  		int (*action)(void *), unsigned mode)
> 
> Perhaps __wait_on_bit_lock() can become a wrapper around
> __wait_on_page_bit_lock().
> 

Yes.

> >
> > ...
> >
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -67,6 +67,10 @@ static void __page_cache_release(struct page *page)
> >  static void __put_single_page(struct page *page)
> >  {
> >  	__page_cache_release(page);
> > +
> > +	/* Clear dangling waiters from collisions on page_waitqueue */
> > +	__ClearPageWaiters(page);
> 
> What's this collisions thing?
> 

I'll expand the comment.

> >  	free_hot_cold_page(page, false);
> >  }
> >
> >
> > ...
> >
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1096,6 +1096,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> >  		 * waiting on the page lock, because there are no references.
> >  		 */
> >  		__clear_page_locked(page);
> > +		__ClearPageWaiters(page);
> 
> We're freeing the page - if someone is still waiting on it then we have
> a huge bug?  It's the mysterious collision thing again I hope?
> 

Yes. Freeing a page with active waiters would also mean freeing a page
with an elevated ref count. The page allocator should catch that and
scream.

-- 
Mel Gorman
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html