Changelog since v7
o Further optimisation when PG_waiters is not available		(peterz)
o Catch all opportunities to ClearPageWaiters			(peterz)

Changelog since v6
o Optimisation when PG_waiters is not available			(peterz)
o Documentation

Changelog since v5
o __always_inline where appropriate				(peterz)
o Documentation							(akpm)

Changelog since v4
o Remove dependency on io_schedule_timeout
o Push waiting logic down into waitqueue

This patch introduces a new page flag for 64-bit capable machines,
PG_waiters, to signal that there are *potentially* processes waiting on
PG_lock or PG_writeback. If there are no possible waiters then we avoid
barriers, a waitqueue hash lookup and a failed wake_up in the
unlock_page and end_page_writeback paths.

There is no guarantee that waiters exist if PG_waiters is set as
multiple pages can hash to the same waitqueue and we cannot accurately
detect if a waking process is the last waiter without a reference
count. When this happens, the bit is left set and a future unlock or
writeback completion will look up the waitqueue and clear the bit when
there are no collisions. This adds a few branches to the fast path but
avoids bouncing a dirty cache line between CPUs. 32-bit machines always
take the slow path, but the primary motivation for this patch is large
machines so I do not think that is a concern.
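To make the fast path concrete, here is a minimal userspace sketch of
the idea rather than the kernel code itself; toy_page, PG_LOCKED,
PG_WAITERS and wake_up_toy_page() are all invented names for
illustration:

	/*
	 * Illustrative sketch only: a toy page with an atomic flags
	 * word. The real patch operates on struct page with kernel
	 * bitops and hashed waitqueues.
	 */
	#include <stdatomic.h>

	#define PG_LOCKED	(1UL << 0)
	#define PG_WAITERS	(1UL << 1)

	struct toy_page {
		_Atomic unsigned long flags;
	};

	/* Slow path: hash the page to its waitqueue, wake a sleeper. */
	void wake_up_toy_page(struct toy_page *page);

	void unlock_toy_page(struct toy_page *page)
	{
		/*
		 * Clear the lock bit; seq_cst here stands in for the
		 * kernel's clear_bit_unlock() + smp_mb__after_atomic()
		 * pairing.
		 */
		unsigned long old = atomic_fetch_and(&page->flags,
						     ~PG_LOCKED);

		/*
		 * Fast path: skip the waitqueue lookup and wakeup
		 * entirely when no sleeper has set PG_WAITERS. The bit
		 * may be set spuriously by hash collisions, so this can
		 * over-wake, but a waiter always publishes PG_WAITERS
		 * before sleeping so a wakeup is never missed.
		 */
		if (old & PG_WAITERS)
			wake_up_toy_page(page);
	}

The uncontended unlock touches only the page's own flags word; the
shared waitqueue hash and its lock are only reached when the hint bit
says they may be needed.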
The test case used to evaluate this is a simple dd of a large file done
multiple times with the file deleted on each iteration. The size of the
file is 1/10th of physical memory to avoid dirty page balancing. After
each dd there is a sync so the reported times do not vary much.
Measuring the time the async dd takes highlights the impact of
page_waitqueue overhead on async IO. The test machine was single socket
and UMA to avoid any scheduling or NUMA artifacts. The performance
results are reported based on a run with no profiling. Profile data is
based on a separate run with oprofile running.

async dd                       3.15.0-rc5            3.15.0-rc5
                                    mmotm           lockpage-v8
btrfs Max  ddtime      0.5863 (  0.00%)      0.5593 (  4.61%)
ext3  Max  ddtime      1.4870 (  0.00%)      1.4609 (  1.76%)
ext4  Max  ddtime      1.0440 (  0.00%)      1.0376 (  0.61%)
tmpfs Max  ddtime      0.3541 (  0.00%)      0.3478 (  1.76%)
xfs   Max  ddtime      0.4995 (  0.00%)      0.4762 (  4.65%)

A separate run with profiles showed this

     samples percentage
ext3  225851     2.3180 vmlinux-3.15.0-rc5-mmotm        test_clear_page_writeback
ext3  106848     1.0966 vmlinux-3.15.0-rc5-mmotm        __wake_up_bit
ext3   71849     0.7374 vmlinux-3.15.0-rc5-mmotm        page_waitqueue
ext3   40319     0.4138 vmlinux-3.15.0-rc5-mmotm        unlock_page
ext3   26243     0.2693 vmlinux-3.15.0-rc5-mmotm        end_page_writeback
ext3  203718     2.1020 vmlinux-3.15.0-rc5-lockpage-v8  test_clear_page_writeback
ext3   64004     0.6604 vmlinux-3.15.0-rc5-lockpage-v8  unlock_page
ext3   24753     0.2554 vmlinux-3.15.0-rc5-lockpage-v8  end_page_writeback
ext3    8618     0.0889 vmlinux-3.15.0-rc5-lockpage-v8  __wake_up_bit
ext3    7247     0.0748 vmlinux-3.15.0-rc5-lockpage-v8  __wake_up_page_bit
ext3    2012     0.0208 vmlinux-3.15.0-rc5-lockpage-v8  page_waitqueue

The profiles show a clear reduction in the waitqueue and wakeup
functions. Note that end_page_writeback costs about the same: the
savings there come from reduced calls to __wake_up_bit and
page_waitqueue, so there is no obvious direct saving. The cost of
unlock_page is higher as it now checks PageWaiters, but that is offset
by the reduced number of calls to page_waitqueue and __wake_up_bit. A
similar story is told for each of the filesystems.
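The collision caveat above falls out of how pages map to waitqueues:
page_waitqueue() hashes the struct page pointer into a fixed-size
per-zone table, so unrelated pages can share a queue. A simplified
model of that mapping follows; the table size and hash constant are
arbitrary stand-ins:

	/*
	 * Simplified model of page_waitqueue(): the kernel hashes the
	 * struct page pointer into a fixed table of waitqueues, so two
	 * unrelated pages can land on the same queue.
	 */
	#include <stdint.h>

	#define WAIT_TABLE_BITS	8
	#define WAIT_TABLE_SIZE	(1U << WAIT_TABLE_BITS)

	struct toy_waitqueue_head { int nr_waiters; };

	static struct toy_waitqueue_head wait_table[WAIT_TABLE_SIZE];

	static struct toy_waitqueue_head *toy_page_waitqueue(const void *page)
	{
		/*
		 * Multiplicative hash of the pointer, in the spirit of
		 * hash_ptr(); assumes 64-bit pointers.
		 */
		uint64_t h = (uint64_t)(uintptr_t)page *
			     0x9E3779B97F4A7C15ULL;

		return &wait_table[h >> (64 - WAIT_TABLE_BITS)];
	}

Because two distinct pages can hash to the same entry,
waitqueue_active() cannot tell whose waiters are queued, which is why a
set PG_waiters bit is only a hint and may be left behind by a
collision.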
Note that for workloads that contend heavily on the page lock,
unlock_page may increase in cost as it has to clear PG_waiters, so
while the typical case should be much faster, the worst-case costs are
now higher. This is also reflected in the time taken to mmap a range of
pages. These are the results for xfs only but the other filesystems
tell a similar story.

                           3.15.0-rc5            3.15.0-rc5
                                mmotm           lockpage-v8
Procs 107M      423.0000 (  0.00%)      409.0000 (  3.31%)
Procs 214M      847.0000 (  0.00%)      821.0000 (  3.07%)
Procs 322M     1296.0000 (  0.00%)     1232.0000 (  4.94%)
Procs 429M     1692.0000 (  0.00%)     1646.0000 (  2.72%)
Procs 536M     2137.0000 (  0.00%)     2052.0000 (  3.98%)
Procs 644M     2542.0000 (  0.00%)     2472.0000 (  2.75%)
Procs 751M     2953.0000 (  0.00%)     2871.0000 (  2.78%)
Procs 859M     3360.0000 (  0.00%)     3290.0000 (  2.08%)
Procs 966M     3770.0000 (  0.00%)     3678.0000 (  2.44%)
Procs 1073M    4220.0000 (  0.00%)     4101.0000 (  2.82%)
Procs 1181M    4638.0000 (  0.00%)     4518.0000 (  2.59%)
Procs 1288M    5038.0000 (  0.00%)     4934.0000 (  2.06%)
Procs 1395M    5481.0000 (  0.00%)     5344.0000 (  2.50%)
Procs 1503M    5940.0000 (  0.00%)     5764.0000 (  2.96%)
Procs 1610M    6316.0000 (  0.00%)     6186.0000 (  2.06%)
Procs 1717M    6749.0000 (  0.00%)     6595.0000 (  2.28%)
Procs 1825M    7323.0000 (  0.00%)     7034.0000 (  3.95%)
Procs 1932M    7694.0000 (  0.00%)     7461.0000 (  3.03%)
Procs 2040M    8079.0000 (  0.00%)     7837.0000 (  3.00%)
Procs 2147M    8495.0000 (  0.00%)     8351.0000 (  1.70%)

     samples percentage
xfs    78334     1.3089 vmlinux-3.15.0-rc5-mmotm        page_waitqueue
xfs    55910     0.9342 vmlinux-3.15.0-rc5-mmotm        unlock_page
xfs    45120     0.7539 vmlinux-3.15.0-rc5-mmotm        __wake_up_bit
xfs    41414     0.6920 vmlinux-3.15.0-rc5-mmotm        test_clear_page_writeback
xfs     4823     0.0806 vmlinux-3.15.0-rc5-mmotm        end_page_writeback
xfs   120504     2.0046 vmlinux-3.15.0-rc5-lockpage-v8  unlock_page
xfs    49179     0.8181 vmlinux-3.15.0-rc5-lockpage-v8  test_clear_page_writeback
xfs     5397     0.0898 vmlinux-3.15.0-rc5-lockpage-v8  end_page_writeback
xfs     2101     0.0350 vmlinux-3.15.0-rc5-lockpage-v8  __wake_up_bit
xfs        5    8.3e-05 vmlinux-3.15.0-rc5-lockpage-v8  page_waitqueue
xfs        4    6.7e-05 vmlinux-3.15.0-rc5-lockpage-v8  __wake_up_page_bit
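The wake side must order things carefully. The rule implemented by
__wake_up_page_bit() in the patch below is that waitqueue_active() and
any clearing of PG_waiters are evaluated under the waitqueue lock, so a
waiter adding itself in parallel cannot lose its wakeup. A schematic
with toy names, using a pthread condvar in place of the kernel
waitqueue:

	/*
	 * Schematic of the wake-side rule; not the patch code itself.
	 */
	#include <pthread.h>
	#include <stdatomic.h>

	#define PG_WAITERS	(1UL << 1)

	struct toy_waitqueue {
		pthread_mutex_t lock;
		pthread_cond_t cond;
		int nr_waiters;		/* stand-in for waitqueue_active() */
	};

	void toy_wake_page_bit(struct toy_waitqueue *wqh,
			       _Atomic unsigned long *flags)
	{
		pthread_mutex_lock(&wqh->lock);
		if (wqh->nr_waiters) {
			/*
			 * Wake sleepers; the hint bit is left for the
			 * last waiter (or a colliding page) to clean
			 * up, costing at most one extra waitqueue
			 * lookup later.
			 */
			pthread_cond_broadcast(&wqh->cond);
		} else {
			/*
			 * Queue is empty and no waiter can be added
			 * while we hold the lock, so the hint can be
			 * cleared safely.
			 */
			atomic_fetch_and(flags, ~PG_WAITERS);
		}
		pthread_mutex_unlock(&wqh->lock);
	}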
[jack@xxxxxxx: Fix add_page_wait_queue]
[mhocko@xxxxxxx: Use sleep_on_page_killable in __wait_on_page_locked_killable]
[steiner@xxxxxxx: Do not update struct page unnecessarily]
[peterz@xxxxxxxxxxxxx: consolidate within wait.c, catch all ClearPageWaiters]
Signed-off-by: Nick Piggin <npiggin@xxxxxxx>
Signed-off-by: Mel Gorman <mgorman@xxxxxxx>
---
 include/linux/page-flags.h |  18 +++++
 include/linux/wait.h       |   8 +++
 kernel/sched/wait.c        | 161 ++++++++++++++++++++++++++++++++++++---------
 mm/filemap.c               |  25 +++----
 mm/page_alloc.c            |   1 +
 mm/swap.c                  |  12 ++++
 mm/vmscan.c                |   7 ++
 7 files changed, 189 insertions(+), 43 deletions(-)
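For completeness, the waiter side of the same protocol in the same toy
style: the waiter publishes the hint bit before its final re-check of
the lock bit, so the unlocker's cheap PageWaiters() test can never miss
a sleeper. All names here are again illustrative stand-ins:

	/*
	 * Waiter side, schematically; not the patch code.
	 */
	#include <pthread.h>
	#include <stdatomic.h>

	#define PG_LOCKED	(1UL << 0)
	#define PG_WAITERS	(1UL << 1)

	struct toy_waitqueue {
		pthread_mutex_t lock;
		pthread_cond_t cond;
		int nr_waiters;
	};

	void lock_toy_page(_Atomic unsigned long *flags,
			   struct toy_waitqueue *wqh)
	{
		/* Try to take PG_LOCKED; on contention, sleep, retry. */
		while (atomic_fetch_or(flags, PG_LOCKED) & PG_LOCKED) {
			pthread_mutex_lock(&wqh->lock);
			wqh->nr_waiters++;
			atomic_fetch_or(flags, PG_WAITERS); /* publish hint */
			/*
			 * Re-check under the lock: if PG_LOCKED was
			 * cleared after the failed trylock above, the
			 * wakeup has already happened.
			 */
			while (atomic_load(flags) & PG_LOCKED)
				pthread_cond_wait(&wqh->cond, &wqh->lock);
			wqh->nr_waiters--;
			pthread_mutex_unlock(&wqh->lock);
		}
	}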
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 7baf0fe..b697e4f 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -87,6 +87,7 @@ enum pageflags {
 	PG_private_2,		/* If pagecache, has fs aux data */
 	PG_writeback,		/* Page is under writeback */
 #ifdef CONFIG_PAGEFLAGS_EXTENDED
+	PG_waiters,		/* Page has PG_locked waiters. */
 	PG_head,		/* A head page */
 	PG_tail,		/* A tail page */
 #else
@@ -213,6 +214,22 @@ PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked)
 __PAGEFLAG(SlobFree, slob_free)
 
+#ifdef CONFIG_PAGEFLAGS_EXTENDED
+PAGEFLAG(Waiters, waiters) __CLEARPAGEFLAG(Waiters, waiters)
+	TESTCLEARFLAG(Waiters, waiters)
+#define __PG_WAITERS		(1 << PG_waiters)
+#else
+/* Always fallback to slow path on 32-bit */
+static inline bool PageWaiters(struct page *page)
+{
+	return true;
+}
+static inline void __ClearPageWaiters(struct page *page) {}
+static inline void ClearPageWaiters(struct page *page) {}
+static inline void SetPageWaiters(struct page *page) {}
+#define __PG_WAITERS		0
+#endif /* CONFIG_PAGEFLAGS_EXTENDED */
+
 /*
  * Private page markings that may be used by the filesystem that owns the page
  * for its own purposes.
@@ -509,6 +526,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page)
 	 1 << PG_writeback | 1 << PG_reserved | \
 	 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \
 	 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \
+	 __PG_WAITERS | \
 	 __PG_COMPOUND_LOCK)
 
 /*
diff --git a/include/linux/wait.h b/include/linux/wait.h
index bd68819..9226724 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -141,14 +141,21 @@ __remove_wait_queue(wait_queue_head_t *head, wait_queue_t *old)
 	list_del(&old->task_list);
 }
 
+struct page;
+
 void __wake_up(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
 void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key);
 void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
 void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr);
 void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr);
 void __wake_up_bit(wait_queue_head_t *, void *, int);
+void __wake_up_page_bit(wait_queue_head_t *, struct page *page, void *, int);
 int __wait_on_bit(wait_queue_head_t *, struct wait_bit_queue *, int (*)(void *), unsigned);
+int __wait_on_page_bit(wait_queue_head_t *, struct wait_bit_queue *,
+			struct page *page, int (*)(void *), unsigned);
 int __wait_on_bit_lock(wait_queue_head_t *, struct wait_bit_queue *, int (*)(void *), unsigned);
+int __wait_on_page_bit_lock(wait_queue_head_t *, struct wait_bit_queue *,
+			struct page *page, int (*)(void *), unsigned);
 void wake_up_bit(void *, int);
 void wake_up_atomic_t(atomic_t *);
 int out_of_line_wait_on_bit(void *, int, int (*)(void *), unsigned);
@@ -822,6 +829,7 @@ void prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state);
 void prepare_to_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait, int state);
 long prepare_to_wait_event(wait_queue_head_t *q, wait_queue_t *wait, int state);
 void finish_wait(wait_queue_head_t *q, wait_queue_t *wait);
+void finish_wait_page(wait_queue_head_t *q, wait_queue_t *wait, struct page *page);
 void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait, unsigned int mode, void *key);
 int autoremove_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key);
 int wake_bit_function(wait_queue_t *wait, unsigned mode, int sync, void *key);
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 0ffa20a..43e7df0 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -167,31 +167,47 @@ EXPORT_SYMBOL_GPL(__wake_up_sync);	/* For internal use only */
  * stops them from bleeding out - it would still allow subsequent
  * loads to move into the critical region).
  */
-void
-prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
+static __always_inline void
+__prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait,
+			struct page *page, int state, bool exclusive)
 {
 	unsigned long flags;
 
-	wait->flags &= ~WQ_FLAG_EXCLUSIVE;
 	spin_lock_irqsave(&q->lock, flags);
-	if (list_empty(&wait->task_list))
-		__add_wait_queue(q, wait);
+
+	/*
+	 * pages are hashed on a waitqueue that is expensive to lookup.
+	 * __wait_on_page_bit and __wait_on_page_bit_lock pass in a page
+	 * to set PG_waiters here. A PageWaiters() can then be used at
+	 * unlock time or when writeback completes to detect if there
+	 * are any potential waiters that justify a lookup.
+	 */
+	if (page && !PageWaiters(page))
+		SetPageWaiters(page);
+	if (list_empty(&wait->task_list)) {
+		if (exclusive) {
+			wait->flags |= WQ_FLAG_EXCLUSIVE;
+			__add_wait_queue_tail(q, wait);
+		} else {
+			wait->flags &= ~WQ_FLAG_EXCLUSIVE;
+			__add_wait_queue(q, wait);
+		}
+	}
 	set_current_state(state);
 	spin_unlock_irqrestore(&q->lock, flags);
 }
+
+void
+prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
+{
+	return __prepare_to_wait(q, wait, NULL, state, false);
+}
 EXPORT_SYMBOL(prepare_to_wait);
 
 void
 prepare_to_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait, int state)
 {
-	unsigned long flags;
-
-	wait->flags |= WQ_FLAG_EXCLUSIVE;
-	spin_lock_irqsave(&q->lock, flags);
-	if (list_empty(&wait->task_list))
-		__add_wait_queue_tail(q, wait);
-	set_current_state(state);
-	spin_unlock_irqrestore(&q->lock, flags);
+	return __prepare_to_wait(q, wait, NULL, state, true);
 }
 EXPORT_SYMBOL(prepare_to_wait_exclusive);
 
@@ -219,16 +235,8 @@ long prepare_to_wait_event(wait_queue_head_t *q, wait_queue_t *wait, int state)
 }
 EXPORT_SYMBOL(prepare_to_wait_event);
 
-/**
- * finish_wait - clean up after waiting in a queue
- * @q: waitqueue waited on
- * @wait: wait descriptor
- *
- * Sets current thread back to running state and removes
- * the wait descriptor from the given waitqueue if still
- * queued.
- */
-void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
+static __always_inline void __finish_wait(wait_queue_head_t *q,
+			wait_queue_t *wait, struct page *page)
 {
 	unsigned long flags;
 
@@ -249,9 +257,33 @@ void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
 	if (!list_empty_careful(&wait->task_list)) {
 		spin_lock_irqsave(&q->lock, flags);
 		list_del_init(&wait->task_list);
+
+		/*
+		 * Clear PG_waiters if the waitqueue is no longer active. There
+		 * is no guarantee that a page with no waiters will get cleared
+		 * as there may be unrelated pages hashed to sleep on the same
+		 * queue. Accurate detection would require a counter but
+		 * collisions are expected to be rare.
+		 */
+		if (page && !waitqueue_active(q))
+			ClearPageWaiters(page);
 		spin_unlock_irqrestore(&q->lock, flags);
 	}
 }
+
+/**
+ * finish_wait - clean up after waiting in a queue
+ * @q: waitqueue waited on
+ * @wait: wait descriptor
+ *
+ * Sets current thread back to running state and removes
+ * the wait descriptor from the given waitqueue if still
+ * queued.
+ */
+void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
+{
+	return __finish_wait(q, wait, NULL);
+}
 EXPORT_SYMBOL(finish_wait);
 
 /**
@@ -313,24 +345,39 @@ int wake_bit_function(wait_queue_t *wait, unsigned mode, int sync, void *arg)
 EXPORT_SYMBOL(wake_bit_function);
 
 /*
- * To allow interruptible waiting and asynchronous (i.e. nonblocking)
- * waiting, the actions of __wait_on_bit() and __wait_on_bit_lock() are
- * permitted return codes. Nonzero return codes halt waiting and return.
+ * Waits on a bit to be cleared (see wait_on_bit in wait.h for details).
+ * A page is optionally provided when used to wait on the PG_locked or
+ * PG_writeback bit. By setting PG_waiters a lookup of the waitqueue
+ * can be avoided during unlock_page or end_page_writeback.
 */
 int __sched
-__wait_on_bit(wait_queue_head_t *wq, struct wait_bit_queue *q,
+__wait_on_page_bit(wait_queue_head_t *wq, struct wait_bit_queue *q,
+			struct page *page,
 			int (*action)(void *), unsigned mode)
 {
 	int ret = 0;
 
 	do {
-		prepare_to_wait(wq, &q->wait, mode);
+		__prepare_to_wait(wq, &q->wait, page, mode, false);
 		if (test_bit(q->key.bit_nr, q->key.flags))
 			ret = (*action)(q->key.flags);
 	} while (test_bit(q->key.bit_nr, q->key.flags) && !ret);
-	finish_wait(wq, &q->wait);
+	__finish_wait(wq, &q->wait, page);
 	return ret;
 }
+
+/*
+ * To allow interruptible waiting and asynchronous (i.e. nonblocking)
+ * waiting, the actions of __wait_on_bit() and __wait_on_bit_lock() are
+ * permitted return codes. Nonzero return codes halt waiting and return.
+ */
+int __sched
+__wait_on_bit(wait_queue_head_t *wq, struct wait_bit_queue *q,
+			int (*action)(void *), unsigned mode)
+{
+	return __wait_on_page_bit(wq, q, NULL, action, mode);
+}
+
 EXPORT_SYMBOL(__wait_on_bit);
 
 int __sched out_of_line_wait_on_bit(void *word, int bit,
@@ -344,13 +391,14 @@ int __sched out_of_line_wait_on_bit(void *word, int bit,
 EXPORT_SYMBOL(out_of_line_wait_on_bit);
 
 int __sched
-__wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
+__wait_on_page_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
+			struct page *page,
 			int (*action)(void *), unsigned mode)
 {
 	do {
 		int ret;
 
-		prepare_to_wait_exclusive(wq, &q->wait, mode);
+		__prepare_to_wait(wq, &q->wait, page, mode, true);
 		if (!test_bit(q->key.bit_nr, q->key.flags))
 			continue;
 		ret = action(q->key.flags);
@@ -359,9 +407,16 @@ __wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
 		abort_exclusive_wait(wq, &q->wait, mode, &q->key);
 		return ret;
 	} while (test_and_set_bit(q->key.bit_nr, q->key.flags));
-	finish_wait(wq, &q->wait);
+	__finish_wait(wq, &q->wait, page);
 	return 0;
 }
+
+int __sched
+__wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
+			int (*action)(void *), unsigned mode)
+{
+	return __wait_on_page_bit_lock(wq, q, NULL, action, mode);
+}
 EXPORT_SYMBOL(__wait_on_bit_lock);
 
 int __sched out_of_line_wait_on_bit_lock(void *word, int bit,
@@ -380,6 +435,48 @@ void __wake_up_bit(wait_queue_head_t *wq, void *word, int bit)
 	if (waitqueue_active(wq))
 		__wake_up(wq, TASK_NORMAL, 1, &key);
 }
+
+void __wake_up_page_bit(wait_queue_head_t *wqh, struct page *page, void *word, int bit)
+{
+	struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);
+	unsigned long flags;
+
+	/*
+	 * If there is no PG_waiters bit (32-bit), then waitqueue_active can be
+	 * checked without wqh->lock as there is no PG_waiters race to protect.
+	 */
+	if (!__PG_WAITERS) {
+		if (waitqueue_active(wqh))
+			__wake_up(wqh, TASK_NORMAL, 1, &key);
+		return;
+	}
+
+	/*
+	 * Unlike __wake_up_bit it is necessary to check waitqueue_active
+	 * under the wqh->lock to avoid races with parallel additions that
+	 * could result in lost wakeups.
+	 */
+	spin_lock_irqsave(&wqh->lock, flags);
+	if (waitqueue_active(wqh)) {
+		/*
+		 * Try waking a task on the queue. Responsibility for clearing
+		 * the PG_waiters bit is left to the last waiter on the
+		 * waitqueue as PageWaiters is called outside wqh->lock and
+		 * we cannot miss wakeups. Due to hash queue collisions, there
+		 * may be colliding pages that still have PG_waiters set but
+		 * the impact is that there will be at least one unnecessary
+		 * lookup of the page waitqueue on the next unlock_page or
+		 * end of writeback.
+		 */
+		__wake_up_common(wqh, TASK_NORMAL, 1, 0, &key);
+	} else {
+		/* No potential waiters, safe to clear PG_waiters */
+		ClearPageWaiters(page);
+	}
+	spin_unlock_irqrestore(&wqh->lock, flags);
+}
+
 EXPORT_SYMBOL(__wake_up_bit);
 
 /**
diff --git a/mm/filemap.c b/mm/filemap.c
index 263cffe..07633a4 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -682,9 +682,9 @@ static wait_queue_head_t *page_waitqueue(struct page *page)
 	return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)];
 }
 
-static inline void wake_up_page(struct page *page, int bit)
+static inline void wake_up_page(struct page *page, int bit_nr)
 {
-	__wake_up_bit(page_waitqueue(page), &page->flags, bit);
+	__wake_up_page_bit(page_waitqueue(page), page, &page->flags, bit_nr);
 }
 
 void wait_on_page_bit(struct page *page, int bit_nr)
@@ -692,8 +692,8 @@ void wait_on_page_bit(struct page *page, int bit_nr)
 	DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);
 
 	if (test_bit(bit_nr, &page->flags))
-		__wait_on_bit(page_waitqueue(page), &wait, sleep_on_page,
-							TASK_UNINTERRUPTIBLE);
+		__wait_on_page_bit(page_waitqueue(page), &wait, page,
+				sleep_on_page, TASK_UNINTERRUPTIBLE);
 }
 EXPORT_SYMBOL(wait_on_page_bit);
 
@@ -704,7 +704,7 @@ int wait_on_page_bit_killable(struct page *page, int bit_nr)
 	if (!test_bit(bit_nr, &page->flags))
 		return 0;
 
-	return __wait_on_bit(page_waitqueue(page), &wait,
+	return __wait_on_page_bit(page_waitqueue(page), &wait, page,
 			     sleep_on_page_killable, TASK_KILLABLE);
 }
 
@@ -743,7 +743,8 @@ void unlock_page(struct page *page)
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	clear_bit_unlock(PG_locked, &page->flags);
 	smp_mb__after_atomic();
-	wake_up_page(page, PG_locked);
+	if (unlikely(PageWaiters(page)))
+		wake_up_page(page, PG_locked);
 }
 EXPORT_SYMBOL(unlock_page);
 
@@ -769,7 +770,8 @@ void end_page_writeback(struct page *page)
 		BUG();
 
 	smp_mb__after_atomic();
-	wake_up_page(page, PG_writeback);
+	if (unlikely(PageWaiters(page)))
+		wake_up_page(page, PG_writeback);
 }
 EXPORT_SYMBOL(end_page_writeback);
 
@@ -806,8 +808,8 @@ void __lock_page(struct page *page)
 {
 	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
 
-	__wait_on_bit_lock(page_waitqueue(page), &wait, sleep_on_page,
-							TASK_UNINTERRUPTIBLE);
+	__wait_on_page_bit_lock(page_waitqueue(page), &wait, page,
+				sleep_on_page, TASK_UNINTERRUPTIBLE);
 }
 EXPORT_SYMBOL(__lock_page);
 
@@ -815,9 +817,10 @@ int __lock_page_killable(struct page *page)
 {
 	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
 
-	return __wait_on_bit_lock(page_waitqueue(page), &wait,
-					sleep_on_page_killable, TASK_KILLABLE);
+	return __wait_on_page_bit_lock(page_waitqueue(page), &wait, page,
+					sleep_on_page, TASK_KILLABLE);
 }
+
 EXPORT_SYMBOL_GPL(__lock_page_killable);
 
 int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cd1f005..ebb947d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6603,6 +6603,7 @@ static const struct trace_print_flags pageflag_names[] = {
 	{1UL << PG_private_2,		"private_2"	},
 	{1UL << PG_writeback,		"writeback"	},
 #ifdef CONFIG_PAGEFLAGS_EXTENDED
+	{1UL << PG_waiters,		"waiters"	},
 	{1UL << PG_head,		"head"		},
 	{1UL << PG_tail,		"tail"		},
 #else
diff --git a/mm/swap.c b/mm/swap.c
index 9e8e347..1581dbf 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -67,6 +67,10 @@ static void __page_cache_release(struct page *page)
 static void __put_single_page(struct page *page)
 {
 	__page_cache_release(page);
+
+	/* See release_pages on why this clear may be necessary */
+	__ClearPageWaiters(page);
+
 	free_hot_cold_page(page, false);
 }
 
@@ -916,6 +920,14 @@ void release_pages(struct page **pages, int nr, bool cold)
 		/* Clear Active bit in case of parallel mark_page_accessed */
 		__ClearPageActive(page);
 
+		/*
+		 * pages are hashed on a waitqueue so there may be collisions.
+		 * When waiters are woken the waitqueue is checked but
+		 * unrelated pages on the queue can leave the bit set. Clear
+		 * it here if that happens.
+		 */
+		__ClearPageWaiters(page);
+
 		list_add(&page->lru, &pages_to_free);
 	}
 	if (zone)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7f85041..d7a4969 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1096,6 +1096,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 * waiting on the page lock, because there are no references.
 		 */
 		__clear_page_locked(page);
+
+		/* See release_pages on why this clear may be necessary */
+		__ClearPageWaiters(page);
 free_it:
 		nr_reclaimed++;
 
@@ -1427,6 +1430,8 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
 		if (put_page_testzero(page)) {
 			__ClearPageLRU(page);
 			__ClearPageActive(page);
+			/* See release_pages on why this clear may be necessary */
+			__ClearPageWaiters(page);
 			del_page_from_lru_list(page, lruvec, lru);
 
 			if (unlikely(PageCompound(page))) {
@@ -1650,6 +1655,8 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
 		if (put_page_testzero(page)) {
 			__ClearPageLRU(page);
 			__ClearPageActive(page);
+			/* See release_pages on why this clear may be necessary */
+			__ClearPageWaiters(page);
 			del_page_from_lru_list(page, lruvec, lru);
 
 			if (unlikely(PageCompound(page))) {