The patch titled
     mm: clear pages_scanned only if draining a pcp adds pages to the buddy allocator
has been removed from the -mm tree.  Its filename was
     mm-clear-pages_scanned-only-if-draining-a-pcp-adds-pages-to-the-buddy-allocator.patch

This patch was dropped because it was merged into mainline or a subsystem tree

The current -mm tree may be found at http://userweb.kernel.org/~akpm/mmotm/

------------------------------------------------------
Subject: mm: clear pages_scanned only if draining a pcp adds pages to the buddy allocator
From: David Rientjes <rientjes@xxxxxxxxxx>

Commit 0e093d99763e ("writeback: do not sleep on the congestion queue if there are no congested BDIs or if significant congestion is not being encountered in the current zone") uncovered a livelock in the page allocator that resulted in tasks infinitely looping trying to find memory and kswapd running at 100% cpu.

The issue occurs because drain_all_pages() is called immediately following direct reclaim when no memory is freed and try_to_free_pages() returns non-zero because all zones in the zonelist do not have their all_unreclaimable flag set.

When draining the per-cpu pagesets back to the buddy allocator for each zone, the zone->pages_scanned counter is cleared to avoid erroneously setting zone->all_unreclaimable later.  The problem is that no pages may actually be drained and, thus, the unreclaimable logic never fails direct reclaim, so the oom killer is never invoked.  This apparently only manifested after wait_iff_congested() was introduced and the zone was full of anonymous memory that would not congest the backing store.  The page allocator would infinitely loop if there were no other tasks waiting to be scheduled, clearing zone->pages_scanned via drain_all_pages() as a result of this change before kswapd could scan enough pages to trigger the reclaim logic.  Additionally, with every loop of the page allocator and in the reclaim path, kswapd would be kicked and would end up running at 100% cpu.  In this scenario, current and kswapd are both running continuously, with kswapd incrementing zone->pages_scanned and current clearing it.

The problem is even more pronounced when current swaps some of its memory to swap cache and the reclaimable logic then considers all active anonymous memory in the all_unreclaimable logic, which requires a much higher zone->pages_scanned value for try_to_free_pages() to return zero, a value that is never attainable in this scenario.  Before wait_iff_congested(), the page allocator would incur an unconditional timeout and allow kswapd to elevate zone->pages_scanned to a level at which the oom killer would be called the next time it looped.

The fix is to only attempt to drain pcp pages if there is actually a quantity to be drained.  The unconditional clearing of zone->pages_scanned in free_pcppages_bulk() need not be changed, since its other callers already ensure that draining will occur.  This patch ensures that drain_all_pages() only calls into free_pcppages_bulk() when it will actually free memory, so zone->pages_scanned is only cleared when appropriate.
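To make the failure mode concrete, here is a minimal, stand-alone C sketch of the interaction described above.  It is illustrative only: the structure, the 6x scan threshold, and the function names are simplified stand-ins invented for this example, not the kernel's actual code.

#include <stdbool.h>
#include <stdio.h>

/* Toy model of a zone: only the fields relevant to the livelock. */
struct zone {
	unsigned long pages_scanned;	/* bumped by reclaim ("kswapd") */
	unsigned long present_pages;
	unsigned long pcp_count;	/* pages sitting on the per-cpu lists */
	bool all_unreclaimable;
};

/* Old behaviour: pages_scanned is cleared even when nothing is drained. */
static void drain_pcp_old(struct zone *z)
{
	z->pages_scanned = 0;		/* side effect of free_pcppages_bulk() */
	z->pcp_count = 0;
}

/* Fixed behaviour: only reach free_pcppages_bulk() if pages exist. */
static void drain_pcp_fixed(struct zone *z)
{
	if (z->pcp_count) {
		z->pages_scanned = 0;
		z->pcp_count = 0;
	}
}

int main(void)
{
	struct zone z = { .present_pages = 100, .pcp_count = 0 };
	/* Swap in drain_pcp_fixed to see the counter make progress. */
	void (*drain)(struct zone *) = drain_pcp_old;

	for (int loop = 0; loop < 100; loop++) {
		z.pages_scanned += 10;			/* kswapd scans, frees nothing */
		if (z.pages_scanned > 6 * z.present_pages)
			z.all_unreclaimable = true;	/* would allow the oom killer to run */
		drain(&z);				/* allocator retries and drains the pcps */
	}
	printf("all_unreclaimable=%d pages_scanned=%lu\n",
	       z.all_unreclaimable, z.pages_scanned);
	return 0;
}

With the unconditional drain, the counter is reset on every retry and all_unreclaimable is never set, which is the livelock; with the guarded drain, an empty per-cpu list no longer resets the counter, so reclaim can eventually declare the zone unreclaimable and fail the allocation.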
Signed-off-by: David Rientjes <rientjes@xxxxxxxxxx>
Cc: Mel Gorman <mel@xxxxxxxxx>
Reviewed-by: Johannes Weiner <hannes@xxxxxxxxxxx>
Cc: Minchan Kim <minchan.kim@xxxxxxxxx>
Cc: Wu Fengguang <fengguang.wu@xxxxxxxxx>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>
Cc: KOSAKI Motohiro <kosaki.motohiro@xxxxxxxxxxxxxx>
Reviewed-by: Rik van Riel <riel@xxxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

 mm/page_alloc.c |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff -puN mm/page_alloc.c~mm-clear-pages_scanned-only-if-draining-a-pcp-adds-pages-to-the-buddy-allocator mm/page_alloc.c
--- a/mm/page_alloc.c~mm-clear-pages_scanned-only-if-draining-a-pcp-adds-pages-to-the-buddy-allocator
+++ a/mm/page_alloc.c
@@ -1088,8 +1088,10 @@ static void drain_pages(unsigned int cpu
 		pset = per_cpu_ptr(zone->pageset, cpu);
 
 		pcp = &pset->pcp;
-		free_pcppages_bulk(zone, pcp->count, pcp);
-		pcp->count = 0;
+		if (pcp->count) {
+			free_pcppages_bulk(zone, pcp->count, pcp);
+			pcp->count = 0;
+		}
 		local_irq_restore(flags);
 	}
 }
_

Patches currently in -mm which might be from rientjes@xxxxxxxxxx are

origin.patch
x86-numa-add-error-handling-for-bad-cpu-to-node-mappings.patch
oom-suppress-nodes-that-are-not-allowed-from-meminfo-on-oom-kill.patch
oom-suppress-show_mem-for-many-nodes-in-irq-context-on-page-alloc-failure.patch
oom-suppress-nodes-that-are-not-allowed-from-meminfo-on-page-alloc-failure.patch
jbd-remove-dependency-on-__gfp_nofail.patch

--
To unsubscribe from this list: send the line "unsubscribe mm-commits" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html