On Wed, Mar 04, 2015 at 03:52:55PM -0500, Rik van Riel wrote:
> On 03/04/2015 03:03 PM, Shaohua Li wrote:
> > kswapd is per-node based. Sometimes there is an imbalance between nodes:
> > node A is full of clean file pages (easy to reclaim), while node B is
> > full of anon pages (hard to reclaim). Under memory pressure, kswapd will
> > be woken up for both nodes. The kswapd of node B will try to swap, while
> > we would prefer to reclaim pages from node A first. The real issue here
> > is that we don't have a mechanism to prevent memory allocation from a
> > hard-to-reclaim node (node B here) while there is an easy-to-reclaim
> > node (node A) to reclaim memory from.
> >
> > The swap can happen even with swappiness 0. Below is a simple script to
> > trigger it. CPUs 1 and 8 are in different nodes, each with 72G memory:
> > truncate -s 70G img
> > taskset -c 8 dd if=img of=/dev/null bs=4k
> > taskset -c 1 usemem 70G
> >
> > The swap is even easier to trigger because we have a protection
> > mechanism for the situation where file pages are below the high
> > watermark. This logic makes sense but could be more conservative.
> >
> > This patch doesn't try to fix the kswapd imbalance issue above, but
> > makes get_scan_count more conservative about selecting anon pages. The
> > protection mechanism is designed for the situation where file pages are
> > rotated frequently. In that situation, page reclaim should be in
> > trouble, e.g. the priority is lower. So let's only apply the protection
> > mechanism in that situation. In practice, this fixes the swap issue in
> > the above test.
> >
> > Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> > Cc: Mel Gorman <mgorman@xxxxxxx>
> > Cc: Rik van Riel <riel@xxxxxxxxxx>
> > Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
> > Signed-off-by: Shaohua Li <shli@xxxxxx>
>
> Doh, never mind my earlier comment. I must be too tired
> to look at stuff right...
>
> I see how your patch helps avoid the problem, but I am
> worried about potential side effects. I suspect it could
> lead to page cache thrashing when all zones are low on
> page cache memory.
>
> Would it make sense to explicitly check that we are low
> on page cache pages in all zones on the scan list, before
> forcing anon-only scanning, when we get into this function?

Ok, we still need to check the priority to make sure kswapd doesn't get
stuck on zones without enough file pages. How about this one?
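
Before the patch itself, here is a minimal userspace sketch (purely
illustrative, not kernel code) of the decision being proposed: anon
scanning is forced only when the zone is low on file pages and, in
addition, either every eligible zone is low on file pages or the reclaim
priority has already dropped. DEF_PRIORITY and the watermark comparison
mirror the kernel; struct zone_stats, should_force_scan_anon() and the
numbers are made-up stand-ins.

#include <stdbool.h>
#include <stdio.h>

#define DEF_PRIORITY	12	/* kernel's default reclaim priority */

/* Illustrative stand-in for the per-zone counters the patch reads. */
struct zone_stats {
	unsigned long free;		/* NR_FREE_PAGES */
	unsigned long file;		/* NR_ACTIVE_FILE + NR_INACTIVE_FILE */
	unsigned long high_wmark;	/* high_wmark_pages(zone) */
};

/* Mirrors zone_force_scan_anon() in the patch: file + free below the
 * high watermark means this zone has almost no file cache left. */
static bool zone_low_on_file(const struct zone_stats *z)
{
	return z->file + z->free <= z->high_wmark;
}

/* The combined condition: only force SCAN_ANON when this zone is low on
 * file pages AND (every eligible zone is low on file pages OR reclaim
 * priority has already dropped below DEF_PRIORITY - 2). */
static bool should_force_scan_anon(const struct zone_stats *z,
				   bool all_zones_low_on_file, int priority)
{
	return zone_low_on_file(z) &&
	       (all_zones_low_on_file || priority < DEF_PRIORITY - 2);
}

int main(void)
{
	/* Node B from the reproducer: full of anon pages, few file pages. */
	struct zone_stats node_b = { .free = 1000, .file = 100, .high_wmark = 4000 };

	/* At default priority, with other zones still rich in file cache,
	 * anon scanning is not forced, so no swap with swappiness 0. */
	printf("priority %d: force anon = %d\n", DEF_PRIORITY,
	       should_force_scan_anon(&node_b, false, DEF_PRIORITY));

	/* Deep into reclaim, the protection kicks in again. */
	printf("priority %d: force anon = %d\n", DEF_PRIORITY - 3,
	       should_force_scan_anon(&node_b, false, DEF_PRIORITY - 3));

	return 0;
}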
>From d763f81ce445d11fc9388803172b014b1b61f989 Mon Sep 17 00:00:00 2001
Message-Id: <d763f81ce445d11fc9388803172b014b1b61f989.1425676112.git.shli@xxxxxx>
From: Shaohua Li <shli@xxxxxx>
Date: Wed, 4 Mar 2015 11:38:04 -0800
Subject: [PATCH] vmscan: get_scan_count selects anon pages conservatively

kswapd is per-node based. Sometimes there is an imbalance between nodes:
node A is full of clean file pages (easy to reclaim), while node B is
full of anon pages (hard to reclaim). Under memory pressure, kswapd will
be woken up for both nodes. The kswapd of node B will try to swap, while
we would prefer to reclaim pages from node A first. The real issue here
is that we don't have a mechanism to prevent memory allocation from a
hard-to-reclaim node (node B here) while there is an easy-to-reclaim
node (node A) to reclaim memory from.

The swap can happen even with swappiness 0. Below is a simple script to
trigger it. CPUs 1 and 8 are in different nodes, each with 72G memory:
truncate -s 70G img
taskset -c 8 dd if=img of=/dev/null bs=4k
taskset -c 1 usemem 70G

The swap is even easier to trigger because we have a protection
mechanism for the situation where file pages are below the high
watermark. This logic makes sense but could be more conservative.

This patch doesn't try to fix the kswapd imbalance issue above, but
makes get_scan_count more conservative about selecting anon pages, which
relieves the swap issue a little bit. The protection mechanism is
designed for the situation where file pages are rotated frequently. In
that situation, page reclaim should be in trouble, e.g. the priority is
lower. So let's only apply the protection mechanism in that situation.
If all zones are low on file cache, the protection is applied too, to
avoid cache thrashing. In practice, this fixes the swap issue in the
above test.

Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Cc: Mel Gorman <mgorman@xxxxxxx>
Cc: Rik van Riel <riel@xxxxxxxxxx>
Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
Signed-off-by: Shaohua Li <shli@xxxxxx>
---
 mm/vmscan.c | 42 ++++++++++++++++++++++++++++++++++--------
 1 file changed, 34 insertions(+), 8 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5e8eadd..7046e13 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -99,6 +99,9 @@ struct scan_control {
 	/* One of the zones is ready for compaction */
 	unsigned int compaction_ready:1;
 
+	/* file pages are low, must reclaim anon pages */
+	unsigned int force_scan_anon:1;
+
 	/* Incremented by the number of inactive pages that were scanned */
 	unsigned long nr_scanned;
 
@@ -1908,6 +1911,18 @@ enum scan_balance {
 	SCAN_FILE,
 };
 
+static bool zone_force_scan_anon(struct zone *zone)
+{
+	unsigned long zonefile;
+	unsigned long zonefree;
+
+	zonefree = zone_page_state(zone, NR_FREE_PAGES);
+	zonefile = zone_page_state(zone, NR_ACTIVE_FILE) +
+		   zone_page_state(zone, NR_INACTIVE_FILE);
+
+	return zonefile + zonefree <= high_wmark_pages(zone);
+}
+
 /*
  * Determine how aggressively the anon and file LRU lists should be
  * scanned.  The relative value of each set of LRU lists is determined
@@ -1991,14 +2006,9 @@ static void get_scan_count(struct lruvec *lruvec, int swappiness,
 	 * anon pages.  Try to detect this based on file LRU size.
 	 */
 	if (global_reclaim(sc)) {
-		unsigned long zonefile;
-		unsigned long zonefree;
-
-		zonefree = zone_page_state(zone, NR_FREE_PAGES);
-		zonefile = zone_page_state(zone, NR_ACTIVE_FILE) +
-			   zone_page_state(zone, NR_INACTIVE_FILE);
-
-		if (unlikely(zonefile + zonefree <= high_wmark_pages(zone))) {
+		if (unlikely(zone_force_scan_anon(zone) &&
+			     (sc->force_scan_anon ||
+			      sc->priority < DEF_PRIORITY - 2))) {
 			scan_balance = SCAN_ANON;
 			goto out;
 		}
@@ -2473,6 +2483,22 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 	if (buffer_heads_over_limit)
 		sc->gfp_mask |= __GFP_HIGHMEM;
 
+	if (global_reclaim(sc)) {
+		int force_scan_anon = 1;
+
+		for_each_zone_zonelist_nodemask(zone, z, zonelist,
+				requested_highidx, sc->nodemask) {
+			if (!populated_zone(zone))
+				continue;
+			if (!cpuset_zone_allowed(zone,
+					GFP_KERNEL | __GFP_HARDWALL))
+				continue;
+			force_scan_anon &= zone_force_scan_anon(zone);
+			if (!force_scan_anon)
+				break;
+		}
+		sc->force_scan_anon = force_scan_anon;
+	}
+
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
 					requested_highidx, sc->nodemask) {
 		enum zone_type classzone_idx;
-- 
1.8.1