Re: [patch 1/2] vmscan: correct all_unreclaimable for zone without lru pages

On Fri, Sep 30, 2011 at 10:12:23AM +0800, Shaohua Li wrote:
> On Thu, 2011-09-29 at 17:18 +0800, Minchan Kim wrote:
> > On Thu, Sep 29, 2011 at 09:14:51AM +0800, Shaohua Li wrote:
> > > On Thu, 2011-09-29 at 01:57 +0800, Minchan Kim wrote:
> > > > On Wed, Sep 28, 2011 at 03:08:31PM +0800, Shaohua Li wrote:
> > > > > On Wed, 2011-09-28 at 14:57 +0800, Minchan Kim wrote:
> > > > > > On Tue, Sep 27, 2011 at 03:23:04PM +0800, Shaohua Li wrote:
> > > > > > > I saw the DMA zone always has ->all_unreclaimable set. The reason is that the
> > > > > > > high zones are big, so zone_watermark_ok/_safe() will always return false for
> > > > > > > the DMA zone with a high classzone_idx, because the DMA zone's lowmem_reserve
> > > > > > > is big for a high classzone_idx. When kswapd runs into the DMA zone, it doesn't
> > > > > > > scan/reclaim any pages (there are no pages on the lru), but it marks the zone
> > > > > > > as all_unreclaimable. This can happen in other low zones too.
> > > > > > 
> > > > > > Good catch!
> > > > > > 
> > > > > > > This is confusing and can potentially cause an OOM. Say a low zone has
> > > > > > > all_unreclaimable set while a high zone doesn't have enough memory. Then we
> > > > > > > allocate some pages in the low zone (for example, reading a blkdev with highmem
> > > > > > > support) and run into direct reclaim. Since the zone has all_unreclaimable set,
> > > > > > > direct reclaim might reclaim nothing and an OOM is reported. If
> > > > > > > all_unreclaimable is unset, the zone can actually reclaim some pages.
> > > > > > > If all_unreclaimable is unset, in the inner loop of balance_pgdat we always have
> > > > > > > all_zones_ok == 0 when checking a low zone's watermark. If the high zone's
> > > > > > > watermark isn't good, there is no problem. Otherwise, we might loop one more
> > > > > > > time in the outer loop, but since the high zone's watermark is ok, end_zone will
> > > > > > > be lower, then the low zone's watermark check will be ok and the outer loop will
> > > > > > > break. So this doesn't look like it introduces any problem.
> > > > > > 
> > > > > > I think it would be better to correct zone_reclaimable.
> > > > > > My point is that zone_reclaimable should consider zone->pages_scanned.
> > > > > > The point of the function is how many pages were scanned vs. how many pages remain on the LRU.
> > > > > > If the reclaimer doesn't scan the zone at all because it has no lru pages, it shouldn't
> > > > > > declare the zone all_unreclaimable.
> > > > > Actually, this was exactly my first version of the patch. The problem is
> > > > > that if a zone is truly unreclaimable (used by kernel pages or whatever),
> > > > > we will have zone->pages_scanned == 0 too. I thought we should set
> > > > > all_unreclaimable in this case.
> > > > 
> > > > Let's think about the problem again.
> > > > The fundamental problem is that the lower zone's lowmem_reserve for a higher zone
> > > > can be so huge that it might be bigger than the zone's own size.
> > > > I think we need a boundary limiting lowmem_reserve.
> > > > So how about this?
> > > I don't see a reason why a high zone allocation should fall back to a low
> > > zone if the high zone is big. Changing lowmem_reserve can cause that
> > > fallback. Is there any rationale here?
> > 
> > I tried to think of a better solution than yours, but I failed. :(
> > The reason I tried to avoid your patch is that kswapd is very complicated these
> > days, so I wanted not to add more logic for handling corner cases if we could
> > solve it another way. But as I said, I failed.
> > 
> > It seems my previous patch limiting lowmem_reserve doesn't make sense.
> > We can have a higher zone that is so big that lowmem_reserve[higher_zone] of a
> > low zone can freely be bigger than the low zone's own size.
> > It implies the low zone should not be used for higher allocations, so there is
> > no reason to limit it. My brain was broken. :(
> > 
> > But I still have a question about your patch.
> > What happens when the DMA zone sets zone->all_unreclaimable to 1?
> > 
> > You said the following:
> > 
> > > This is confusing and can potentially cause an OOM. Say a low zone has
> > > all_unreclaimable set while a high zone doesn't have enough memory. Then we
> > > allocate some pages in the low zone (for example, reading a blkdev with highmem
> > > support) and run into direct reclaim. Since the zone has all_unreclaimable set,
> > 
> > If the low zone had enough pages for the allocation, it couldn't have entered reclaim.
> > It means the low zone doesn't have enough free pages for the order of the allocation.
> > So it's natural to enter the reclaim path.
> > 
> > > direct reclaim might reclaim nothing and an OOM is reported. If
> > 
> > "Nothing" isn't correct. At least, it will do something at DEF_PRIORITY.
> It does something, but it might not reclaim any pages. For example, it
> starts page writeback at DEF_PRIORITY, but the page isn't on disk yet,
> and it skips further reclaiming at !DEF_PRIORITY.
> 
> > > all_unreclaimable is unset, the zone can actually reclaim some pages.
> > 
> > You said the cause of this problem is that the zone has no lru pages.
> > Then how could we reclaim any pages in the zone even if the zone's all_unreclaimable is unset?
> > Do you expect slab pages?
> The zone could have lru pages. Let's take an example: an allocation from
> ZONE_HIGHMEM, then kswapd runs, and ZONE_NORMAL gets all_unreclaimable set
> even though it has free pages. Then we write to a blkdev device, which uses
> ZONE_NORMAL for its page cache. Some pages in ZONE_NORMAL are on the lru, and
> then we run into direct page reclaim for ZONE_NORMAL. Since all_unreclaimable
> is set and the pages on the ZONE_NORMAL lru are dirty, direct reclaim could
> fail. But I'd agree this is a corner case.
> Besides, when I see ZONE_DMA has a lot of free pages and
> all_unreclaimable is set, it's really confusing.

Hi Shaohua,
Sorry for the late response, and thanks for your explanation.
I think it's valuable to fix.
How about this?

I hope others take an interest in the problem.
Cc'ed them.
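
For anyone jumping into the thread: the root cause comes down to the order-0
part of the watermark check, where a zone's free pages are compared against
the mark plus its lowmem_reserve[] entry for the requested classzone_idx.
Below is a simplified userspace sketch of that comparison with made-up
numbers; it is not the exact mm/page_alloc.c code:

#include <stdbool.h>
#include <stdio.h>

/*
 * Simplified model of the order-0 part of __zone_watermark_ok():
 * the check fails whenever free_pages <= mark + lowmem_reserve.
 * All numbers below are invented for illustration only.
 */
static bool watermark_ok(long free_pages, long mark, long lowmem_reserve)
{
	return free_pages > mark + lowmem_reserve;
}

int main(void)
{
	long dma_free = 3000;		/* a 16MB DMA zone, mostly free */
	long min_mark = 32;
	long reserve_high = 50000;	/* lowmem_reserve[HIGHMEM] scales with
					   the (huge) highmem size */

	/*
	 * The DMA zone can never look balanced for a HIGHMEM request,
	 * no matter how much of it is free.
	 */
	printf("%d\n", watermark_ok(dma_free, min_mark, reserve_high)); /* 0 */
	return 0;
}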

From 070d5b1a69921bc71c6aaa5445fb1d29ecb38f74 Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan.kim@xxxxxxxxx>
Date: Sat, 1 Oct 2011 15:26:08 +0900
Subject: [RFC] vmscan: set all_unreclaimable of zone carefully

Shaohua Li reported that all_unreclaimable of the DMA zone is always set
because the system has a big HIGHMEM zone, so lowmem_reserve[HIGH] of the
DMA zone can be big.

This can be a problem, as follows.

Assumptions:
1. The system has a lot of high memory, so lowmem_reserve[HIGH] of the DMA zone is big.
2. The HIGH/NORMAL zones are full, but the DMA zone has enough free pages.

Scenario
1. Request A wants to allocate a page in the HIGH zone.
2. The HIGH/NORMAL zones have already consumed lots of pages, so the allocation falls back
   to the DMA zone.
3. In the DMA zone the allocator fails too, because lowmem_reserve[HIGH] is very big, so
   it wakes up kswapd.
4. kswapd calls shrink_zone when it sees the DMA zone, since the DMA zone's lowmem_reserve[HIGHMEM]
   is so big that it can't meet zone_watermark_ok_safe(high_wmark_pages(zone) + balance_gap,
   *end_zone*).
5. The DMA zone doesn't meet the stop condition (nr_slab != 0 or zone_reclaimable) because the
   zone has few lru pages and no slab pages, so kswapd easily sets the zone's all_unreclaimable
   to *1*.
6. Request B wants to allocate many pages in the NORMAL zone, but the NORMAL zone has no free
   pages, so it falls back to the DMA zone.
7. The DMA zone allocates many pages for the NORMAL zone because lowmem_reserve[NORMAL] is small.
   These pages are used by an application (i.e., they are LRU pages; yes, now the DMA zone can
   have many reclaimable pages).
8. Request C wants to allocate a page in the NORMAL zone but fails because the DMA zone doesn't
   have enough free pages either. (Most of the pages in the DMA zone were consumed by B.)
9. Kswapd tries to reclaim lru pages in the DMA zone but skips it because the zone's
   all_unreclaimable is 1. Otherwise, it could reclaim the many pages that B is using.

Of course, we can do something at DEF_PRIORITY, but it can't do enough, because
direct reclaim can't raise synchronous reclaim when the zone has many dirty
pages, so the process ends up killed by the OOM killer.

The principal problem is caused at step 8.
By then the zone's lru size has increased very much, but zone->all_unreclaimable is still 1.
If the lru size has grown, it is worth trying to reclaim again.
The rationale is that we already reset all_unreclaimable to 0 even when we free just one page.
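
To make the rationale concrete, here is a tiny userspace walk-through of the
new check with hypothetical numbers (the real logic is the zone_unreclaimable()
helper in the patch below):

#include <stdbool.h>
#include <stdio.h>

/*
 * Model of the patch's new semantics: at give-up time we record
 * "lru size + 1" into all_unreclaimable, and the zone is treated
 * as unreclaimable only while its lru hasn't grown past that.
 * The numbers are hypothetical.
 */
static bool zone_unreclaimable(unsigned long recorded, unsigned long lru_now)
{
	return recorded >= lru_now + 1;
}

int main(void)
{
	/* step 5: kswapd gives up while the zone has, say, 10 lru pages */
	unsigned long recorded = 10 + 1;

	/* lru unchanged: the zone is still skipped by reclaim */
	printf("%d\n", zone_unreclaimable(recorded, 10));	/* 1 */

	/*
	 * step 7: B's allocations put, say, 500 pages on the lru,
	 * so the zone becomes eligible for reclaim again.
	 */
	printf("%d\n", zone_unreclaimable(recorded, 500));	/* 0 */
	return 0;
}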

Cc: Mel Gorman <mel@xxxxxxxxx>
Cc: Rik van Riel <riel@xxxxxxxxxx>
Cc: Michal Hocko <mhocko@xxxxxxx>
Cc: Johannes Weiner <jweiner@xxxxxxxxxx>
Cc: KOSAKI Motohiro <kosaki.motohiro@xxxxxxxxxxxxxx>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>
Reported-by: Shaohua Li <shaohua.li@xxxxxxxxx>
Signed-off-by: Minchan Kim <minchan.kim@xxxxxxxxx>
---
 include/linux/mmzone.h |    6 +++++-
 mm/vmscan.c            |   30 ++++++++++++++++++++++++------
 mm/vmstat.c            |    2 +-
 3 files changed, 30 insertions(+), 8 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 1ed4116..18e3f93 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -330,7 +330,11 @@ struct zone {
 	 * free areas of different sizes
 	 */
 	spinlock_t		lock;
-	int                     all_unreclaimable; /* All pages pinned */
+	/*
+	 * When we find the zone is all_unreclaimable, we record the
+	 * number of lru pages at that time into all_unreclaimable.
+	 */
+	unsigned long           all_unreclaimable; /* All pages pinned */
 #ifdef CONFIG_MEMORY_HOTPLUG
 	/* see spanned/present_pages for more description */
 	seqlock_t		span_seqlock;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 23256e8..a2d156e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2092,6 +2092,18 @@ restart:
 }
 
 /*
+ * Compare the zone's current lru size to the lru size recorded when we
+ * found it had no reclaimable lru pages.
+ */
+static inline bool zone_unreclaimable(struct zone *zone)
+{
+	bool ret = true;
+	if (zone->all_unreclaimable < zone_reclaimable_pages(zone) + 1)
+		ret = false;
+	return ret;
+}
+
+/*
  * This is the direct reclaim path, for page-allocating processes.  We only
  * try to reclaim pages from zones which will satisfy the caller's allocation
  * request.
@@ -2126,7 +2138,7 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
 		if (scanning_global_lru(sc)) {
 			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 				continue;
-			if (zone->all_unreclaimable && priority != DEF_PRIORITY)
+			if (zone_unreclaimable(zone) && priority != DEF_PRIORITY)
 				continue;	/* Let kswapd poll it */
 			/*
 			 * This steals pages from memory cgroups over softlimit
@@ -2165,7 +2177,7 @@ static bool all_unreclaimable(struct zonelist *zonelist,
 			continue;
 		if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 			continue;
-		if (!zone->all_unreclaimable)
+		if (!zone_unreclaimable(zone))
 			return false;
 	}
 
@@ -2461,7 +2473,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
 		 * they must be considered balanced here as well if kswapd
 		 * is to sleep
 		 */
-		if (zone->all_unreclaimable) {
+		if (zone_unreclaimable(zone)) {
 			balanced += zone->present_pages;
 			continue;
 		}
@@ -2559,7 +2571,7 @@ loop_again:
 			if (!populated_zone(zone))
 				continue;
 
-			if (zone->all_unreclaimable && priority != DEF_PRIORITY)
+			if (zone_unreclaimable(zone) && priority != DEF_PRIORITY)
 				continue;
 
 			/*
@@ -2605,7 +2617,7 @@ loop_again:
 			if (!populated_zone(zone))
 				continue;
 
-			if (zone->all_unreclaimable && priority != DEF_PRIORITY)
+			if (zone_unreclaimable(zone) && priority != DEF_PRIORITY)
 				continue;
 
 			sc.nr_scanned = 0;
@@ -2643,7 +2655,13 @@ loop_again:
 				total_scanned += sc.nr_scanned;
 
 				if (nr_slab == 0 && !zone_reclaimable(zone))
-					zone->all_unreclaimable = 1;
+					/*
+					 * If we later see that the zone's lru
+					 * size has increased compared to this
+					 * value, it is worth trying to reclaim again.
+					 */
+					zone->all_unreclaimable =
+						zone_reclaimable_pages(zone) + 1;
 			}
 
 			/*
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 471b20b..bb49f11 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1016,7 +1016,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 #endif
 	}
 	seq_printf(m,
-		   "\n  all_unreclaimable: %u"
+		   "\n  all_unreclaimable: %lu"
 		   "\n  start_pfn:         %lu"
 		   "\n  inactive_ratio:    %u",
 		   zone->all_unreclaimable,
-- 
1.7.4.1



> 
> Thanks,
> Shaohua
> 

-- 
Kind regards,
Minchan Kim


