Re: Possible deadloop in direct reclaim?

Minchan Kim <minchan@xxxxxxxxxx> · Fri, 2 Aug 2013 11:33:26 +0900



On Fri, Aug 02, 2013 at 11:26:28AM +0900, Minchan Kim wrote:
> Hello Lisa and KOSAKI,
> 
> Lisa's quote style is very hard to follow so I'd like to write at bottom
> as ignoring line by line rule.
> 
> Lisa, please correct your MUA.
> 
> On Thu, Aug 01, 2013 at 05:42:59PM +0900, Minchan Kim wrote:
> > On Thu, Aug 01, 2013 at 01:20:34AM -0700, Lisa Du wrote:
> > > >-----Original Message-----
> > > >From: Minchan Kim [mailto:minchan@xxxxxxxxxx]
> > > >Sent: 2013年8月1日 15:34
> > > >To: Lisa Du
> > > >Cc: linux-mm@xxxxxxxxx; KOSAKI Motohiro
> > > >Subject: Re: Possible deadloop in direct reclaim?
> > > >
> > > >On Wed, Jul 31, 2013 at 11:13:07PM -0700, Lisa Du wrote:
> > > >> >On Mon, Jul 22, 2013 at 09:58:17PM -0700, Lisa Du wrote:
> > > >> >> Dear Sir:
> > > >> >> Currently I met a possible deadloop in direct reclaim. After run plenty
> > > >of
> > > >> >the application, system run into a status that system memory is very
> > > >> >fragmentized. Like only order-0 and order-1 memory left.
> > > >> >> Then one process required a order-2 buffer but it enter an endless
> > > >direct
> > > >> >reclaim. From my trace log, I can see this loop already over 200,000
> > > >times.
> > > >> >Kswapd was first wake up and then go back to sleep as it cannot
> > > >rebalance
> > > >> >this order's memory. But zone->all_unreclaimable remains 1.
> > > >> >> Though direct_reclaim every time returns no pages, but as
> > > >> >zone->all_unreclaimable = 1, so it loop again and again. Even when
> > > >> >zone->pages_scanned also becomes very large. It will block the process
> > > >for
> > > >> >long time, until some watchdog thread detect this and kill this process.
> > > >> >Though it's in __alloc_pages_slowpath, but it's too slow right? Maybe
> > > >cost
> > > >> >over 50 seconds or even more.
> > > >> >> I think it's not as expected right?  Can we also add below check in the
> > > >> >function all_unreclaimable() to terminate this loop?
> > > >> >>
> > > >> >> @@ -2355,6 +2355,8 @@ static bool all_unreclaimable(struct zonelist
> > > >> >*zonelist,
> > > >> >>                         continue;
> > > >> >>                 if (!zone->all_unreclaimable)
> > > >> >>                         return false;
> > > >> >> +               if (sc->nr_reclaimed == 0
> > > >&& !zone_reclaimable(zone))
> > > >> >> +                       return true;
> > > >> >>         }
> > > >> >>          BTW: I'm using kernel3.4, I also try to search in the
> > > >kernel3.9,
> > > >> >didn't see a possible fix for such issue. Or is anyone also met such issue
> > > >> >before? Any comment will be welcomed, looking forward to your reply!
> > > >> >>
> > > >> >> Thanks!
> > > >> >
> > > >> >I'd like to ask somethigs.
> > > >> >
> > > >> >1. Do you have enabled swap?
> > > >> I set CONFIG_SWAP=y, but I didn't really have a swap partition, that
> > > >means my swap buffer size is 0;
> > > >> >2. Do you enable CONFIG_COMPACTION?
> > > >> No, I didn't enable;
> > > >> >3. Could we get your zoneinfo via cat /proc/zoneinfo?
> > > >> I dump some info from ramdump, please review:
> > > >
> > > >Thanks for the information.
> > > >You said order-2 allocation was failed so I will assume preferred zone
> > > >is normal zone, not high zone because high order allocation in kernel side
> > > >isn't from high zone.
> > > Yes, that's right!
> > > >
> > > >> crash> kmem -z
> > > >> NODE: 0  ZONE: 0  ADDR: c08460c0  NAME: "Normal"
> > > >>   SIZE: 192512  PRESENT: 182304  MIN/LOW/HIGH: 853/1066/1279
> > > >
> > > >712M normal memory.
> > > >
> > > >>   VM_STAT:
> > > >>           NR_FREE_PAGES: 16092
> > > >
> > > >There are plenty of free pages over high watermark but there are heavy
> > > >fragmentation as I see below information.
> > > >
> > > >So, kswapd doesn't scan this zone loop iteration is done with order-2.
> > > >I mean kswapd will scan this zone with order-0 if first iteration is
> > > >done by this
> > > >
> > > >        order = sc.order = 0;
> > > >
> > > >        goto loop_again;
> > > >
> > > >But this time, zone_watermark_ok_safe with testorder = 0 on normal zone
> > > >is always true so that scanning of zone will be skipped. It means kswapd
> > > >never set zone->unreclaimable to 1.
> > > Yes, definitely!
> > > >
> > > >>        NR_INACTIVE_ANON: 17
> > > >>          NR_ACTIVE_ANON: 55091
> > > >>        NR_INACTIVE_FILE: 17
> > > >>          NR_ACTIVE_FILE: 17
> > > >>          NR_UNEVICTABLE: 0
> > > >>                NR_MLOCK: 0
> > > >>           NR_ANON_PAGES: 55077
> > > >
> > > >There are about 200M anon pages and few file pages.
> > > >You don't have swap so that reclaimer couldn't go far.
> > > >
> > > >>          NR_FILE_MAPPED: 42
> > > >>           NR_FILE_PAGES: 69
> > > >>           NR_FILE_DIRTY: 0
> > > >>            NR_WRITEBACK: 0
> > > >>     NR_SLAB_RECLAIMABLE: 1226
> > > >>   NR_SLAB_UNRECLAIMABLE: 9373
> > > >>            NR_PAGETABLE: 2776
> > > >>         NR_KERNEL_STACK: 798
> > > >>         NR_UNSTABLE_NFS: 0
> > > >>               NR_BOUNCE: 0
> > > >>         NR_VMSCAN_WRITE: 91
> > > >>     NR_VMSCAN_IMMEDIATE: 115381
> > > >>       NR_WRITEBACK_TEMP: 0
> > > >>        NR_ISOLATED_ANON: 0
> > > >>        NR_ISOLATED_FILE: 0
> > > >>                NR_SHMEM: 31
> > > >>              NR_DIRTIED: 15256
> > > >>              NR_WRITTEN: 11981
> > > >> NR_ANON_TRANSPARENT_HUGEPAGES: 0
> > > >>
> > > >> NODE: 0  ZONE: 1  ADDR: c08464c0  NAME: "HighMem"
> > > >>   SIZE: 69632  PRESENT: 69088  MIN/LOW/HIGH: 67/147/228
> > > >>   VM_STAT:
> > > >>           NR_FREE_PAGES: 161
> > > >
> > > >Reclaimer should reclaim this zone.
> > > >
> > > >>        NR_INACTIVE_ANON: 104
> > > >>          NR_ACTIVE_ANON: 46114
> > > >>        NR_INACTIVE_FILE: 9722
> > > >>          NR_ACTIVE_FILE: 12263
> > > >
> > > >It seems there are lots of room to evict file pages.
> > > >
> > > >>          NR_UNEVICTABLE: 168
> > > >>                NR_MLOCK: 0
> > > >>           NR_ANON_PAGES: 46102
> > > >>          NR_FILE_MAPPED: 12227
> > > >>           NR_FILE_PAGES: 22270
> > > >>           NR_FILE_DIRTY: 1
> > > >>            NR_WRITEBACK: 0
> > > >>     NR_SLAB_RECLAIMABLE: 0
> > > >>   NR_SLAB_UNRECLAIMABLE: 0
> > > >>            NR_PAGETABLE: 0
> > > >>         NR_KERNEL_STACK: 0
> > > >>         NR_UNSTABLE_NFS: 0
> > > >>               NR_BOUNCE: 0
> > > >>         NR_VMSCAN_WRITE: 0
> > > >>     NR_VMSCAN_IMMEDIATE: 0
> > > >>       NR_WRITEBACK_TEMP: 0
> > > >>        NR_ISOLATED_ANON: 0
> > > >>        NR_ISOLATED_FILE: 0
> > > >>                NR_SHMEM: 117
> > > >>              NR_DIRTIED: 7364
> > > >>              NR_WRITTEN: 6989
> > > >> NR_ANON_TRANSPARENT_HUGEPAGES: 0
> > > >>
> > > >> ZONE  NAME        SIZE    FREE  MEM_MAP   START_PADDR
> > > >START_MAPNR
> > > >>   0   Normal    192512   16092  c1200000       0            0
> > > >> AREA    SIZE  FREE_AREA_STRUCT  BLOCKS  PAGES
> > > >>   0       4k      c08460f0           3      3
> > > >>   0       4k      c08460f8         436    436
> > > >>   0       4k      c0846100       15237  15237
> > > >>   0       4k      c0846108           0      0
> > > >>   0       4k      c0846110           0      0
> > > >>   1       8k      c084611c          39     78
> > > >>   1       8k      c0846124           0      0
> > > >>   1       8k      c084612c         169    338
> > > >>   1       8k      c0846134           0      0
> > > >>   1       8k      c084613c           0      0
> > > >>   2      16k      c0846148           0      0
> > > >>   2      16k      c0846150           0      0
> > > >>   2      16k      c0846158           0      0
> > > >> ---------Normal zone all order > 1 has no free pages
> > > >> ZONE  NAME        SIZE    FREE  MEM_MAP   START_PADDR
> > > >START_MAPNR
> > > >>   1   HighMem    69632     161  c17e0000    2f000000
> > > >192512
> > > >> AREA    SIZE  FREE_AREA_STRUCT  BLOCKS  PAGES
> > > >>   0       4k      c08464f0          12     12
> > > >>   0       4k      c08464f8           0      0
> > > >>   0       4k      c0846500          14     14
> > > >>   0       4k      c0846508           3      3
> > > >>   0       4k      c0846510           0      0
> > > >>   1       8k      c084651c           0      0
> > > >>   1       8k      c0846524           0      0
> > > >>   1       8k      c084652c           0      0
> > > >>   2      16k      c0846548           0      0
> > > >>   2      16k      c0846550           0      0
> > > >>   2      16k      c0846558           0      0
> > > >>   2      16k      c0846560           1      4
> > > >>   2      16k      c0846568           0      0
> > > >>   5     128k      c08465cc           0      0
> > > >>   5     128k      c08465d4           0      0
> > > >>   5     128k      c08465dc           0      0
> > > >>   5     128k      c08465e4           4    128
> > > >>   5     128k      c08465ec           0      0
> > > >> ------Other's all zero
> > > >>
> > > >> Some other zone information I dump from pglist_data
> > > >> {
> > > >> 	watermark = {853, 1066, 1279},
> > > >>       percpu_drift_mark = 0,
> > > >>       lowmem_reserve = {0, 2159, 2159},
> > > >>       dirty_balance_reserve = 3438,
> > > >>       pageset = 0xc07f6144,
> > > >>       lock = {
> > > >>         {
> > > >>           rlock = {
> > > >>             raw_lock = {
> > > >>               lock = 0
> > > >>             },
> > > >>             break_lock = 0
> > > >>           }
> > > >>         }
> > > >>       },
> > > >> 	all_unreclaimable = 0,
> > > >>       reclaim_stat = {
> > > >>         recent_rotated = {903355, 960912},
> > > >>         recent_scanned = {932404, 2462017}
> > > >>       },
> > > >>       pages_scanned = 84231,
> > > >
> > > >Most of scan happens in direct reclaim path, I guess
> > > >but direct reclaim couldn't reclaim any pages due to lack of swap device.
> > > >
> > > >It means we have to set zone->all_unreclaimable in direct reclaim path,
> > > >too.
> > > >Below patch fix your problem?
> > > Yes, your patch should fix my problem! 
> > > Actually I also did another patch, after test, should also fix my issue, 
> > > but I didn't set zone->all_unreclaimable in direct reclaim path as you, 
> > > just double check zone_reclaimable() status in all_unreclaimable() function.
> > > Maybe your patch is better!
> > 
> > Nope. I think your patch is better. :)
> > Just thing is anlaysis of the problem and description and I think we could do
> > better but unfortunately, I don't have enough time today so I will see tomorrow.
> > Just nitpick below.
> > 
> > Thanks.
> > 
> > > 
> > > commit 26d2b60d06234683a81666da55129f9c982271a5
> > > Author: Lisa Du <cldu@xxxxxxxxxxx>
> > > Date:   Thu Aug 1 10:16:32 2013 +0800
> > > 
> > >     mm: fix infinite direct_reclaim when memory is very fragmentized
> > >     
> > >     latest all_unreclaimable check in direct reclaim is the following commit.
> > >     2011 Apr 14; commit 929bea7c; vmscan:  all_unreclaimable() use
> > >                                 zone->all_unreclaimable as a name
> > >     and in addition, add oom_killer_disabled check to avoid reintroduce the
> > >     issue of commit d1908362 ("vmscan: check all_unreclaimable in direct reclaim path").
> > >     
> > >     But except the hibernation case in which kswapd is freezed, there's also other case
> > >     which may lead infinite loop in direct relaim. In a real test, direct_relaimer did
> > >     over 200000 times rebalance in __alloc_pages_slowpath(), so this process will be
> > >     blocked until watchdog detect and kill it. The root cause is as below:
> > >     
> > >     If system memory is very fragmentized like only order-0 and order-1 left,
> > >     kswapd will go to sleep as system cann't rebalanced for high-order allocations.
> > >     But direct_reclaim still works for higher order request. So zones can become a state
> > >     zone->all_unreclaimable = 0 but zone->pages_scanned > zone_reclaimable_pages(zone) * 6.
> > >     In this case if a process like do_fork try to allocate an order-2 memory which is not
> > >     a COSTLY_ORDER, as direct_reclaim always said it did_some_progress, so rebalance again
> > >     and again in __alloc_pages_slowpath(). This issue is easily happen in no swap and no
> > >     compaction enviroment.
> > >     
> > >     So add furthur check in all_unreclaimable() to avoid such case.
> > >     
> > >     Change-Id: Id3266b47c63f5b96aab466fd9f1f44d37e16cdcb
> > >     Signed-off-by: Lisa Du <cldu@xxxxxxxxxxx>
> > > 
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 2cff0d4..34582d9 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -2301,7 +2301,9 @@ static bool all_unreclaimable(struct zonelist *zonelist,
> > >                         continue;
> > >                 if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
> > >                         continue;
> > > -               if (!zone->all_unreclaimable)
> > > +               if (zone->all_unreclaimable)
> > > +                       continue;
> > 
> > Nitpick: If we use zone_reclaimable(), above check is redundant and
> > gain is very tiny because this path is already slow.
> > 
> > > +               if (zone_reclaimable(zone))
> > >                         return false;
> > >         }
> > > >
> > > >From a5d82159b98f3d90c2f9ff9e486699fb4c67cced Mon Sep 17 00:00:00
> > > >2001
> > > >From: Minchan Kim <minchan@xxxxxxxxxx>
> > > >Date: Thu, 1 Aug 2013 16:18:00 +0900
> > > >Subject:[PATCH] mm: set zone->all_unreclaimable in direct reclaim
> > > > path
> > > >
> > > >Lisa reported there are lots of free pages in a zone but most of them
> > > >is order-0 pages so it means the zone is heavily fragemented.
> > > >Then, high order allocation could make direct reclaim path'slong stall(
> > > >ex, 50 second) in no swap and no compaction environment.
> > > >
> > > >The reason is kswapd can skip the zone's scanning because the zone
> > > >is lots of free pages and kswapd changes scanning order from high-order
> > > >to 0-order after his first iteration is done because kswapd think
> > > >order-0 allocation is the most important.
> > > >Look at 73ce02e9 in detail.
> > > >
> > > >The problem from that is that only kswapd can set zone->all_unreclaimable
> > > >to 1 at the moment so direct reclaim path should loop forever until a ghost
> > > >can set the zone->all_unreclaimable to 1.
> > > >
> > > >This patch makes direct reclaim path to set zone->all_unreclaimable
> > > >to avoid infinite loop. So now we don't need a ghost.
> > > >
> > > >Reported-by: Lisa Du <cldu@xxxxxxxxxxx>
> > > >Signed-off-by: Minchan Kim <minchan@xxxxxxxxxx>
> > > >---
> > > > mm/vmscan.c |   29 ++++++++++++++++++++++++++++-
> > > > 1 file changed, 28 insertions(+), 1 deletion(-)
> > > >
> > > >diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > >index 33dc256..f957e87 100644
> > > >--- a/mm/vmscan.c
> > > >+++ b/mm/vmscan.c
> > > >@@ -2317,6 +2317,23 @@ static bool all_unreclaimable(struct zonelist
> > > >*zonelist,
> > > > 	return true;
> > > > }
> > > >
> > > >+static void check_zones_unreclaimable(struct zonelist *zonelist,
> > > >+					struct scan_control *sc)
> > > >+{
> > > >+	struct zoneref *z;
> > > >+	struct zone *zone;
> > > >+
> > > >+	for_each_zone_zonelist_nodemask(zone, z, zonelist,
> > > >+			gfp_zone(sc->gfp_mask), sc->nodemask) {
> > > >+		if (!populated_zone(zone))
> > > >+			continue;
> > > >+		if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
> > > >+			continue;
> > > >+		if (!zone_reclaimable(zone))
> > > >+			zone->all_unreclaimable = 1;
> > > >+	}
> > > >+}
> > > >+
> > > > /*
> > > >  * This is the main entry point to direct page reclaim.
> > > >  *
> > > >@@ -2370,7 +2387,17 @@ static unsigned long
> > > >do_try_to_free_pages(struct zonelist *zonelist,
> > > > 				lru_pages += zone_reclaimable_pages(zone);
> > > > 			}
> > > >
> > > >-			shrink_slab(shrink, sc->nr_scanned, lru_pages);
> > > >+			/*
> > > >+			 * When a zone has enough order-0 free memory but
> > > >+			 * zone is heavily fragmented and we need high order
> > > >+			 * page from the zone, kswapd could skip the zone
> > > >+			 * after first iteration with high order. So, kswapd
> > > >+			 * never set the zone->all_unreclaimable to 1 so
> > > >+			 * direct reclaim path needs the check.
> > > >+			 */
> > > >+			if (!shrink_slab(shrink, sc->nr_scanned, lru_pages))
> > > >+				check_zones_unreclaimable(zonelist, sc);
> > > >+
> > > > 			if (reclaim_state) {
> > > > 				sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> > > > 				reclaim_state->reclaimed_slab = 0;
> > > >--
> > > >1.7.9.5
> > > >
> > > >--
> > > >Kind regards,
> > > >Minchan Kim
> > 
> 
> I reviewed current mmotm because recently Mel changed kswapd a lot and
> all_unreclaimable patch history today.
> What I see is recent mmotm has a same problem, too if system have no swap
> and no compaction. Of course, compaction is default yes option so we could
> recommend to enable if system works well but it's up to user and we should
                                ^^^ typo
                      if system want work well

> avoid direct reclaim hang although user disable compaction.
> 
> When I see the patch history, real culprit is 929bea7c.
> 
> "  zone->all_unreclaimable and zone->pages_scanned are neigher atomic
>     variables nor protected by lock.  Therefore zones can become a state of
>     zone->page_scanned=0 and zone->all_unreclaimable=1.  In this case, current
>     all_unreclaimable() return false even though zone->all_unreclaimabe=1."
> 
> I understand the problem but apparently, it makes Lisa's problem because
> kswapd can give up balancing when high order allocation happens to prevent
> excessive reclaim with assuming the process requested high order allocation
> can do direct reclaim/compaction. But what if the process can't reclaim
> by no swap but lots of anon pages and can't compact by !CONFIG_COMPACTION?
> 
> In such system, OOM kill is natural but not hang.
> So, a solution we can fix simply introduces zone_reclaimable check again in
> all_unreclaimabe() like this.
> 
> What do you think about it?
> 
> It's a same patch Lisa posted so we should give a credit
> to her/him(Sorry I'm not sure) if we agree thie approach.
> 
> Lisa, If KOSAKI agree with this, could you resend this patch with your SOB?
> 
> Thanks.
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a3bf7fd..78f46d8 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2367,7 +2367,15 @@ static bool all_unreclaimable(struct zonelist *zonelist,
>  			continue;
>  		if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
>  			continue;
> -		if (!zone->all_unreclaimable)
> +		/*
> +		 * zone->page_scanned and could be raced so we need
                                      ^^^
                                      typo: please remove it.

> +		 * dobule check by zone->all_unreclaimable. Morever, kswapd
> +		 * could skip (zone->all_unreclaimable = 1) if the zone
> +		 * is heavily fragmented but enough free pages to meet
> +		 * high watermark. In such case, kswapd never set
> +		 * all_unreclaimable to 1 so we need zone_reclaimable, too.
> +		 */
> +		if (!zone->all_unreclaimable || zone_reclaimable(zone))
>  			return false;
>  	}
>  
> 
> 
> -- 
> Kind regards,
> Minchan Kim
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>