Re: Possible deadloop in direct reclaim?

Minchan Kim <minchan@xxxxxxxxxx> · Fri, 2 Aug 2013 11:26:28 +0900

Hello Lisa and KOSAKI,

Lisa's quoting style is very hard to follow, so I'm writing my reply at the
bottom instead of following the usual line-by-line rule.

Lisa, please correct your MUA.

On Thu, Aug 01, 2013 at 05:42:59PM +0900, Minchan Kim wrote:
> On Thu, Aug 01, 2013 at 01:20:34AM -0700, Lisa Du wrote:
> > >-----Original Message-----
> > >From: Minchan Kim [mailto:minchan@xxxxxxxxxx]
> > >Sent: August 1, 2013 15:34
> > >To: Lisa Du
> > >Cc: linux-mm@xxxxxxxxx; KOSAKI Motohiro
> > >Subject: Re: Possible deadloop in direct reclaim?
> > >
> > >On Wed, Jul 31, 2013 at 11:13:07PM -0700, Lisa Du wrote:
> > >> >On Mon, Jul 22, 2013 at 09:58:17PM -0700, Lisa Du wrote:
> > >> >> Dear Sir:
> > >> >> Currently I met a possible deadloop in direct reclaim. After run
> > >> >> plenty of the application, system run into a status that system
> > >> >> memory is very fragmentized. Like only order-0 and order-1 memory
> > >> >> left.
> > >> >> Then one process required a order-2 buffer but it enter an endless
> > >> >> direct reclaim. From my trace log, I can see this loop already over
> > >> >> 200,000 times. Kswapd was first wake up and then go back to sleep
> > >> >> as it cannot rebalance this order's memory. But
> > >> >> zone->all_unreclaimable remains 1.
> > >> >> Though direct_reclaim every time returns no pages, but as
> > >> >> zone->all_unreclaimable = 1, so it loop again and again. Even when
> > >> >> zone->pages_scanned also becomes very large. It will block the
> > >> >> process for long time, until some watchdog thread detect this and
> > >> >> kill this process. Though it's in __alloc_pages_slowpath, but it's
> > >> >> too slow right? Maybe cost over 50 seconds or even more.
> > >> >> I think it's not as expected right?  Can we also add below check in
> > >> >> the function all_unreclaimable() to terminate this loop?
> > >> >>
> > >> >> @@ -2355,6 +2355,8 @@ static bool all_unreclaimable(struct zonelist *zonelist,
> > >> >>                         continue;
> > >> >>                 if (!zone->all_unreclaimable)
> > >> >>                         return false;
> > >> >> +               if (sc->nr_reclaimed == 0 && !zone_reclaimable(zone))
> > >> >> +                       return true;
> > >> >>         }
> > >> >> BTW: I'm using kernel3.4, I also try to search in the kernel3.9,
> > >> >> didn't see a possible fix for such issue. Or is anyone also met such
> > >> >> issue before? Any comment will be welcomed, looking forward to your
> > >> >> reply!
> > >> >>
> > >> >> Thanks!
> > >> >
> > >> >I'd like to ask a few things.
> > >> >
> > >> >1. Do you have enabled swap?
> > >> I set CONFIG_SWAP=y, but I didn't really have a swap partition, that
> > >> means my swap buffer size is 0;
> > >> >2. Do you enable CONFIG_COMPACTION?
> > >> No, I didn't enable;
> > >> >3. Could we get your zoneinfo via cat /proc/zoneinfo?
> > >> I dump some info from ramdump, please review:
> > >
> > >Thanks for the information.
> > >You said the order-2 allocation failed, so I will assume the preferred zone
> > >is the normal zone, not the high zone, because high-order allocations on the
> > >kernel side don't come from the high zone.
> > Yes, that's right!
> > >
> > >> crash> kmem -z
> > >> NODE: 0  ZONE: 0  ADDR: c08460c0  NAME: "Normal"
> > >>   SIZE: 192512  PRESENT: 182304  MIN/LOW/HIGH: 853/1066/1279
> > >
> > >712M normal memory.
> > >
> > >>   VM_STAT:
> > >>           NR_FREE_PAGES: 16092
> > >
> > >There are plenty of free pages, well above the high watermark, but the
> > >zone is heavily fragmented, as the information below shows.
> > >
> > >So kswapd stops scanning this zone once its first loop iteration with
> > >order-2 is done. I mean kswapd falls back to order-0 after the first
> > >iteration because of this:
> > >
> > >        order = sc.order = 0;
> > >
> > >        goto loop_again;
> > >
> > >But this time, zone_watermark_ok_safe() with testorder = 0 on the normal
> > >zone is always true, so scanning of the zone is skipped. It means kswapd
> > >never sets zone->all_unreclaimable to 1.
> > Yes, definitely!
> > >
> > >>        NR_INACTIVE_ANON: 17
> > >>          NR_ACTIVE_ANON: 55091
> > >>        NR_INACTIVE_FILE: 17
> > >>          NR_ACTIVE_FILE: 17
> > >>          NR_UNEVICTABLE: 0
> > >>                NR_MLOCK: 0
> > >>           NR_ANON_PAGES: 55077
> > >
> > >There are about 200M of anon pages and only a few file pages.
> > >You don't have swap, so the reclaimer can't get far.
> > >
> > >>          NR_FILE_MAPPED: 42
> > >>           NR_FILE_PAGES: 69
> > >>           NR_FILE_DIRTY: 0
> > >>            NR_WRITEBACK: 0
> > >>     NR_SLAB_RECLAIMABLE: 1226
> > >>   NR_SLAB_UNRECLAIMABLE: 9373
> > >>            NR_PAGETABLE: 2776
> > >>         NR_KERNEL_STACK: 798
> > >>         NR_UNSTABLE_NFS: 0
> > >>               NR_BOUNCE: 0
> > >>         NR_VMSCAN_WRITE: 91
> > >>     NR_VMSCAN_IMMEDIATE: 115381
> > >>       NR_WRITEBACK_TEMP: 0
> > >>        NR_ISOLATED_ANON: 0
> > >>        NR_ISOLATED_FILE: 0
> > >>                NR_SHMEM: 31
> > >>              NR_DIRTIED: 15256
> > >>              NR_WRITTEN: 11981
> > >> NR_ANON_TRANSPARENT_HUGEPAGES: 0
> > >>
> > >> NODE: 0  ZONE: 1  ADDR: c08464c0  NAME: "HighMem"
> > >>   SIZE: 69632  PRESENT: 69088  MIN/LOW/HIGH: 67/147/228
> > >>   VM_STAT:
> > >>           NR_FREE_PAGES: 161
> > >
> > >Reclaimer should reclaim this zone.
> > >
> > >>        NR_INACTIVE_ANON: 104
> > >>          NR_ACTIVE_ANON: 46114
> > >>        NR_INACTIVE_FILE: 9722
> > >>          NR_ACTIVE_FILE: 12263
> > >
> > >It seems there is lots of room to evict file pages.
> > >
> > >>          NR_UNEVICTABLE: 168
> > >>                NR_MLOCK: 0
> > >>           NR_ANON_PAGES: 46102
> > >>          NR_FILE_MAPPED: 12227
> > >>           NR_FILE_PAGES: 22270
> > >>           NR_FILE_DIRTY: 1
> > >>            NR_WRITEBACK: 0
> > >>     NR_SLAB_RECLAIMABLE: 0
> > >>   NR_SLAB_UNRECLAIMABLE: 0
> > >>            NR_PAGETABLE: 0
> > >>         NR_KERNEL_STACK: 0
> > >>         NR_UNSTABLE_NFS: 0
> > >>               NR_BOUNCE: 0
> > >>         NR_VMSCAN_WRITE: 0
> > >>     NR_VMSCAN_IMMEDIATE: 0
> > >>       NR_WRITEBACK_TEMP: 0
> > >>        NR_ISOLATED_ANON: 0
> > >>        NR_ISOLATED_FILE: 0
> > >>                NR_SHMEM: 117
> > >>              NR_DIRTIED: 7364
> > >>              NR_WRITTEN: 6989
> > >> NR_ANON_TRANSPARENT_HUGEPAGES: 0
> > >>
> > >> ZONE  NAME        SIZE    FREE  MEM_MAP   START_PADDR  START_MAPNR
> > >>   0   Normal    192512   16092  c1200000       0            0
> > >> AREA    SIZE  FREE_AREA_STRUCT  BLOCKS  PAGES
> > >>   0       4k      c08460f0           3      3
> > >>   0       4k      c08460f8         436    436
> > >>   0       4k      c0846100       15237  15237
> > >>   0       4k      c0846108           0      0
> > >>   0       4k      c0846110           0      0
> > >>   1       8k      c084611c          39     78
> > >>   1       8k      c0846124           0      0
> > >>   1       8k      c084612c         169    338
> > >>   1       8k      c0846134           0      0
> > >>   1       8k      c084613c           0      0
> > >>   2      16k      c0846148           0      0
> > >>   2      16k      c0846150           0      0
> > >>   2      16k      c0846158           0      0
> > >> ---------Normal zone all order > 1 has no free pages
> > >> ZONE  NAME        SIZE    FREE  MEM_MAP   START_PADDR  START_MAPNR
> > >>   1   HighMem    69632     161  c17e0000    2f000000       192512
> > >> AREA    SIZE  FREE_AREA_STRUCT  BLOCKS  PAGES
> > >>   0       4k      c08464f0          12     12
> > >>   0       4k      c08464f8           0      0
> > >>   0       4k      c0846500          14     14
> > >>   0       4k      c0846508           3      3
> > >>   0       4k      c0846510           0      0
> > >>   1       8k      c084651c           0      0
> > >>   1       8k      c0846524           0      0
> > >>   1       8k      c084652c           0      0
> > >>   2      16k      c0846548           0      0
> > >>   2      16k      c0846550           0      0
> > >>   2      16k      c0846558           0      0
> > >>   2      16k      c0846560           1      4
> > >>   2      16k      c0846568           0      0
> > >>   5     128k      c08465cc           0      0
> > >>   5     128k      c08465d4           0      0
> > >>   5     128k      c08465dc           0      0
> > >>   5     128k      c08465e4           4    128
> > >>   5     128k      c08465ec           0      0
> > >> ------Other's all zero
> > >>
> > >> Some other zone information I dump from pglist_data
> > >> {
> > >> 	watermark = {853, 1066, 1279},
> > >>       percpu_drift_mark = 0,
> > >>       lowmem_reserve = {0, 2159, 2159},
> > >>       dirty_balance_reserve = 3438,
> > >>       pageset = 0xc07f6144,
> > >>       lock = {
> > >>         {
> > >>           rlock = {
> > >>             raw_lock = {
> > >>               lock = 0
> > >>             },
> > >>             break_lock = 0
> > >>           }
> > >>         }
> > >>       },
> > >> 	all_unreclaimable = 0,
> > >>       reclaim_stat = {
> > >>         recent_rotated = {903355, 960912},
> > >>         recent_scanned = {932404, 2462017}
> > >>       },
> > >>       pages_scanned = 84231,
> > >
> > >I guess most of the scanning happens in the direct reclaim path, but
> > >direct reclaim couldn't reclaim any pages because there is no swap device.
> > >
> > >It means we have to set zone->all_unreclaimable in the direct reclaim path,
> > >too.
> > >Does the patch below fix your problem?
> > Yes, your patch should fix my problem!
> > Actually I also wrote another patch which, after testing, should also fix my
> > issue. But I didn't set zone->all_unreclaimable in the direct reclaim path as
> > you did; I just double-check the zone_reclaimable() status in the
> > all_unreclaimable() function.
> > Maybe your patch is better!
> 
> Nope. I think your patch is better. :)
> The only thing is the analysis of the problem and the description; I think we
> could do better, but unfortunately I don't have enough time today, so I will
> look at it tomorrow.
> Just a nitpick below.
> 
> Thanks.
> 
> > 
> > commit 26d2b60d06234683a81666da55129f9c982271a5
> > Author: Lisa Du <cldu@xxxxxxxxxxx>
> > Date:   Thu Aug 1 10:16:32 2013 +0800
> > 
> >     mm: fix infinite direct_reclaim when memory is very fragmentized
> >     
> >     The latest change to the all_unreclaimable check in direct reclaim is the
> >     following commit:
> >     2011 Apr 14; commit 929bea7c; vmscan:  all_unreclaimable() use
> >                                 zone->all_unreclaimable as a name
> >     In addition, it added an oom_killer_disabled check to avoid reintroducing the
> >     issue of commit d1908362 ("vmscan: check all_unreclaimable in direct reclaim path").
> >
> >     But besides the hibernation case, in which kswapd is frozen, there is another
> >     case that may lead to an infinite loop in direct reclaim. In a real test, the
> >     direct reclaimer rebalanced over 200000 times in __alloc_pages_slowpath(), so
> >     the process was blocked until the watchdog detected and killed it. The root
> >     cause is as follows:
> >
> >     If system memory is very fragmented, e.g. only order-0 and order-1 pages are
> >     left, kswapd goes to sleep because the system cannot be rebalanced for
> >     high-order allocations. But direct reclaim still runs for higher-order
> >     requests, so a zone can reach a state where zone->all_unreclaimable = 0 but
> >     zone->pages_scanned > zone_reclaimable_pages(zone) * 6.
> >     In this case, if a path such as do_fork tries to allocate order-2 memory,
> >     which is not a COSTLY_ORDER, direct reclaim always reports did_some_progress,
> >     so __alloc_pages_slowpath() rebalances again and again. This issue happens
> >     easily in an environment with no swap and no compaction.
> >
> >     So add a further check in all_unreclaimable() to avoid this case.
> >     
> >     Change-Id: Id3266b47c63f5b96aab466fd9f1f44d37e16cdcb
> >     Signed-off-by: Lisa Du <cldu@xxxxxxxxxxx>
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 2cff0d4..34582d9 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2301,7 +2301,9 @@ static bool all_unreclaimable(struct zonelist *zonelist,
> >                         continue;
> >                 if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
> >                         continue;
> > -               if (!zone->all_unreclaimable)
> > +               if (zone->all_unreclaimable)
> > +                       continue;
> 
> Nitpick: If we use zone_reclaimable(), the check above is redundant, and its
> gain is tiny because this path is already slow.
> 
> > +               if (zone_reclaimable(zone))
> >                         return false;
> >         }
> > >
> > >From a5d82159b98f3d90c2f9ff9e486699fb4c67cced Mon Sep 17 00:00:00 2001
> > >From: Minchan Kim <minchan@xxxxxxxxxx>
> > >Date: Thu, 1 Aug 2013 16:18:00 +0900
> > >Subject:[PATCH] mm: set zone->all_unreclaimable in direct reclaim
> > > path
> > >
> > >Lisa reported that there are lots of free pages in a zone but most of
> > >them are order-0 pages, which means the zone is heavily fragmented.
> > >A high-order allocation can then cause a long stall (e.g., 50 seconds)
> > >in the direct reclaim path in an environment with no swap and no
> > >compaction.
> > >
> > >The reason is that kswapd can skip scanning the zone because the zone
> > >has lots of free pages, and kswapd changes its scanning order from
> > >high-order to order-0 after its first iteration because kswapd thinks
> > >order-0 allocation is the most important.
> > >See 73ce02e9 for details.
> > >
> > >The problem is that only kswapd can set zone->all_unreclaimable
> > >to 1 at the moment, so the direct reclaim path loops forever until a
> > >ghost sets zone->all_unreclaimable to 1.
> > >
> > >This patch makes the direct reclaim path set zone->all_unreclaimable
> > >to avoid the infinite loop. So now we don't need a ghost.
> > >
> > >Reported-by: Lisa Du <cldu@xxxxxxxxxxx>
> > >Signed-off-by: Minchan Kim <minchan@xxxxxxxxxx>
> > >---
> > > mm/vmscan.c |   29 ++++++++++++++++++++++++++++-
> > > 1 file changed, 28 insertions(+), 1 deletion(-)
> > >
> > >diff --git a/mm/vmscan.c b/mm/vmscan.c
> > >index 33dc256..f957e87 100644
> > >--- a/mm/vmscan.c
> > >+++ b/mm/vmscan.c
> > >@@ -2317,6 +2317,23 @@ static bool all_unreclaimable(struct zonelist *zonelist,
> > > 	return true;
> > > }
> > >
> > >+static void check_zones_unreclaimable(struct zonelist *zonelist,
> > >+					struct scan_control *sc)
> > >+{
> > >+	struct zoneref *z;
> > >+	struct zone *zone;
> > >+
> > >+	for_each_zone_zonelist_nodemask(zone, z, zonelist,
> > >+			gfp_zone(sc->gfp_mask), sc->nodemask) {
> > >+		if (!populated_zone(zone))
> > >+			continue;
> > >+		if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
> > >+			continue;
> > >+		if (!zone_reclaimable(zone))
> > >+			zone->all_unreclaimable = 1;
> > >+	}
> > >+}
> > >+
> > > /*
> > >  * This is the main entry point to direct page reclaim.
> > >  *
> > >@@ -2370,7 +2387,17 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
> > > 				lru_pages += zone_reclaimable_pages(zone);
> > > 			}
> > >
> > >-			shrink_slab(shrink, sc->nr_scanned, lru_pages);
> > >+			/*
> > >+			 * When a zone has enough order-0 free memory but
> > >+			 * is heavily fragmented and we need a high-order
> > >+			 * page from it, kswapd can skip the zone after
> > >+			 * its first iteration with the high order. So kswapd
> > >+			 * never sets zone->all_unreclaimable to 1, and the
> > >+			 * direct reclaim path needs this check.
> > >+			 */
> > >+			if (!shrink_slab(shrink, sc->nr_scanned, lru_pages))
> > >+				check_zones_unreclaimable(zonelist, sc);
> > >+
> > > 			if (reclaim_state) {
> > > 				sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> > > 				reclaim_state->reclaimed_slab = 0;
> > >--
> > >1.7.9.5
> > >
> > >--
> > >Kind regards,
> > >Minchan Kim
> 

Today I reviewed the current mmotm, because Mel recently changed kswapd a lot,
along with the history of the all_unreclaimable patches.
What I see is that recent mmotm has the same problem if the system has no swap
and no compaction. Of course, compaction defaults to yes, so we could
recommend enabling it so that the system works well, but that is up to the
user, and we should avoid a direct reclaim hang even if the user disables
compaction.

Looking at the patch history, the real culprit is 929bea7c.

"  zone->all_unreclaimable and zone->pages_scanned are neigher atomic
    variables nor protected by lock.  Therefore zones can become a state of
    zone->page_scanned=0 and zone->all_unreclaimable=1.  In this case, current
    all_unreclaimable() return false even though zone->all_unreclaimabe=1."

I understand that problem, but apparently the fix causes Lisa's problem: kswapd
can give up balancing for a high-order allocation to prevent excessive
reclaim, assuming that the process that requested the high-order allocation
can do direct reclaim/compaction itself. But what if the process can't reclaim
because there is no swap but lots of anon pages, and can't compact because of
!CONFIG_COMPACTION?
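
To make the kswapd side concrete, here is a rough user-space sketch of an
order-aware watermark test, fed with the Normal zone free-list counts from
Lisa's dump. It is only loosely modeled on the kernel's check: lowmem_reserve
and allocation flags are ignored, and struct freelist_model,
model_free_pages() and model_watermark_ok() are names made up for this
illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_ORDER 11	/* hypothetical bound for this sketch */

/* Hypothetical, simplified stand-in for a zone's per-order free lists. */
struct freelist_model {
	unsigned long nr_free[MODEL_MAX_ORDER];	/* free blocks per order */
};

/* Total free pages: a block of order o holds (1 << o) pages. */
static unsigned long model_free_pages(const struct freelist_model *z)
{
	unsigned long pages = 0;

	for (int o = 0; o < MODEL_MAX_ORDER; o++)
		pages += z->nr_free[o] << o;
	return pages;
}

/*
 * Simplified order-aware watermark test: pages sitting in blocks smaller
 * than the request cannot satisfy it, so they are discounted one order
 * at a time while the required reserve is halved at each step.
 */
static bool model_watermark_ok(const struct freelist_model *z, int order,
			       unsigned long mark)
{
	long free_pages, min = (long)mark;

	assert(order >= 0 && order < MODEL_MAX_ORDER);
	free_pages = (long)model_free_pages(z) - ((1L << order) - 1);
	if (free_pages <= min)
		return false;
	for (int o = 0; o < order; o++) {
		free_pages -= (long)(z->nr_free[o] << o);
		min >>= 1;
		if (free_pages <= min)
			return false;
	}
	return true;
}

/* Normal zone from the dump: 15676 order-0 blocks and 208 order-1 blocks. */
static const struct freelist_model lisa_normal_zone = {
	.nr_free = { 15676, 208 },
};
```

With these numbers the order-0 test against the high watermark (1279) passes,
since 16092 pages are free, while the order-2 test fails once the 15676
order-0 pages are discounted (413 pages left against a halved reserve of 639).
That is the situation in which kswapd, once it has dropped to order-0, sees
nothing to do in this zone.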

In such a system, an OOM kill is natural, but a hang is not.
So a simple fix is to introduce the zone_reclaimable() check again in
all_unreclaimable(), like this.

What do you think about it?

It's the same patch Lisa posted, so we should give credit
to her/him (sorry, I'm not sure) if we agree on this approach.

Lisa, if KOSAKI agrees with this, could you resend this patch with your SOB?

Thanks.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index a3bf7fd..78f46d8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2367,7 +2367,15 @@ static bool all_unreclaimable(struct zonelist *zonelist,
 			continue;
 		if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 			continue;
-		if (!zone->all_unreclaimable)
+		/*
+		 * zone->pages_scanned can be raced, so we need to
+		 * double-check with zone->all_unreclaimable. Moreover, kswapd
+		 * can skip setting zone->all_unreclaimable = 1 if the zone
+		 * is heavily fragmented but has enough free pages to meet
+		 * the high watermark. In that case kswapd never sets
+		 * all_unreclaimable to 1, so we need zone_reclaimable(), too.
+		 */
+		if (!zone->all_unreclaimable && zone_reclaimable(zone))
 			return false;
 	}
 


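The difference between the old check and one that also consults
zone_reclaimable(), as in Lisa's patch quoted above, can be sketched in user
space as below. This is an illustration under assumptions, not the kernel
code: struct zone_model and the model_*() helpers are made-up names, the
model assumes anon pages only count as reclaimable when swap is available,
and the "* 6" factor is the one quoted in Lisa's changelog.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the struct zone fields involved. */
struct zone_model {
	unsigned long nr_file;		/* active + inactive file pages */
	unsigned long nr_anon;		/* active + inactive anon pages */
	unsigned long pages_scanned;
	bool all_unreclaimable;		/* flag that only kswapd sets */
};

/* Model assumption: anon pages are reclaimable only if swap exists. */
static unsigned long model_reclaimable_pages(const struct zone_model *z,
					     bool have_swap)
{
	return z->nr_file + (have_swap ? z->nr_anon : 0);
}

/* A zone still looks reclaimable until it was scanned 6x its LRU size. */
static bool model_zone_reclaimable(const struct zone_model *z, bool have_swap)
{
	return z->pages_scanned < model_reclaimable_pages(z, have_swap) * 6;
}

/* Old check: any zone kswapd has not flagged keeps direct reclaim going. */
static bool model_all_unreclaimable_old(const struct zone_model *zones,
					size_t n)
{
	assert(zones != NULL);
	for (size_t i = 0; i < n; i++)
		if (!zones[i].all_unreclaimable)
			return false;
	return true;
}

/* Lisa's check: an unflagged zone must also still look reclaimable. */
static bool model_all_unreclaimable_new(const struct zone_model *zones,
					size_t n, bool have_swap)
{
	assert(zones != NULL);
	for (size_t i = 0; i < n; i++) {
		if (zones[i].all_unreclaimable)
			continue;
		if (model_zone_reclaimable(&zones[i], have_swap))
			return false;
	}
	return true;
}

/* Normal zone from the dump: 34 file pages, 55108 anon pages, no swap. */
static const struct zone_model lisa_normal_zone = {
	.nr_file = 17 + 17,
	.nr_anon = 17 + 55091,
	.pages_scanned = 84231,
	.all_unreclaimable = false,
};
```

For the Normal zone in the dump, 34 * 6 = 204 is far below pages_scanned
(84231), so without swap the new check reports that everything is
unreclaimable and the allocator can fall through to OOM, while the old check
keeps returning false because kswapd never set the flag.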
-- 
Kind regards,
Minchan Kim
