RE: Possible deadloop in direct reclaim?

Lisa Du <cldu@xxxxxxxxxxx> · Thu, 1 Aug 2013 18:03:32 -0700



>-----Original Message-----
>From: Minchan Kim [mailto:minchan@xxxxxxxxxx]
>Sent: August 1, 2013 16:43
>To: Lisa Du
>Cc: linux-mm@xxxxxxxxx; KOSAKI Motohiro
>Subject: Re: Possible deadloop in direct reclaim?
>
>On Thu, Aug 01, 2013 at 01:20:34AM -0700, Lisa Du wrote:
>> >-----Original Message-----
>> >From: Minchan Kim [mailto:minchan@xxxxxxxxxx]
>> >Sent: August 1, 2013 15:34
>> >To: Lisa Du
>> >Cc: linux-mm@xxxxxxxxx; KOSAKI Motohiro
>> >Subject: Re: Possible deadloop in direct reclaim?
>> >
>> >On Wed, Jul 31, 2013 at 11:13:07PM -0700, Lisa Du wrote:
>> >> >On Mon, Jul 22, 2013 at 09:58:17PM -0700, Lisa Du wrote:
>> >> >> Dear Sir:
>> >> >> I recently hit a possible endless loop in direct reclaim. After
>> >> >> running many applications, the system reaches a state where memory
>> >> >> is very fragmented, with only order-0 and order-1 blocks left.
>> >> >> Then one process requested an order-2 buffer and entered an endless
>> >> >> direct reclaim. My trace log shows this loop has already run over
>> >> >> 200,000 times. Kswapd was woken up first and then went back to sleep
>> >> >> because it cannot rebalance memory at this order. But
>> >> >> zone->all_unreclaimable remains 1.
>> >> >> Although direct reclaim returns no pages every time, it loops again
>> >> >> and again because zone->all_unreclaimable = 1, even when
>> >> >> zone->pages_scanned has become very large. This blocks the process
>> >> >> for a long time, until a watchdog thread detects it and kills the
>> >> >> process. I know this is __alloc_pages_slowpath, but this is too slow,
>> >> >> right? It can take over 50 seconds or even more.
>> >> >> I don't think this is expected. Can we add the check below to
>> >> >> all_unreclaimable() to terminate this loop?
>> >> >>
>> >> >> @@ -2355,6 +2355,8 @@ static bool all_unreclaimable(struct zonelist *zonelist,
>> >> >>                         continue;
>> >> >>                 if (!zone->all_unreclaimable)
>> >> >>                         return false;
>> >> >> +               if (sc->nr_reclaimed == 0 && !zone_reclaimable(zone))
>> >> >> +                       return true;
>> >> >>         }
>> >> >> BTW: I'm using kernel 3.4. I also searched kernel 3.9 and didn't
>> >> >> find a fix for this issue. Has anyone else hit this issue before?
>> >> >> Any comments are welcome; looking forward to your reply!
>> >> >>
>> >> >> Thanks!
>> >> >
>> >> >I'd like to ask some things.
>> >> >
>> >> >1. Do you have swap enabled?
>> >> I set CONFIG_SWAP=y, but I don't actually have a swap partition, so my swap size is 0.
>> >> >2. Did you enable CONFIG_COMPACTION?
>> >> No, I didn't enable it.
>> >> >3. Could we get your zoneinfo via cat /proc/zoneinfo?
>> >> I dumped some info from a ramdump, please review:
>> >
>> >Thanks for the information.
>> >You said an order-2 allocation failed, so I will assume the preferred zone
>> >is the normal zone, not the high zone, because high-order allocations on the
>> >kernel side don't come from the high zone.
>> Yes, that's right!
>> >
>> >> crash> kmem -z
>> >> NODE: 0  ZONE: 0  ADDR: c08460c0  NAME: "Normal"
>> >>   SIZE: 192512  PRESENT: 182304  MIN/LOW/HIGH: 853/1066/1279
>> >
>> >712M normal memory.
>> >
>> >>   VM_STAT:
>> >>           NR_FREE_PAGES: 16092
>> >
>> >There are plenty of free pages over the high watermark, but there is heavy
>> >fragmentation, as the information below shows.
>> >
>> >So kswapd doesn't scan this zone once the loop iteration with order-2 is
>> >done. I mean kswapd will scan this zone with order-0 after the first
>> >iteration, because of this:
>> >
>> >        order = sc.order = 0;
>> >
>> >        goto loop_again;
>> >
>> >But this time, zone_watermark_ok_safe with testorder = 0 on the normal zone
>> >is always true, so scanning of the zone will be skipped. It means kswapd
>> >never sets zone->all_unreclaimable to 1.
>> Yes, definitely!
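To make the order-0 versus order-2 difference concrete, here is a minimal user-space sketch modeled on __zone_watermark_ok() as I read it in 3.4, with the ALLOC_HIGH/ALLOC_HARDER adjustments left out. The names struct zone_wmark, watermark_ok_model() and normal_zone are hypothetical, lowmem_reserve is assumed to be 0 for the Normal class zone, and the numbers come from the kmem -z dump in this mail.

```c
#include <stdbool.h>

#define MODEL_ORDERS 3	/* orders 0..2 are enough for this example */

/* Hypothetical stand-in for the few struct zone fields the check uses. */
struct zone_wmark {
	long nr_free_pages;			/* NR_FREE_PAGES */
	long nr_free_blocks[MODEL_ORDERS];	/* free blocks per order */
	long lowmem_reserve;			/* assumed reserve for the class zone */
};

/*
 * For every order below the requested one, discount the free pages that
 * sit in blocks too small to help and halve the required minimum.
 */
static bool watermark_ok_model(const struct zone_wmark *z, int order, long mark)
{
	long free_pages = z->nr_free_pages - ((1L << order) - 1);
	long min = mark;
	int o;

	if (free_pages <= min + z->lowmem_reserve)
		return false;

	for (o = 0; o < order; o++) {
		free_pages -= z->nr_free_blocks[o] << o;
		min >>= 1;
		if (free_pages <= min)
			return false;
	}
	return true;
}

/* Normal zone from the dump: 15676 order-0 blocks, 208 order-1 blocks. */
static const struct zone_wmark normal_zone = {
	.nr_free_pages = 16092,
	.nr_free_blocks = { 3 + 436 + 15237, 39 + 169, 0 },
	.lowmem_reserve = 0,
};
```

With these numbers the order-0 check passes against the high watermark (1279), while the order-2 check fails even against the min watermark (853): only 413 pages remain in the calculation once the order-0 blocks are discounted.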
>> >
>> >>        NR_INACTIVE_ANON: 17
>> >>          NR_ACTIVE_ANON: 55091
>> >>        NR_INACTIVE_FILE: 17
>> >>          NR_ACTIVE_FILE: 17
>> >>          NR_UNEVICTABLE: 0
>> >>                NR_MLOCK: 0
>> >>           NR_ANON_PAGES: 55077
>> >
>> >There are about 200M of anon pages and few file pages.
>> >You don't have swap, so the reclaimer couldn't get far.
>> >
>> >>          NR_FILE_MAPPED: 42
>> >>           NR_FILE_PAGES: 69
>> >>           NR_FILE_DIRTY: 0
>> >>            NR_WRITEBACK: 0
>> >>     NR_SLAB_RECLAIMABLE: 1226
>> >>   NR_SLAB_UNRECLAIMABLE: 9373
>> >>            NR_PAGETABLE: 2776
>> >>         NR_KERNEL_STACK: 798
>> >>         NR_UNSTABLE_NFS: 0
>> >>               NR_BOUNCE: 0
>> >>         NR_VMSCAN_WRITE: 91
>> >>     NR_VMSCAN_IMMEDIATE: 115381
>> >>       NR_WRITEBACK_TEMP: 0
>> >>        NR_ISOLATED_ANON: 0
>> >>        NR_ISOLATED_FILE: 0
>> >>                NR_SHMEM: 31
>> >>              NR_DIRTIED: 15256
>> >>              NR_WRITTEN: 11981
>> >> NR_ANON_TRANSPARENT_HUGEPAGES: 0
>> >>
>> >> NODE: 0  ZONE: 1  ADDR: c08464c0  NAME: "HighMem"
>> >>   SIZE: 69632  PRESENT: 69088  MIN/LOW/HIGH: 67/147/228
>> >>   VM_STAT:
>> >>           NR_FREE_PAGES: 161
>> >
>> >Reclaimer should reclaim this zone.
>> >
>> >>        NR_INACTIVE_ANON: 104
>> >>          NR_ACTIVE_ANON: 46114
>> >>        NR_INACTIVE_FILE: 9722
>> >>          NR_ACTIVE_FILE: 12263
>> >
>> >It seems there is a lot of room to evict file pages.
>> >
>> >>          NR_UNEVICTABLE: 168
>> >>                NR_MLOCK: 0
>> >>           NR_ANON_PAGES: 46102
>> >>          NR_FILE_MAPPED: 12227
>> >>           NR_FILE_PAGES: 22270
>> >>           NR_FILE_DIRTY: 1
>> >>            NR_WRITEBACK: 0
>> >>     NR_SLAB_RECLAIMABLE: 0
>> >>   NR_SLAB_UNRECLAIMABLE: 0
>> >>            NR_PAGETABLE: 0
>> >>         NR_KERNEL_STACK: 0
>> >>         NR_UNSTABLE_NFS: 0
>> >>               NR_BOUNCE: 0
>> >>         NR_VMSCAN_WRITE: 0
>> >>     NR_VMSCAN_IMMEDIATE: 0
>> >>       NR_WRITEBACK_TEMP: 0
>> >>        NR_ISOLATED_ANON: 0
>> >>        NR_ISOLATED_FILE: 0
>> >>                NR_SHMEM: 117
>> >>              NR_DIRTIED: 7364
>> >>              NR_WRITTEN: 6989
>> >> NR_ANON_TRANSPARENT_HUGEPAGES: 0
>> >>
>> >> ZONE  NAME        SIZE    FREE  MEM_MAP   START_PADDR  START_MAPNR
>> >>   0   Normal    192512   16092  c1200000       0            0
>> >> AREA    SIZE  FREE_AREA_STRUCT  BLOCKS  PAGES
>> >>   0       4k      c08460f0           3      3
>> >>   0       4k      c08460f8         436    436
>> >>   0       4k      c0846100       15237  15237
>> >>   0       4k      c0846108           0      0
>> >>   0       4k      c0846110           0      0
>> >>   1       8k      c084611c          39     78
>> >>   1       8k      c0846124           0      0
>> >>   1       8k      c084612c         169    338
>> >>   1       8k      c0846134           0      0
>> >>   1       8k      c084613c           0      0
>> >>   2      16k      c0846148           0      0
>> >>   2      16k      c0846150           0      0
>> >>   2      16k      c0846158           0      0
>> >> --------- In the Normal zone, no order above 1 has any free pages
>> >> ZONE  NAME        SIZE    FREE  MEM_MAP   START_PADDR  START_MAPNR
>> >>   1   HighMem    69632     161  c17e0000    2f000000      192512
>> >> AREA    SIZE  FREE_AREA_STRUCT  BLOCKS  PAGES
>> >>   0       4k      c08464f0          12     12
>> >>   0       4k      c08464f8           0      0
>> >>   0       4k      c0846500          14     14
>> >>   0       4k      c0846508           3      3
>> >>   0       4k      c0846510           0      0
>> >>   1       8k      c084651c           0      0
>> >>   1       8k      c0846524           0      0
>> >>   1       8k      c084652c           0      0
>> >>   2      16k      c0846548           0      0
>> >>   2      16k      c0846550           0      0
>> >>   2      16k      c0846558           0      0
>> >>   2      16k      c0846560           1      4
>> >>   2      16k      c0846568           0      0
>> >>   5     128k      c08465cc           0      0
>> >>   5     128k      c08465d4           0      0
>> >>   5     128k      c08465dc           0      0
>> >>   5     128k      c08465e4           4    128
>> >>   5     128k      c08465ec           0      0
>> >> ------ All other orders are zero
>> >>
>> >> Some other zone information I dumped from pglist_data
>> >> {
>> >>   watermark = {853, 1066, 1279},
>> >>       percpu_drift_mark = 0,
>> >>       lowmem_reserve = {0, 2159, 2159},
>> >>       dirty_balance_reserve = 3438,
>> >>       pageset = 0xc07f6144,
>> >>       lock = {
>> >>         {
>> >>           rlock = {
>> >>             raw_lock = {
>> >>               lock = 0
>> >>             },
>> >>             break_lock = 0
>> >>           }
>> >>         }
>> >>       },
>> >>   all_unreclaimable = 0,
>> >>       reclaim_stat = {
>> >>         recent_rotated = {903355, 960912},
>> >>         recent_scanned = {932404, 2462017}
>> >>       },
>> >>       pages_scanned = 84231,
>> >
>> >Most of the scanning happens in the direct reclaim path, I guess,
>> >but direct reclaim couldn't reclaim any pages due to the lack of a swap device.
>> >
>> >It means we have to set zone->all_unreclaimable in the direct reclaim path,
>> >too.
>> >Does the patch below fix your problem?
>> Yes, your patch should fix my problem!
>> Actually I also wrote another patch which, after testing, should also fix my issue.
>> I didn't set zone->all_unreclaimable in the direct reclaim path as you did;
>> I just double-check the zone_reclaimable() status in the all_unreclaimable() function.
>> Maybe your patch is better!
>
>Nope. I think your patch is better. :)
>The only thing is the analysis of the problem and the description; I think we
>could do better, but unfortunately I don't have enough time today, so I will
>look at it tomorrow.
>Just a nitpick below.
>
>Thanks.
>
>>
>> commit 26d2b60d06234683a81666da55129f9c982271a5
>> Author: Lisa Du <cldu@xxxxxxxxxxx>
>> Date:   Thu Aug 1 10:16:32 2013 +0800
>>
>>     mm: fix infinite direct_reclaim when memory is very fragmented
>>
>>     The latest all_unreclaimable check in direct reclaim is the following commit:
>>     2011 Apr 14; commit 929bea7c; vmscan: all_unreclaimable() use
>>                                 zone->all_unreclaimable as a name
>>     In addition, it added an oom_killer_disabled check to avoid reintroducing the
>>     issue of commit d1908362 ("vmscan: check all_unreclaimable in direct reclaim path").
>>
>>     But besides the hibernation case, in which kswapd is frozen, there is another
>>     case that may lead to an infinite loop in direct reclaim. In a real test, the
>>     direct reclaimer rebalanced over 200000 times in __alloc_pages_slowpath(), so
>>     the process was blocked until the watchdog detected and killed it. The root
>>     cause is as follows:
>>
>>     If system memory is very fragmented, e.g. only order-0 and order-1 blocks are
>>     left, kswapd will go to sleep because the system cannot be rebalanced for
>>     high-order allocations. But direct reclaim still runs for higher-order
>>     requests. So a zone can reach a state where zone->all_unreclaimable = 0 but
>>     zone->pages_scanned > zone_reclaimable_pages(zone) * 6.
>>     In this case, if a process such as do_fork tries to allocate order-2 memory,
>>     which is not a COSTLY_ORDER allocation, direct reclaim always reports
>>     did_some_progress, so it rebalances again and again in
>>     __alloc_pages_slowpath(). This issue easily happens in an environment with
>>     no swap and no compaction.
>>
>>     So add a further check in all_unreclaimable() to avoid this case.
>>
>>     Change-Id: Id3266b47c63f5b96aab466fd9f1f44d37e16cdcb
>>     Signed-off-by: Lisa Du <cldu@xxxxxxxxxxx>
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 2cff0d4..34582d9 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -2301,7 +2301,9 @@ static bool all_unreclaimable(struct zonelist *zonelist,
>>                         continue;
>>                 if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
>>                         continue;
>> -               if (!zone->all_unreclaimable)
>> +               if (zone->all_unreclaimable)
>> +                       continue;
>
>Nitpick: If we use zone_reclaimable(), the above check is redundant and the
>gain is very tiny because this path is already slow.
Yes, I agree. I added the above check only to avoid the issue Kosaki hit, which was fixed by commit 929bea7c.
In short, it covers the case where zone->all_unreclaimable = 1 but zone->pages_scanned = 0, so checking only zone_reclaimable() would not be enough.
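As a rough illustration of why both checks matter, here is a user-space sketch of the three variants. It assumes zone_reclaimable() is the "pages_scanned < reclaimable pages * 6" test mentioned in my commit message; struct zone_scan and the function names are hypothetical, and the cpuset and populated-zone filters are left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the struct zone fields involved. */
struct zone_scan {
	bool all_unreclaimable;		/* flag normally set by kswapd */
	unsigned long pages_scanned;
	unsigned long reclaimable_pages; /* what zone_reclaimable_pages() would report */
};

/* Assumed heuristic: still reclaimable while scans stay below 6x the LRU size. */
static bool zone_reclaimable_model(const struct zone_scan *z)
{
	return z->pages_scanned < z->reclaimable_pages * 6;
}

/* Before the patch: only the kswapd-maintained flag is consulted. */
static bool all_unreclaimable_old(const struct zone_scan *zones, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (!zones[i].all_unreclaimable)
			return false;
	return true;
}

/* Variant that drops the flag and trusts zone_reclaimable() alone. */
static bool all_unreclaimable_scan_only(const struct zone_scan *zones, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (zone_reclaimable_model(&zones[i]))
			return false;
	return true;
}

/* The patch quoted above: skip flagged zones, then test zone_reclaimable(). */
static bool all_unreclaimable_new(const struct zone_scan *zones, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (zones[i].all_unreclaimable)
			continue;
		if (zone_reclaimable_model(&zones[i]))
			return false;
	}
	return true;
}
```

For the Normal zone in my dump (flag 0, pages_scanned 84231, 17 + 17 file pages and no swap) the old check returns false and the loop continues, while the new check returns true. For a zone with the flag set but pages_scanned = 0, the scan-only variant returns false, which is the case the extra flag test is there for.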
>
>> +               if (zone_reclaimable(zone))
>>                         return false;
>>         }
>> >
>> >From a5d82159b98f3d90c2f9ff9e486699fb4c67cced Mon Sep 17 00:00:00 2001
>> >From: Minchan Kim <minchan@xxxxxxxxxx>
>> >Date: Thu, 1 Aug 2013 16:18:00 +0900
>> >Subject:[PATCH] mm: set zone->all_unreclaimable in direct reclaim
>> > path
>> >
>> >Lisa reported there are lots of free pages in a zone but most of them
>> >are order-0 pages, which means the zone is heavily fragmented.
>> >Then a high-order allocation can cause a long stall in the direct reclaim
>> >path (e.g., 50 seconds) in an environment with no swap and no compaction.
>> >
>> >The reason is that kswapd can skip scanning the zone because the zone
>> >has lots of free pages, and kswapd changes the scanning order from
>> >high-order to order-0 after its first iteration is done, because kswapd
>> >thinks order-0 allocation is the most important.
>> >See 73ce02e9 for details.
>> >
>> >The problem is that only kswapd can set zone->all_unreclaimable
>> >to 1 at the moment, so the direct reclaim path would loop forever until a
>> >ghost sets zone->all_unreclaimable to 1.
>> >
>> >This patch makes the direct reclaim path set zone->all_unreclaimable
>> >to avoid the infinite loop. So now we don't need a ghost.
>> >
>> >Reported-by: Lisa Du <cldu@xxxxxxxxxxx>
>> >Signed-off-by: Minchan Kim <minchan@xxxxxxxxxx>
>> >---
>> > mm/vmscan.c |   29 ++++++++++++++++++++++++++++-
>> > 1 file changed, 28 insertions(+), 1 deletion(-)
>> >
>> >diff --git a/mm/vmscan.c b/mm/vmscan.c
>> >index 33dc256..f957e87 100644
>> >--- a/mm/vmscan.c
>> >+++ b/mm/vmscan.c
>> >@@ -2317,6 +2317,23 @@ static bool all_unreclaimable(struct zonelist *zonelist,
>> >    return true;
>> > }
>> >
>> >+static void check_zones_unreclaimable(struct zonelist *zonelist,
>> >+                                   struct scan_control *sc)
>> >+{
>> >+   struct zoneref *z;
>> >+   struct zone *zone;
>> >+
>> >+   for_each_zone_zonelist_nodemask(zone, z, zonelist,
>> >+                   gfp_zone(sc->gfp_mask), sc->nodemask) {
>> >+           if (!populated_zone(zone))
>> >+                   continue;
>> >+           if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
>> >+                   continue;
>> >+           if (!zone_reclaimable(zone))
>> >+                   zone->all_unreclaimable = 1;
>> >+   }
>> >+}
>> >+
>> > /*
>> >  * This is the main entry point to direct page reclaim.
>> >  *
>> >@@ -2370,7 +2387,17 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>> >                            lru_pages += zone_reclaimable_pages(zone);
>> >                    }
>> >
>> >-                   shrink_slab(shrink, sc->nr_scanned, lru_pages);
>> >+                   /*
>> >+                    * When a zone has enough order-0 free memory but
>> >+                    * zone is heavily fragmented and we need high order
>> >+                    * page from the zone, kswapd could skip the zone
>> >+                    * after first iteration with high order. So, kswapd
>> >+                    * never set the zone->all_unreclaimable to 1 so
>> >+                    * direct reclaim path needs the check.
>> >+                    */
>> >+                   if (!shrink_slab(shrink, sc->nr_scanned, lru_pages))
>> >+                           check_zones_unreclaimable(zonelist, sc);
>> >+
>> >                    if (reclaim_state) {
>> >                            sc->nr_reclaimed += reclaim_state->reclaimed_slab;
>> >                            reclaim_state->reclaimed_slab = 0;
>> >--
>> >1.7.9.5
>> >
>> >--
>> >Kind regards,
>> >Minchan Kim
>
>--
>Kind regards,
>Minchan Kim
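
For reference, the retry behaviour described in my commit message can be sketched in user space as below. This is only a model under assumptions: reclaim_progress() mirrors the tail of do_try_to_free_pages() as I read it in 3.4 (the oom_killer_disabled check is omitted), and struct reclaim_result, reclaim_progress() and count_retries() are hypothetical names.

```c
#include <stdbool.h>

/* Hypothetical summary of one direct reclaim pass. */
struct reclaim_result {
	unsigned long nr_reclaimed;	/* pages actually freed */
	bool all_unreclaimable;		/* what all_unreclaimable() would return */
};

/*
 * Nonzero means "made progress". Note that a pass which frees nothing
 * still reports progress unless all_unreclaimable() says to give up.
 */
static unsigned long reclaim_progress(const struct reclaim_result *r)
{
	if (r->nr_reclaimed)
		return r->nr_reclaimed;
	if (!r->all_unreclaimable)
		return 1;
	return 0;
}

/*
 * Count how often a non-costly allocation that never succeeds would be
 * retried, capped at limit so the model itself terminates.
 */
static unsigned long count_retries(const struct reclaim_result *r,
				   unsigned long limit)
{
	unsigned long retries = 0;

	while (retries < limit && reclaim_progress(r))
		retries++;
	return retries;
}
```

With nr_reclaimed = 0 and all_unreclaimable() returning false, the model retries until it hits the cap, which matches the 200,000+ iterations in my trace; once all_unreclaimable() returns true, it stops immediately.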