RE: Possible deadloop in direct reclaim?

Lisa Du <cldu@xxxxxxxxxxx> · Tue, 23 Jul 2013 18:31:02 -0700




Dear Bob,
    Thank you so much for the careful review. Yes, it's a typo; I meant zone->all_unreclaimable = 0.
    You suggested changing the check in kswapd_shrink_zone(), but I couldn't find this function in kernel 3.4 or kernel 3.9.
    Is this function called in the direct reclaim path?
    As I mentioned, this issue happens after the kswapd thread has gone to sleep. If the function is only called from kswapd, then I don't think it can help.

Thanks!

Best Regards
Lisa Du


-----Original Message-----
From: Bob Liu [mailto:lliubbo@xxxxxxxxx] 
Sent: July 24, 2013 9:18
To: Lisa Du
Cc: linux-mm@xxxxxxxxx; Christoph Lameter; Mel Gorman
Subject: Re: Possible deadloop in direct reclaim?

On Tue, Jul 23, 2013 at 12:58 PM, Lisa Du <cldu@xxxxxxxxxxx> wrote:
> Dear Sir:
>
> I have run into a possible deadloop in direct reclaim. After running plenty
> of applications, the system reached a state where memory is very
> fragmented, with only order-0 and order-1 pages left.
>
> Then one process requested an order-2 buffer and entered an endless direct
> reclaim loop. From my trace log, I can see this loop has already run over
> 200,000 times. Kswapd was first woken up and then went back to sleep, as it
> could not rebalance memory at this order. But zone->all_unreclaimable
> remains 1.
>
> Although direct reclaim returns no pages every time, because
> zone->all_unreclaimable = 1 it loops again and again, even when
> zone->pages_scanned has become very large. This blocks the process for a
> long time, until some watchdog thread detects it and kills the process.
> I know this is in __alloc_pages_slowpath, but it is too slow, right? It can
> take over 50 seconds or even more.

You must mean zone->all_unreclaimable = 0?
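
To spell out why the value matters, here is a minimal sketch. It is a toy
model, not kernel code: struct toy_zone, toy_all_unreclaimable() and
toy_reclaim_progress() are hypothetical names, and the logic only assumes
that direct reclaim reports "progress" to the allocator whenever some zone
is not yet marked all_unreclaimable, even if no page was freed.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the one struct zone field discussed here. */
struct toy_zone {
    bool all_unreclaimable;
};

/* True only if every zone in the list is already marked unreclaimable. */
static bool toy_all_unreclaimable(const struct toy_zone *zones, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!zones[i].all_unreclaimable)
            return false;
    }
    return true;
}

/*
 * Assumed shape of the progress report: a non-zero return tells the
 * caller to retry the allocation instead of giving up.
 */
static unsigned long toy_reclaim_progress(unsigned long nr_reclaimed,
                                          const struct toy_zone *zones,
                                          size_t n)
{
    if (nr_reclaimed)
        return nr_reclaimed;
    /* Nothing freed, but some zone still looks reclaimable: claim progress. */
    if (!toy_all_unreclaimable(zones, n))
        return 1;
    return 0;
}
```

In this model, a zone with all_unreclaimable = 0 and zero pages freed still
yields a return value of 1, so the caller would keep retrying; once the flag
is 1 the function returns 0 and the retry loop can end.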

>
> I think this is not the expected behavior, right? Can we also add the check
> below to the function all_unreclaimable() to terminate this loop?
>
> @@ -2355,6 +2355,8 @@ static bool all_unreclaimable(struct zonelist *zonelist,
>                         continue;
>                 if (!zone->all_unreclaimable)
>                         return false;
> +               if (sc->nr_reclaimed == 0 && !zone_reclaimable(zone))
> +                       return true;
>

How about replacing the check in kswapd_shrink_zone() instead?

@@ -2824,7 +2824,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
        /* Account for the number of pages attempted to reclaim */
        *nr_attempted += sc->nr_to_reclaim;

-       if (nr_slab == 0 && !zone_reclaimable(zone))
+       if (sc->nr_reclaimed == 0 && !zone_reclaimable(zone))
                zone->all_unreclaimable = 1;

        zone_clear_flag(zone, ZONE_WRITEBACK);


I think the current check is wrong: reclaiming slab doesn't mean a page
was reclaimed.
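
The difference between the two conditions can be sketched with a toy model.
This is not kernel code: struct demo_zone and the demo_* helpers are
hypothetical, and the factor of 6 in demo_zone_reclaimable() is an assumption
about how zone_reclaimable() compares pages_scanned with the number of
reclaimable pages.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the struct zone fields involved. */
struct demo_zone {
    unsigned long pages_scanned;
    unsigned long reclaimable_pages;
    bool all_unreclaimable;
};

/* Assumed heuristic: give up once scanning far exceeds what is reclaimable. */
static bool demo_zone_reclaimable(const struct demo_zone *z)
{
    return z->pages_scanned < z->reclaimable_pages * 6;
}

/* Current check: keyed on slab objects freed, not on pages freed. */
static void demo_mark_old(struct demo_zone *z, unsigned long nr_slab)
{
    if (nr_slab == 0 && !demo_zone_reclaimable(z))
        z->all_unreclaimable = true;
}

/* Proposed check: keyed on the number of pages actually reclaimed. */
static void demo_mark_new(struct demo_zone *z, unsigned long nr_reclaimed)
{
    if (nr_reclaimed == 0 && !demo_zone_reclaimable(z))
        z->all_unreclaimable = true;
}
```

In this model, a pass that frees a few slab objects but no pages leaves the
zone unmarked under the old check, while the new check marks it
all_unreclaimable.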

-- 
Regards,
--Bob