RE: Possible deadloop in direct reclaim?

Lisa Du <cldu@xxxxxxxxxxx> · Tue, 23 Jul 2013 18:21:29 -0700

Dear Christoph
   Thanks a lot for your comment. When this issue happened, I triggered a kernel panic and got a kdump.
From the kdump, I extracted the global pg_data_t variable contig_page_data. In this structure I can see that, in the Normal zone, only order-0 (nr_free = 18442) and order-1 (nr_free = 367) have free blocks; nr_free is 0 for all other orders.
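
To put those numbers together, here is a minimal user-space sketch (not kernel code). The function names are hypothetical, and the array length of 11 orders used in the usage note assumes the default MAX_ORDER; adjust it if your kernel is configured differently.

```c
#include <stddef.h>

/*
 * Hypothetical helpers for illustration only, not kernel APIs.
 * nr_free[o] is the number of free blocks of order o, as read from
 * zone->free_area[o].nr_free in the dump.
 */

/* Total free pages: each free block of order o spans 2^o pages. */
unsigned long total_free_pages(const unsigned long *nr_free, size_t n_orders)
{
    unsigned long total = 0;

    for (size_t order = 0; order < n_orders; order++)
        total += nr_free[order] << order;
    return total;
}

/* Largest order that still has at least one free block, or -1 if none. */
int highest_free_order(const unsigned long *nr_free, size_t n_orders)
{
    for (size_t order = n_orders; order-- > 0; )
        if (nr_free[order] > 0)
            return (int)order;
    return -1;
}
```

With nr_free = {18442, 367, 0, ...}, the total is 18442 + 2 * 367 = 19176 free pages, yet the highest free order is 1, so an order-2 request cannot be served from the free lists as they stand.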

Thanks!

Best Regards
Lisa Du

-----Original Message-----
From: Christoph Lameter [mailto:cl@xxxxxxxxx] 
Sent: July 24, 2013 4:29
To: Lisa Du
Cc: linux-mm@xxxxxxxxx; Mel Gorman
Subject: Re: Possible deadloop in direct reclaim?

On Mon, 22 Jul 2013, Lisa Du wrote:

> I have run into a possible dead loop in direct reclaim. After running many applications, the system gets into a state where memory is heavily fragmented, with only order-0 and order-1 blocks left.

Can you verify that by doing a

 cat /proc/buddyinfo

?
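
If you want to check this programmatically, a minimal sketch of parsing one line of that file follows. It assumes the usual "Node N, zone NAME count count ..." layout with one column per order; the function name and the 16-byte zone buffer are hypothetical choices for illustration.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Parse one /proc/buddyinfo line into per-order free block counts.
 * Hypothetical helper, assuming the layout:
 *   Node 0, zone   Normal   <order-0> <order-1> ...
 * Returns the number of counts stored, or -1 if the prefix does not match.
 */
int parse_buddyinfo_line(const char *line, char zone[16],
                         unsigned long *counts, size_t max_orders)
{
    int node = 0;
    int off = 0;
    size_t n = 0;

    /* %n records how many characters the prefix consumed. */
    if (sscanf(line, "Node %d, zone %15s%n", &node, zone, &off) < 2)
        return -1;

    const char *p = line + off;
    while (n < max_orders) {
        char *end;
        unsigned long v = strtoul(p, &end, 10);

        if (end == p)   /* no more numbers on this line */
            break;
        counts[n++] = v;
        p = end;
    }
    return (int)n;
}
```

A Normal zone line in which every column from the third one onward is 0 would confirm that only order-0 and order-1 blocks are left.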

> Then one process requested an order-2 buffer, but it entered an endless
> direct reclaim loop. From my trace log, I can see this loop has already run
> over 200,000 times. Kswapd was first woken up and then went back to sleep,
> as it cannot rebalance memory at this order. But zone->all_unreclaimable
> remains 1. Although direct reclaim returns no pages every time,
> zone->all_unreclaimable = 1, so it loops again and again, even when
> zone->pages_scanned has become very large. This blocks the process
> for a long time, until some watchdog thread detects it and kills the
> process. It is in __alloc_pages_slowpath, but this is too slow, right?
> It may take over 50 seconds or even more.

> I think this is not the expected behavior, right? Can we also add the check
> below to the function all_unreclaimable() to terminate this loop?
>
> @@ -2355,6 +2355,8 @@ static bool all_unreclaimable(struct zonelist *zonelist,
>                         continue;
>                 if (!zone->all_unreclaimable)
>                         return false;
> +               if (sc->nr_reclaimed == 0 && !zone_reclaimable(zone))
> +                       return true;
>         }

Mel?
