On Tue, 2011-10-11 at 14:54 +0800, Minchan Kim wrote:
> On Tue, Oct 11, 2011 at 01:30:10PM +0800, Shaohua Li wrote:
> > On Mon, 2011-10-10 at 23:42 +0800, Minchan Kim wrote:
> > > On Mon, Oct 10, 2011 at 03:28:13PM +0800, Shaohua Li wrote:
> > > > On Sun, 2011-10-09 at 23:10 +0800, Minchan Kim wrote:
> > > > > On Sun, Oct 09, 2011 at 04:17:51PM +0800, Shaohua Li wrote:
> > > > > > On Sun, 2011-10-09 at 16:01 +0800, Minchan Kim wrote:
> > > > > > > On Sun, Oct 09, 2011 at 01:53:11PM +0800, Shaohua Li wrote:
> > > > > > > > On Sat, 2011-10-08 at 18:25 +0800, Minchan Kim wrote:
> > > > > > > > > On Sat, Oct 08, 2011 at 01:56:52PM +0800, Shaohua Li wrote:
> > > > > > > > > > On Sat, 2011-10-08 at 11:35 +0800, Shaohua Li wrote:
> > > > > > > > > > > On Sat, 2011-10-08 at 11:19 +0800, David Rientjes wrote:
> > > > > > > > > > > > On Sat, 8 Oct 2011, Shaohua Li wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > has_under_min_watermark_zone is used to detect whether there is a risk of
> > > > > > > > > > > > > GFP_ATOMIC allocation failure. For a high end_zone, if any zone at or below
> > > > > > > > > > > > > it has its min watermark ok, there is no risk. But the current logic flags a
> > > > > > > > > > > > > risk as soon as any zone has its min watermark not ok. This seems wrong to me.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Signed-off-by: Shaohua Li <shaohua.li@xxxxxxxxx>
> > > > > > > > > > > > > ---
> > > > > > > > > > > > >  mm/vmscan.c |    7 ++++---
> > > > > > > > > > > > >  1 file changed, 4 insertions(+), 3 deletions(-)
> > > > > > > > > > > > >
> > > > > > > > > > > > > Index: linux/mm/vmscan.c
> > > > > > > > > > > > > ===================================================================
> > > > > > > > > > > > > --- linux.orig/mm/vmscan.c	2011-09-27 15:09:29.000000000 +0800
> > > > > > > > > > > > > +++ linux/mm/vmscan.c	2011-09-27 15:14:45.000000000 +0800
> > > > > > > > > > > > > @@ -2463,7 +2463,7 @@ loop_again:
> > > > > > > > > > > > >
> > > > > > > > > > > > >  	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
> > > > > > > > > > > > >  		unsigned long lru_pages = 0;
> > > > > > > > > > > > > -		int has_under_min_watermark_zone = 0;
> > > > > > > > > > > > > +		int has_under_min_watermark_zone = 1;
> > > > > > > > > > > >
> > > > > > > > > > > > bool
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >  		/* The swap token gets in the way of swapout... */
> > > > > > > > > > > > >  		if (!priority)
> > > > > > > > > > > > > @@ -2594,9 +2594,10 @@ loop_again:
> > > > > > > > > > > > >  				 * means that we have a GFP_ATOMIC allocation
> > > > > > > > > > > > >  				 * failure risk. Hurry up!
> > > > > > > > > > > > >  				 */
> > > > > > > > > > > > > -				if (!zone_watermark_ok_safe(zone, order,
> > > > > > > > > > > > > +				if (has_under_min_watermark_zone &&
> > > > > > > > > > > > > +				    zone_watermark_ok_safe(zone, order,
> > > > > > > > > > > > >  					    min_wmark_pages(zone), end_zone, 0))
> > > > > > > > > > > > > -					has_under_min_watermark_zone = 1;
> > > > > > > > > > > > > +					has_under_min_watermark_zone = 0;
> > > > > > > > > > > > >  			} else {
> > > > > > > > > > > > >  				/*
> > > > > > > > > > > > >  				 * If a zone reaches its high watermark,
> > > > > > > > > > > >
> > > > > > > > > > > > Ignore the min watermark check for a moment and consider the case where all
> > > > > > > > > > > > zones are above the high watermark (a situation where kswapd does not need to
> > > > > > > > > > > > do aggressive reclaim): has_under_min_watermark_zone doesn't get cleared, and
> > > > > > > > > > > > kswapd never actually stalls on congestion_wait(). Notice this is
> > > > > > > > > > > > congestion_wait() and not wait_iff_congested(), so the clearing of
> > > > > > > > > > > > ZONE_CONGESTED doesn't prevent this.
> > > > > > > > > > > If all zones are above the high watermark, we will have i < 0 when
> > > > > > > > > > > detecting the highest imbalanced zone, and the whole loop will end
> > > > > > > > > > > without running into congestion_wait().
> > > > > > > > > > > Or I can add code to clear has_under_min_watermark_zone in the else
> > > > > > > > > > > block, to be safe.
> > > > > > > > > > Subject: vmscan: correctly detect GFP_ATOMIC allocation failure -v2
> > > > > > > > > >
> > > > > > > > > > has_under_min_watermark_zone is used to detect whether there is a risk of
> > > > > > > > > > GFP_ATOMIC allocation failure. For a high end_zone, if any zone at or below
> > > > > > > > > > it has its min watermark ok, there is no risk. But the current logic flags a
> > > > > > > > > > risk as soon as any zone has its min watermark not ok. This seems wrong to me.
> > > > > > > > >
> > > > > > > > > I think it's not a right-or-wrong problem but a policy question.
> > > > > > > > > If we only start busy reclaiming for atomic allocations after we see that
> > > > > > > > > all lower zones' min watermark pages are already consumed, it could mean
> > > > > > > > > long latency and is likely to fail an atomic allocation stream (there is
> > > > > > > > > nothing direct reclaim can do about an atomic allocation failure, so
> > > > > > > > > kswapd should always make its best effort for it).
> > > > > > > > >
> > > > > > > > > I don't mean you are wrong, but we have to be very careful about this
> > > > > > > > > and at least need some experiments with an atomic allocation stream, I think.
> > > > > > > > Yes, this is a policy problem. I just don't want kswapd to keep running
> > > > > > > > even when there is no immediate risk of atomic allocation failure.
> > > > > > > > One problem here is that end_zone could be high, and the low zone never meets
> > > > > > > > its min watermark. So kswapd keeps running without a wait and builds
> > > > > > > > up a big priority.
> > > > > > >
> > > > > > > It could be, but I think it's an admin mistake if he runs such a rare system.
> > > > > > > Couldn't he lower the reserved pages for highmem?
> > > > > > It's not about the admin changing reserved pages; we still have the
> > > > > > zone->lowmem_reserve[] issue in zone_watermark_ok here.
> > > > >
> > > > > Sorry, I couldn't understand your point.
> > > > > I mean if the min watermark is too high, you could lower min_free_kbytes.
> > > > > If the reserved pages are too high, you could adjust lowmem_reserve_ratio.
> > > > > Could we solve the problem with those knobs?
> > > > I mean a system with default settings. Changing the knobs might solve the
> > > > problem, but few people know about them, so that's not the right solution.
> > >
> > > We can't cover every case, so we should focus on the common case.
> > > That's why we need knobs.
> > > Do you think your case is a general one we should handle by default?
> > > If you really think so, we have to fix the watermark setting logic rather than kswapd.
> > It's pretty common these days for a system to have a big high zone and a
> > small low zone. Why is the current watermark setting wrong?
>
> You said as follows,
>
> > > > > > > > One problem here is that end_zone could be high, and the low zone never meets
> > > > > > > > its min watermark. So kswapd keeps running without a wait and builds
>
> You should have said not "meet min watermark" but "meet min_watermark + reserved pages[HIGH_ZONE]"
> for clearness. Sorry for misreading your statement. Now I realize your intention.
I should have made it clear, sorry.

> > > > > In addition, kswapd could easily set all_unreclaimable on the zone in your example.
> > > > > Then kswapd should just peek at the zone once in a while if the zone is all_unreclaimable.
> > > > > It should be no problem in terms of CPU overhead.
> > > > Even if the low zone has all_unreclaimable, the high zone will be scanned
> > > > again with low priority (because its high watermark might not be ok, but
> > > > its min watermark is ok), and does not do congestion_wait. This will make a
> > >
> > > As you said, the high zone is below its high watermark, so that's natural since
> > > it's one of kswapd's goals. The reason it doesn't sleep is that one zone of the
> > > zonelists is below min_watermark. So it's natural in the current policy, too.
> > There are two cases where a zone is below min_watermark.
> > 1. the zone is below min_watermark for allocations from the zone itself; in this
> > case we need to hurry up.
> > 2. the zone is below min_watermark for allocations from a high zone; we
> > don't really need to hurry up if another zone is above min_watermark.
> > Since a low zone needs to reserve pages for high zones, the second case
> > could be common.
>
> You mean "lowmem_reserve"?
> It means the opposite. It is a mechanism to defend lowmem pages against high zone
> allocations, because it could be fatal to allow process pages to be allocated
> from a low zone.
> Also, we can set a ratio of reserved pages for each zone.
> How could we make sure lower zones have enough free pages for higher zones?
lowmem_reserve causes the problem, but it's not a fault of lowmem_reserve.
I'm thinking of changing it.

> > Yes, keeping kswapd running in this case can reduce the chance of
> > GFP_ATOMIC failure. But my patch will not cause immediate failure
> > because there are still some zones which are above min_watermark and can
> > meet the GFP_ATOMIC allocation. And keeping kswapd running has some
>
> True. That's why I said "I don't mean you are wrong, but we have to be very careful about this."
> Normally it can handle it, but it might fail on a sudden peak of an atomic allocation stream.
> Recently, we have seen more reports of GFP_ATOMIC allocation failures than before.
> So I would like to be very careful, and that's why I suggest we need at least some experiments.
> Through them, Andrew can make the final call.
Sure.

> > drawbacks:
> > 1. cpu overhead
> > 2. it extends the isolation window size, so it trashes the working set.
> > Considering the DMA zone, we almost always have the DMA zone's min watermark
> > not ok for any allocation from a high zone. So we will always have such
> > drawbacks.
>
> I agree with you that it's a problem.
> I think the real solution is to remove the zone from the allocation fallback list
> in such a case, because the lower zone could never satisfy any allocation for the
> higher zone. But that makes the code rather complicated, as we have to consider
> an admin who can change reserved pages at any time. Not worth the complication.
> So how about this?
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 8913374..f71ed2f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2693,8 +2693,16 @@ loop_again:
>  				 * failure risk. Hurry up!
>  				 */
>  				if (!zone_watermark_ok_safe(zone, order,
> -					    min_wmark_pages(zone), end_zone, 0))
> -					has_under_min_watermark_zone = 1;
> +					    min_wmark_pages(zone), end_zone, 0)) {
> +					/*
> +					 * In case of a big reserve for a higher zone,
> +					 * it is pointless to try reclaiming pages quickly
> +					 * because it could never meet the requirement.
> +					 */
> +					if (zone->present_pages >
> +						min_wmark_pages(zone) + zone->lowmem_reserve[end_zone])
> +						has_under_min_watermark_zone = 1;
> +				}
>  			} else {
>  				/*
>  				 * If a zone reaches its high watermark,
This looks like a workaround just for the DMA zone. present_pages could be
bigger than min_wmark + lowmem_reserve, and we would still suffer from the
issue; for example, a DMA32 zone with some pages allocated for DMA, or a
zone which has some lru pages but is still much smaller than the high zone.

> We could even apply this at the start of the loop so that we avoid unnecessary
> scanning at the beginning. In that case, we have to apply zone->lowmem_reserve[end_zone]
> only, because we have to consider the NO_WATERMARK allocation case.
Yes, we can do this to avoid unnecessary scanning, but the DMA zone has no lru
pages, so I'm not sure how big the benefit is here.

> > Or is something like the below better? We can avoid accounting the big reserved
> > pages on top of min_wmark_pages for the low zone. If the high zone is under its
> > min watermark, kswapd will not sleep.
> >				if (!zone_watermark_ok_safe(zone, order,
> > -					    min_wmark_pages(zone), end_zone, 0))
> > +					    min_wmark_pages(zone), 0, 0))
> >					has_under_min_watermark_zone = 1;
>
> I think it's not a good idea, since the page allocator always considers classzone_idx.
> So although we fix the kswapd issue through your change, the page allocator still
> can't allocate memory and wakes up kswapd again.
Why would kswapd be woken up again? The high zone itself still has
min_wmark + lowmem_reserve ok for the allocation (classzone_idx 0 means
checking the lowmem_reserve for allocations from the zone itself), so the
allocation can be met.

Thanks,
Shaohua
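
For readers trying to follow the lowmem_reserve argument above, here is a minimal
userspace sketch of the semantics being discussed. It is not the kernel code:
toy_zone, toy_watermark_ok and all the numbers are invented for illustration, and
the real __zone_watermark_ok() in mm/page_alloc.c also accounts for alloc flags and
higher-order free lists. The only point it shows is that the watermark test adds
zone->lowmem_reserve[classzone_idx] on top of the min watermark, so a small DMA zone
essentially never looks "ok" against a high end_zone, while passing classzone_idx 0
drops that reserve.

/*
 * Minimal sketch (not kernel code) of the check being debated.
 * Kept to the part that matters here: lowmem_reserve[classzone_idx]
 * is added on top of the watermark, so a tiny low zone can never
 * look "ok" against a large classzone_idx.
 */
#include <stdio.h>
#include <stdbool.h>

#define MAX_NR_ZONES 4		/* e.g. DMA, DMA32, NORMAL, MOVABLE */

struct toy_zone {		/* hypothetical stand-in for struct zone */
	const char *name;
	unsigned long free_pages;
	unsigned long min_wmark;
	unsigned long lowmem_reserve[MAX_NR_ZONES];
};

/* roughly what zone_watermark_ok(zone, 0, mark, classzone_idx, 0) decides */
static bool toy_watermark_ok(const struct toy_zone *z, unsigned long mark,
			     int classzone_idx)
{
	return z->free_pages > mark + z->lowmem_reserve[classzone_idx];
}

int main(void)
{
	/* invented numbers, just shaped like a ~16MB DMA zone */
	struct toy_zone dma = {
		.name = "DMA",
		.free_pages = 3000,
		.min_wmark = 60,
		/* the reserve grows with the classzone it defends against */
		.lowmem_reserve = { 0, 2000, 3900, 3900 },
	};
	int end_zone = 3;	/* kswapd's end_zone is a high zone */

	/* the check kswapd uses today: almost always "under min" for DMA */
	printf("vs end_zone=%d: %s\n", end_zone,
	       toy_watermark_ok(&dma, dma.min_wmark, end_zone) ? "ok" : "under min");

	/* the alternative discussed above: classzone_idx 0 drops the reserve */
	printf("vs classzone_idx=0: %s\n",
	       toy_watermark_ok(&dma, dma.min_wmark, 0) ? "ok" : "under min");
	return 0;
}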