Re: [patch v2]vmscan: correctly detect GFP_ATOMIC allocation failure

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Oct 11, 2011 at 01:30:10PM +0800, Shaohua Li wrote:
> On Mon, 2011-10-10 at 23:42 +0800, Minchan Kim wrote:
> > On Mon, Oct 10, 2011 at 03:28:13PM +0800, Shaohua Li wrote:
> > > On Sun, 2011-10-09 at 23:10 +0800, Minchan Kim wrote:
> > > > On Sun, Oct 09, 2011 at 04:17:51PM +0800, Shaohua Li wrote:
> > > > > On Sun, 2011-10-09 at 16:01 +0800, Minchan Kim wrote:
> > > > > > On Sun, Oct 09, 2011 at 01:53:11PM +0800, Shaohua Li wrote:
> > > > > > > On Sat, 2011-10-08 at 18:25 +0800, Minchan Kim wrote:
> > > > > > > > On Sat, Oct 08, 2011 at 01:56:52PM +0800, Shaohua Li wrote:
> > > > > > > > > On Sat, 2011-10-08 at 11:35 +0800, Shaohua Li wrote:
> > > > > > > > > > On Sat, 2011-10-08 at 11:19 +0800, David Rientjes wrote:
> > > > > > > > > > > On Sat, 8 Oct 2011, Shaohua Li wrote:
> > > > > > > > > > > 
> > > > > > > > > > > > has_under_min_watermark_zone is used to detect if there is GFP_ATOMIC allocation
> > > > > > > > > > > > failure risk. For a high end_zone, if any zone below or equal to it has min
> > > > > > > > > > > > matermark ok, we have no risk. But current logic is any zone has min watermark
> > > > > > > > > > > > not ok, then we have risk. This is wrong to me.
> > > > > > > > > > > > 
> > > > > > > > > > > > Signed-off-by: Shaohua Li <shaohua.li@xxxxxxxxx>
> > > > > > > > > > > > ---
> > > > > > > > > > > >  mm/vmscan.c |    7 ++++---
> > > > > > > > > > > >  1 file changed, 4 insertions(+), 3 deletions(-)
> > > > > > > > > > > > 
> > > > > > > > > > > > Index: linux/mm/vmscan.c
> > > > > > > > > > > > ===================================================================
> > > > > > > > > > > > --- linux.orig/mm/vmscan.c	2011-09-27 15:09:29.000000000 +0800
> > > > > > > > > > > > +++ linux/mm/vmscan.c	2011-09-27 15:14:45.000000000 +0800
> > > > > > > > > > > > @@ -2463,7 +2463,7 @@ loop_again:
> > > > > > > > > > > >  
> > > > > > > > > > > >  	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
> > > > > > > > > > > >  		unsigned long lru_pages = 0;
> > > > > > > > > > > > -		int has_under_min_watermark_zone = 0;
> > > > > > > > > > > > +		int has_under_min_watermark_zone = 1;
> > > > > > > > > > > 
> > > > > > > > > > > bool
> > > > > > > > > > > 
> > > > > > > > > > > >  
> > > > > > > > > > > >  		/* The swap token gets in the way of swapout... */
> > > > > > > > > > > >  		if (!priority)
> > > > > > > > > > > > @@ -2594,9 +2594,10 @@ loop_again:
> > > > > > > > > > > >  				 * means that we have a GFP_ATOMIC allocation
> > > > > > > > > > > >  				 * failure risk. Hurry up!
> > > > > > > > > > > >  				 */
> > > > > > > > > > > > -				if (!zone_watermark_ok_safe(zone, order,
> > > > > > > > > > > > +				if (has_under_min_watermark_zone &&
> > > > > > > > > > > > +					    zone_watermark_ok_safe(zone, order,
> > > > > > > > > > > >  					    min_wmark_pages(zone), end_zone, 0))
> > > > > > > > > > > > -					has_under_min_watermark_zone = 1;
> > > > > > > > > > > > +					has_under_min_watermark_zone = 0;
> > > > > > > > > > > >  			} else {
> > > > > > > > > > > >  				/*
> > > > > > > > > > > >  				 * If a zone reaches its high watermark,
> > > > > > > > > > > 
> > > > > > > > > > > Ignore checking the min watermark for a moment and consider if all zones 
> > > > > > > > > > > are above the high watermark (a situation where kswapd does not need to 
> > > > > > > > > > > do aggressive reclaim), then has_under_min_watermark_zone doesn't get 
> > > > > > > > > > > cleared and never actually stalls on congestion_wait().  Notice this is 
> > > > > > > > > > > congestion_wait() and not wait_iff_congested(), so the clearing of 
> > > > > > > > > > > ZONE_CONGESTED doesn't prevent this.
> > > > > > > > > > if all zones are above the high watermark, we will have i < 0 when
> > > > > > > > > > detecting the highest imbalanced zone, and the whole loop will end
> > > > > > > > > > without run into congestion_wait().
> > > > > > > > > > or I can add a clearing has_under_min_watermark_zone in the else block
> > > > > > > > > > to be safe.
> > > > > > > > > Subject: vmscan: correctly detect GFP_ATOMIC allocation failure -v2
> > > > > > > > > 
> > > > > > > > > has_under_min_watermark_zone is used to detect if there is GFP_ATOMIC allocation
> > > > > > > > > failure risk. For a high end_zone, if any zone below or equal to it has min
> > > > > > > > > matermark ok, we have no risk. But current logic is any zone has min watermark
> > > > > > > > > not ok, then we have risk. This is wrong to me.
> > > > > > > > 
> > > > > > > > I think it's not a right or wrong problem but a policy stuff.
> > > > > > > > If we are going to start busy reclaiming for atomic allocation
> > > > > > > > after we see all lower zones' min water mark pages are already consumed
> > > > > > > > It could make you go through long latency and is likely to fail atomic allocation
> > > > > > > > stream(Because, there is nothing to do for aotmic allocation fail in direct reclaim
> > > > > > > > so kswapd should always do best effort for it)
> > > > > > > > 
> > > > > > > > I don't mean you are wrong but we are very careful about this
> > > > > > > > and at least need some experiments with atomic allocaion stream, I think.
> > > > > > > yes. this is a policy problem. I just don't want the kswapd keep running
> > > > > > > even there is no immediate risk of atomic allocation fail.
> > > > > > > One problem here is end_zone could be high, and low zone always doesn't
> > > > > > > meet min watermark. So kswapd keeps running without a wait and builds
> > > > > > > big priority.
> > > > > > 
> > > > > > It could be but I think it's a mistake of admin if he handles such rare system.
> > > > > > Couldn't he lower the reserved pages for highmem?
> > > > > not because admin changes reserved pages. we still have the
> > > > > zone->lowmem_reserve[] issue for zone_watermark_ok here.
> > > > 
> > > > Sorry I couldn't understand your point.
> > > > I mean if min watermark is too high, you could lower min_free_kbytes.
> > > > If reserved pages is too high, you could handle lowmem_reserve_ratio.
> > > > Could we solve the problem with those knobs?
> > > I mean a system with default setting. Changing the knobs might solve the
> > > problem, but few people know it, so not a right solution.
> > 
> > We couldn't cover whole cases so that we should focus on common case.
> > That's why we need knobs.
> > Do you think your case is general one we should handle it by default?
> > If you really think, we have to fix watermark setting logic rathe than kswapd.
> it's pretty common a system has big high zone and small low zone these
> days. Why current watermark setting is wrong?

You said as follows,

> > > > > > > One problem here is end_zone could be high, and low zone always doesn't
> > > > > > > meet min watermark. So kswapd keeps running without a wait and builds

You should have said not "meet min watermark" but "meet min_watermak + reserved pages[HIGH_ZONE]"
for the clearness. Sorry for misleading your statement. Now I got realized your intention.

>  
> > > > In addition, kswapd could easily set all_unreclaimable of the zone in your example.
> > > > Then, kswapd should just peek the zone once in a while if the zone is all_unreclaimable.
> > > > It should be no problem in CPU overhead.
> > > even the low zone has all_unreclaimable, the high zone will be scanned
> > > again with low priority (because its high watermark might not be ok, but
> > > min watermark is ok), and does not do congestion_wait. This will make a
> > 
> > As you said, high zone is below high watermark so it's natural since it's
> > one of kswapd goal. Why it doesn't sleep is that one zone of zonelists is
> > below min_watmark. So it's natural in current policy, too.
> there are two cases one zone is below min_watermark.
> 1. the zone is below min_watermark for allocation in the zone. in this
> case we need hurry up.
> 2. the zone is below min_watermark for allocation from high zone. we
> don't really need hurry up if other zone is above min_watermark.
> Since low zone need to reserve pages for high zone, the second case
> could be common.

You mean "lowmem_reserve"?
It means opposite. It is a mechanism to defend using of lowmem pages from high zone allocation
because it could be fatal to allow process pages to be allocated from low zone.
Also, We could set each ratio for reserved pages of zones.
How could we make sure lower zones have enough free pages for higher zone?

> Yes, keeping kswapd running in this case can reduce the chance
> GFP_ATOMIC failure. But my patch will not cause immediate failure
> because there is still some zones which are above min_watermark and can
> meet the GFP_ATOMIC allocation. And keeping kswapd running has some

True. It was why I said "I don't mean you are wrong but we are very careful about this."
Normally, it could handle but might fail on sudden peak of atomic allocation stream.
Recently, we have suffered from many reporting of GFP_AOTMIC allocation than olded.
So I would like to be very careful and that's why I suggest we need at least some experiment.
Through it, Andrew could make final call.

> drawbacks:
> 1. cpu overhead
> 2. extend isolate window size, so trash working set.
> Considering DMA zone, we almost always have DMA zone min_watermark not
> ok for any allocation from high zone. So we will always have such
> drawbacks.

I agree with you in that it's a problem.
I think the real solution is to remove the zone from allocation fallback list in such case
because the lower zone could never meet any allocation for the higher zone.
But it makes code rather complicated as we have to consider admin who can change
reserved pages anytime.
So how about this?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8913374..f71ed2f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2693,8 +2693,16 @@ loop_again:
                                 * failure risk. Hurry up!
                                 */
                                if (!zone_watermark_ok_safe(zone, order,
-                                           min_wmark_pages(zone), end_zone, 0))
-                                       has_under_min_watermark_zone = 1;
+                                           min_wmark_pages(zone), end_zone, 0)) {
+                                       /*
+                                        * In case of big reserved page for higher zone,
+                                        * it is pointless to try reclaimaing pages quickly
+                                        * because it could never meet the requirement.
+                                        */
+                                       if (zone->present_pages >
+                                               min_wmark_pages(zone) + zone->lowmem_reserve[end_zone])
+                                               has_under_min_watermark_zone = 1;
+                               }
                        } else {
                                /*
                                 * If a zone reaches its high watermark,

Even, we could apply this at starting of the loop so that we can avoid unnecessary scanning st the beginning.
In that case, we have to apply zone->lowmem_reserve[end_zone] only because we have to consider NO_WATERMARK alloc case.

> 
> Or is something below better? we can avoid the big reserved pages
> accounting to the min_wmark_pages for low zone. if high zone is under
> min_wmark, kswapd will not sleep.
>                                if (!zone_watermark_ok_safe(zone, order,
> -                                            min_wmark_pages(zone), end_zone, 0))
> +                                            min_wmark_pages(zone), 0, 0))
>                                         has_under_min_watermark_zone = 1;

I think it's not a good idea since page allocator always considers classzone_idx.
So although we fix kswapd issue through your changes, page allocator still can't allocate memory
and wakes up kswapd, again.

> 
> Thanks,
> Shaohua
> 

-- 
Kinds regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>


[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux]     [Linux OMAP]     [Linux MIPS]     [ECOS]     [Asterisk Internet PBX]     [Linux API]