On Tue, Oct 18, 2011 at 10:13:52AM +0800, Shaohua Li wrote: > On Wed, 2011-10-12 at 15:59 +0800, Minchan Kim wrote: > > On Wed, Oct 12, 2011 at 10:48:59AM +0800, Shaohua Li wrote: > > > > there are two cases one zone is below min_watermark. > > > > > 1. the zone is below min_watermark for allocation in the zone. in this > > > > > case we need hurry up. > > > > > 2. the zone is below min_watermark for allocation from high zone. we > > > > > don't really need hurry up if other zone is above min_watermark. > > > > > Since low zone need to reserve pages for high zone, the second case > > > > > could be common. > > > > > > > > You mean "lowmem_reserve"? > > > > It means opposite. It is a mechanism to defend using of lowmem pages from high zone allocation > > > > because it could be fatal to allow process pages to be allocated from low zone. > > > > Also, We could set each ratio for reserved pages of zones. > > > > How could we make sure lower zones have enough free pages for higher zone? > > > lowmem_reserve causes the problem, but it's not a fault of > > > lowmem_reserve. I'm thinking changing it. > > > > > > > > Yes, keeping kswapd running in this case can reduce the chance > > > > > GFP_ATOMIC failure. But my patch will not cause immediate failure > > > > > because there is still some zones which are above min_watermark and can > > > > > meet the GFP_ATOMIC allocation. And keeping kswapd running has some > > > > > > > > True. It was why I said "I don't mean you are wrong but we are very careful about this." > > > > Normally, it could handle but might fail on sudden peak of atomic allocation stream. > > > > Recently, we have suffered from many reporting of GFP_AOTMIC allocation than olded. > > > > So I would like to be very careful and that's why I suggest we need at least some experiment. > > > > Through it, Andrew could make final call. > > > sure. > > > > > > > > drawbacks: > > > > > 1. cpu overhead > > > > > 2. extend isolate window size, so trash working set. > > > > > Considering DMA zone, we almost always have DMA zone min_watermark not > > > > > ok for any allocation from high zone. So we will always have such > > > > > drawbacks. > > > > > > > > I agree with you in that it's a problem. > > > > I think the real solution is to remove the zone from allocation fallback list in such case > > > > because the lower zone could never meet any allocation for the higher zone. > > > > But it makes code rather complicated as we have to consider admin who can change > > > > reserved pages anytime. > > > Not worthy the complication. > > > > > > > So how about this? > > > > > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > > > index 8913374..f71ed2f 100644 > > > > --- a/mm/vmscan.c > > > > +++ b/mm/vmscan.c > > > > @@ -2693,8 +2693,16 @@ loop_again: > > > > * failure risk. Hurry up! > > > > */ > > > > if (!zone_watermark_ok_safe(zone, order, > > > > - min_wmark_pages(zone), end_zone, 0)) > > > > - has_under_min_watermark_zone = 1; > > > > + min_wmark_pages(zone), end_zone, 0)) { > > > > + /* > > > > + * In case of big reserved page for higher zone, > > > > + * it is pointless to try reclaimaing pages quickly > > > > + * because it could never meet the requirement. > > > > + */ > > > > + if (zone->present_pages > > > > > + min_wmark_pages(zone) + zone->lowmem_reserve[end_zone]) > > > > + has_under_min_watermark_zone = 1; > > > > + } > > > > } else { > > > > /* > > > > * If a zone reaches its high watermark, > > > This looks like a workaround just for DMA zone. present_pages could be > > > bigger than min_mwark+lowmem_reserve. And We still suffered from the > > > issue, for example, a DMA32 zone with some pages allocated for DMA, or a > > > zone which has some lru pages, but still much smaller than high zone. > > > > Right. I thought about it but couldn't have a good idea for it. :( > > > > > > > > > Even, we could apply this at starting of the loop so that we can avoid unnecessary scanning st the beginning. > > > > In that case, we have to apply zone->lowmem_reserve[end_zone] only because we have to consider NO_WATERMARK alloc case. > > > yes, we can do this to avoid unnecessary scan. but DMA zone hasn't lru > > > pages, so not sure how big the benefit is here. > > > > At least, we can prevent has_under_min_watermark_zone from set. > > But it still have a problem you pointed out earlier. > > > > > > > > > > Or is something below better? we can avoid the big reserved pages > > > > > accounting to the min_wmark_pages for low zone. if high zone is under > > > > > min_wmark, kswapd will not sleep. > > > > > if (!zone_watermark_ok_safe(zone, order, > > > > > - min_wmark_pages(zone), end_zone, 0)) > > > > > + min_wmark_pages(zone), 0, 0)) > > > > > has_under_min_watermark_zone = 1; > > > > > > > > I think it's not a good idea since page allocator always considers classzone_idx. > > > > So although we fix kswapd issue through your changes, page allocator still can't allocate memory > > > > and wakes up kswapd, again. > > > why kswapd will be waked up again? The high zone itself still has > > > min_wark+low_reserve ok for the allocation(classzone_idx 0 means > > > checking the low_reserve for allocation from the zone itself), so the > > > allocation can be met. > > > > You're absolutely right. > > I got confused. Sorry about that. > > > > I like this than your old version. > > That's because it could rush if one of zonelist is consumed as below min_watermak. > > It could mitigate GFP_ALLOC fail than yours old version but still would be higher than now. > > So, we need the number. > > > > Could you repost this as formal patch with good comment and number? > > Personally, I like description based on scenario with kind step-by-step. > > Feel free to use my description in my patch if you want. > > > > Thanks for patient discussion in spite of my irregular reply, Shaohua. > Thanks for your time. Here is patch. I'm trying to get some number, but > didn't find a proper workload to demonstrate the atomic allocation > failure. Any suggestion for the workload? Sorry for really really late response. Hmm.. I didn't have a such test. But it seems recently we had a such report, again. https://lkml.org/lkml/2011/10/18/131 If you want it, you could ask him. > > Thanks, > Shaohua > > > Subject: vmscan: correctly detect GFP_ATOMIC allocation failure -v3 > > has_under_min_watermark_zone is used to detect if there is GFP_ATOMIC allocation > failure risk. Current logic is if any zone has min watermark not ok, we have > risk. > > Low zone needs reserve memory to avoid fallback from high zone. The reserved > memory is zone->lowmem_reserve[]. If high zone is big, low zone's > min_wmark + lowmem_reserve[] usually is big. Sometimes min_wmark + > lowmem_reserve[] could even be higher than zone->present_pages. An example is > DMA zone. Other low zones could have the similar high reserved memory though > might still have margins between reserved pages and present pages. So in kswapd > loop, if end_zone is a high zone, has_under_min_watermark_zone could be easily > set or always set for DMA. > > Let's consider end_zone is a high zone and it has high_mwark not ok, but > min_mwark ok. A DMA zone always has present_pages less than reserved pages, so > has_under_min_watermark_zone is always set. When kswapd is running, there are > some drawbacks: > 1. kswapd can keep unnecessary running without congestion_wait. high zone > already can meet GFP_ATOMIC. The running will waste some CPU. > 2. kswapd can scan much more pages to trash working set. congestion_wait can > slow down scan if kswapd has trouble. Now congestion_wait is skipped, kswapd > will keep scanning unnecessary pages. > > So since DMA zone always set has_under_min_watermark_zone, current logic actually > equals to that kswapd keeps running without congestion_wait till high zone has > high wmark ok when it has trouble. This is not intended. > > In this path, we test the min_mwark against the zone itself. This doesn't change > the behavior of high zone. For low zone, we now exclude lowmem_reserve for high > zone to avoid unnecessary running. > > Note: With this patch, we could have potential higher risk of GFP_ATOMIC failure. > > v3: Uses a less intrusive method to determine if has_under_min_watermark_zone > should be set after discussion with Minchan. > v2: use bool and clear has_under_min_watermark_zone for zone with watermark ok > as suggested by David Rientjes. > > Signed-off-by: Shaohua Li <shaohua.li@xxxxxxxxx> Reviewed-by: Minchan Kim <minchan.kim@xxxxxxxxx> As you know, I give my reviewed-by as a token of "code looks good to me". But still, we see atomic allocation failure messasge recently. So you need to get a acked-by as a toekn of "I support this idea" of others(ie, Mel/Rik are right person). I hope it would be better to write down it in description that it's real problem of your system and the patch solves it. Thanks! > --- > mm/vmscan.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > Index: linux/mm/vmscan.c > =================================================================== > --- linux.orig/mm/vmscan.c 2011-10-17 16:02:30.000000000 +0800 > +++ linux/mm/vmscan.c 2011-10-18 09:32:23.000000000 +0800 > @@ -2463,7 +2463,7 @@ loop_again: > > for (priority = DEF_PRIORITY; priority >= 0; priority--) { > unsigned long lru_pages = 0; > - int has_under_min_watermark_zone = 0; > + bool has_under_min_watermark_zone = false; > > /* The swap token gets in the way of swapout... */ > if (!priority) > @@ -2593,8 +2593,8 @@ loop_again: > * failure risk. Hurry up! > */ > if (!zone_watermark_ok_safe(zone, order, > - min_wmark_pages(zone), end_zone, 0)) > - has_under_min_watermark_zone = 1; > + min_wmark_pages(zone), 0, 0)) > + has_under_min_watermark_zone = true; > } else { > /* > * If a zone reaches its high watermark, > > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@xxxxxxxxx. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ > Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a> -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>