On Wed, May 4, 2011 at 9:56 AM, Dave Young <hidave.darkstar@xxxxxxxxx> wrote:
> On Thu, Apr 28, 2011 at 9:36 PM, Wu Fengguang <fengguang.wu@xxxxxxxxx> wrote:
>> Concurrent page allocations are suffering from high failure rates.
>>
>> On an 8p, 3GB ram test box, when reading 1000 sparse files of size 1GB,
>> the page allocation failures are
>>
>> nr_alloc_fail 733       # interleaved reads by a single task
>> nr_alloc_fail 11799     # concurrent reads by 1000 tasks
>>
>> The concurrent read test script is:
>>
>>     for i in `seq 1000`
>>     do
>>         truncate -s 1G /fs/sparse-$i
>>         dd if=/fs/sparse-$i of=/dev/null &
>>     done
>>
>
> With Core2 Duo, 3G ram and no swap partition I can not reproduce the alloc fail.

Unsetting CONFIG_SCHED_AUTOGROUP and CONFIG_CGROUP_SCHED seems to affect the test results; now I see several nr_alloc_fail (dd is not finished yet):

dave@darkstar-32:$ grep fail /proc/vmstat
nr_alloc_fail 4
compact_pagemigrate_failed 0
compact_fail 3
htlb_buddy_alloc_fail 0
thp_collapse_alloc_fail 4

So the result is related to the cpu scheduler.

>
>> In order for get_page_from_freelist() to get free pages,
>>
>> (1) try_to_free_pages() should use a much higher .nr_to_reclaim than the
>>     current SWAP_CLUSTER_MAX=32, in order to draw the zone out of the
>>     possible low watermark state as well as fill the pcp with enough free
>>     pages to overflow its high watermark.
>>
>> (2) the get_page_from_freelist() _after_ direct reclaim should use a lower
>>     watermark than its normal invocations, so that it can reasonably
>>     "reserve" some free pages for itself and prevent other concurrent
>>     page allocators from stealing all its reclaimed pages.
>>
>> Some notes:
>>
>> - commit 9ee493ce ("mm: page allocator: drain per-cpu lists after direct
>>   reclaim allocation fails") has the same target, however it is obviously
>>   costly and less effective. It seems cleaner to just remove the retry
>>   and drain code than to retain it.
>>
>> - it's a bit hacky to reclaim more than the requested pages inside
>>   do_try_to_free_pages(), and it won't help cgroup for now
>>
>> - it only aims to reduce failures when there are plenty of reclaimable
>>   pages, so it stops the opportunistic reclaim after scanning twice the
>>   requested number of pages
>>
>> Test results:
>>
>> - the failure rate is pretty sensitive to the page reclaim size,
>>   from 282 (WMARK_HIGH) to 704 (WMARK_MIN) to 10496 (SWAP_CLUSTER_MAX)
>>
>> - the IPIs are reduced by over 100 times
>>
>> base kernel: vanilla 2.6.39-rc3 + __GFP_NORETRY readahead page allocation patch
>> -------------------------------------------------------------------------------
>> nr_alloc_fail 10496
>> allocstall 1576602
>>
>> slabs_scanned 21632
>> kswapd_steal 4393382
>> kswapd_inodesteal 124
>> kswapd_low_wmark_hit_quickly 885
>> kswapd_high_wmark_hit_quickly 2321
>> kswapd_skip_congestion_wait 0
>> pageoutrun 29426
>>
>> CAL:     220449   220246   220372   220558   220251   219740   220043   219968   Function call interrupts
>>
>> LOC:     536274   532529   531734   536801   536510   533676   534853   532038   Local timer interrupts
>> RES:       3032     2128     1792     1765     2184     1703     1754     1865   Rescheduling interrupts
>> TLB:        189       15       13       17       64      294       97       63   TLB shootdowns
>
> Could you tell me how to get the above info?
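
For reference: counters like nr_alloc_fail, allocstall and the kswapd_* numbers above are read from /proc/vmstat, and the CAL/LOC/RES/TLB rows match the per-CPU counter format of /proc/interrupts. A minimal sketch of capturing before/after snapshots around the test run (the snapshot file names are only illustrative):

    # snapshot the relevant counters before starting the workload
    grep -E 'fail|allocstall|slabs_scanned|kswapd|pageoutrun' /proc/vmstat > vmstat.before
    grep -E 'CAL|LOC|RES|TLB' /proc/interrupts > irq.before

    # ... run the concurrent dd test here ...

    # snapshot again afterwards and compare the deltas
    grep -E 'fail|allocstall|slabs_scanned|kswapd|pageoutrun' /proc/vmstat > vmstat.after
    grep -E 'CAL|LOC|RES|TLB' /proc/interrupts > irq.after
    diff vmstat.before vmstat.after
    diff irq.before irq.after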
>
>>
>> patched (WMARK_MIN)
>> -------------------
>> nr_alloc_fail 704
>> allocstall 105551
>>
>> slabs_scanned 33280
>> kswapd_steal 4525537
>> kswapd_inodesteal 187
>> kswapd_low_wmark_hit_quickly 4980
>> kswapd_high_wmark_hit_quickly 2573
>> kswapd_skip_congestion_wait 0
>> pageoutrun 35429
>>
>> CAL:         93      286      396      754      272      297      275      281   Function call interrupts
>>
>> LOC:     520550   517751   517043   522016   520302   518479   519329   517179   Local timer interrupts
>> RES:       2131     1371     1376     1269     1390     1181     1409     1280   Rescheduling interrupts
>> TLB:        280       26       27       30       65      305      134       75   TLB shootdowns
>>
>> patched (WMARK_HIGH)
>> --------------------
>> nr_alloc_fail 282
>> allocstall 53860
>>
>> slabs_scanned 23936
>> kswapd_steal 4561178
>> kswapd_inodesteal 0
>> kswapd_low_wmark_hit_quickly 2760
>> kswapd_high_wmark_hit_quickly 1748
>> kswapd_skip_congestion_wait 0
>> pageoutrun 32639
>>
>> CAL:         93      463      410      540      298      282      272      306   Function call interrupts
>>
>> LOC:     513956   510749   509890   514897   514300   512392   512825   510574   Local timer interrupts
>> RES:       1174     2081     1411     1320     1742     2683     1380     1230   Rescheduling interrupts
>> TLB:        274       21       19       22       57      317      131       61   TLB shootdowns
>>
>> this patch (WMARK_HIGH, limited scan)
>> -------------------------------------
>> nr_alloc_fail 276
>> allocstall 54034
>>
>> slabs_scanned 24320
>> kswapd_steal 4507482
>> kswapd_inodesteal 262
>> kswapd_low_wmark_hit_quickly 2638
>> kswapd_high_wmark_hit_quickly 1710
>> kswapd_skip_congestion_wait 0
>> pageoutrun 32182
>>
>> CAL:         69      443      421      567      273      279      269      334   Function call interrupts
>>
>> LOC:     514736   511698   510993   514069   514185   512986   513838   511229   Local timer interrupts
>> RES:       2153     1556     1126     1351     3047     1554     1131     1560   Rescheduling interrupts
>> TLB:        209       26       20       15       71      315      117       71   TLB shootdowns
>>
>> CC: Mel Gorman <mel@xxxxxxxxxxxxxxxxxx>
>> Signed-off-by: Wu Fengguang <fengguang.wu@xxxxxxxxx>
>> ---
>>  mm/page_alloc.c |   17 +++--------------
>>  mm/vmscan.c     |    6 ++++++
>>  2 files changed, 9 insertions(+), 14 deletions(-)
>>
>> --- linux-next.orig/mm/vmscan.c	2011-04-28 21:16:16.000000000 +0800
>> +++ linux-next/mm/vmscan.c	2011-04-28 21:28:57.000000000 +0800
>> @@ -1978,6 +1978,8 @@ static void shrink_zones(int priority, s
>>  				continue;
>>  			if (zone->all_unreclaimable && priority != DEF_PRIORITY)
>>  				continue;	/* Let kswapd poll it */
>> +			sc->nr_to_reclaim = max(sc->nr_to_reclaim,
>> +						zone->watermark[WMARK_HIGH]);
>>  		}
>>
>>  		shrink_zone(priority, zone, sc);
>> @@ -2034,6 +2036,7 @@ static unsigned long do_try_to_free_page
>>  	struct zoneref *z;
>>  	struct zone *zone;
>>  	unsigned long writeback_threshold;
>> +	unsigned long min_reclaim = sc->nr_to_reclaim;
>>
>>  	get_mems_allowed();
>>  	delayacct_freepages_start();
>> @@ -2067,6 +2070,9 @@ static unsigned long do_try_to_free_page
>>  			}
>>  		}
>>  		total_scanned += sc->nr_scanned;
>> +		if (sc->nr_reclaimed >= min_reclaim &&
>> +		    total_scanned > 2 * sc->nr_to_reclaim)
>> +			goto out;
>>  		if (sc->nr_reclaimed >= sc->nr_to_reclaim)
>>  			goto out;
>>
>> --- linux-next.orig/mm/page_alloc.c	2011-04-28 21:16:16.000000000 +0800
>> +++ linux-next/mm/page_alloc.c	2011-04-28 21:16:18.000000000 +0800
>> @@ -1888,9 +1888,8 @@ __alloc_pages_direct_reclaim(gfp_t gfp_m
>>  	nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
>>  	int migratetype, unsigned long *did_some_progress)
>>  {
>> -	struct page *page = NULL;
>> +	struct page *page;
>>  	struct reclaim_state reclaim_state;
>> -	bool drained = false;
>>
>>  	cond_resched();
>>
>> @@ -1912,22 +1911,12 @@ __alloc_pages_direct_reclaim(gfp_t gfp_m
>>  	if (unlikely(!(*did_some_progress)))
>>  		return NULL;
>>
>> -retry:
>> +	alloc_flags |= ALLOC_HARDER;
>> +
>>  	page = get_page_from_freelist(gfp_mask, nodemask, order,
>>  					zonelist, high_zoneidx,
>>  					alloc_flags, preferred_zone,
>>  					migratetype);
>> -
>> -	/*
>> -	 * If an allocation failed after direct reclaim, it could be because
>> -	 * pages are pinned on the per-cpu lists. Drain them and try again
>> -	 */
>> -	if (!page && !drained) {
>> -		drain_all_pages();
>> -		drained = true;
>> -		goto retry;
>> -	}
>> -
>>  	return page;
>>  }
>>
>>
>
> --
> Regards
> dave
>

--
Regards
dave

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxxx  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/