On Thu, Apr 28, 2011 at 9:36 PM, Wu Fengguang <fengguang.wu@xxxxxxxxx> wrote:
> Concurrent page allocations are suffering from high failure rates.
>
> On a 8p, 3GB ram test box, when reading 1000 sparse files of size 1GB,
> the page allocation failures are
>
> nr_alloc_fail 733       # interleaved reads by 1 single task
> nr_alloc_fail 11799     # concurrent reads by 1000 tasks
>
> The concurrent read test script is:
>
>     for i in `seq 1000`
>     do
>         truncate -s 1G /fs/sparse-$i
>         dd if=/fs/sparse-$i of=/dev/null &
>     done
>

With a Core2 Duo, 3 GB of RAM and no swap partition, I cannot reproduce the alloc failures.

> In order for get_page_from_freelist() to get a free page,
>
> (1) try_to_free_pages() should use a much higher .nr_to_reclaim than the
>     current SWAP_CLUSTER_MAX=32, in order to draw the zone out of the
>     possible low watermark state as well as fill the pcp with enough free
>     pages to overflow its high watermark.
>
> (2) the get_page_from_freelist() _after_ direct reclaim should use a lower
>     watermark than its normal invocations, so that it can reasonably
>     "reserve" some free pages for itself and prevent other concurrent
>     page allocators stealing all its reclaimed pages.
>
> Some notes:
>
> - commit 9ee493ce ("mm: page allocator: drain per-cpu lists after direct
>   reclaim allocation fails") has the same target, however it is obviously
>   costly and less effective. It seems cleaner to just remove the
>   retry and drain code than to retain it.
>
> - it's a bit hacky to reclaim more than the requested pages inside
>   do_try_to_free_pages(), and it won't help cgroup for now
>
> - it only aims to reduce failures when there are plenty of reclaimable
>   pages, so it stops the opportunistic reclaim once 2 times the requested
>   pages have been scanned
>
> Test results:
>
> - the failure rate is pretty sensitive to the page reclaim size,
>   from 282 (WMARK_HIGH) to 704 (WMARK_MIN) to 10496 (SWAP_CLUSTER_MAX)
>
> - the IPIs are reduced by over 100 times
>
> base kernel: vanilla 2.6.39-rc3 + __GFP_NORETRY readahead page allocation patch
> -------------------------------------------------------------------------------
> nr_alloc_fail 10496
> allocstall 1576602
>
> slabs_scanned 21632
> kswapd_steal 4393382
> kswapd_inodesteal 124
> kswapd_low_wmark_hit_quickly 885
> kswapd_high_wmark_hit_quickly 2321
> kswapd_skip_congestion_wait 0
> pageoutrun 29426
>
> CAL:   220449   220246   220372   220558   220251   219740   220043   219968   Function call interrupts
> LOC:   536274   532529   531734   536801   536510   533676   534853   532038   Local timer interrupts
> RES:     3032     2128     1792     1765     2184     1703     1754     1865   Rescheduling interrupts
> TLB:      189       15       13       17       64      294       97       63   TLB shootdowns

Could you tell me how you collected the above numbers?
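My guess is that nr_alloc_fail, allocstall, slabs_scanned, the kswapd_* counters and
pageoutrun are read from /proc/vmstat, and that the CAL/LOC/RES/TLB rows come from
/proc/interrupts, snapshotted before and after the test run (with nr_alloc_fail
presumably only present once the rest of this patch series is applied). A rough
collection sketch of what I have in mind, where the file names and grep patterns
are just for illustration:

    #!/bin/sh
    # Snapshot the relevant counters, run the concurrent-read test, then compare.
    snap() {
        grep -E 'nr_alloc_fail|allocstall|slabs_scanned|kswapd_|pageoutrun' \
            /proc/vmstat > vmstat.$1
        grep -E '(CAL|LOC|RES|TLB):' /proc/interrupts > interrupts.$1
    }

    snap before

    for i in `seq 1000`
    do
        truncate -s 1G /fs/sparse-$i
        dd if=/fs/sparse-$i of=/dev/null &
    done
    wait

    snap after

    # The figures quoted in the changelog would then be the after/before deltas.
    diff -u vmstat.before vmstat.after
    diff -u interrupts.before interrupts.after

Is that roughly what you did, or is there a dedicated script?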
>
> patched (WMARK_MIN)
> -------------------
> nr_alloc_fail 704
> allocstall 105551
>
> slabs_scanned 33280
> kswapd_steal 4525537
> kswapd_inodesteal 187
> kswapd_low_wmark_hit_quickly 4980
> kswapd_high_wmark_hit_quickly 2573
> kswapd_skip_congestion_wait 0
> pageoutrun 35429
>
> CAL:       93      286      396      754      272      297      275      281   Function call interrupts
> LOC:   520550   517751   517043   522016   520302   518479   519329   517179   Local timer interrupts
> RES:     2131     1371     1376     1269     1390     1181     1409     1280   Rescheduling interrupts
> TLB:      280       26       27       30       65      305      134       75   TLB shootdowns
>
> patched (WMARK_HIGH)
> --------------------
> nr_alloc_fail 282
> allocstall 53860
>
> slabs_scanned 23936
> kswapd_steal 4561178
> kswapd_inodesteal 0
> kswapd_low_wmark_hit_quickly 2760
> kswapd_high_wmark_hit_quickly 1748
> kswapd_skip_congestion_wait 0
> pageoutrun 32639
>
> CAL:       93      463      410      540      298      282      272      306   Function call interrupts
> LOC:   513956   510749   509890   514897   514300   512392   512825   510574   Local timer interrupts
> RES:     1174     2081     1411     1320     1742     2683     1380     1230   Rescheduling interrupts
> TLB:      274       21       19       22       57      317      131       61   TLB shootdowns
>
> this patch (WMARK_HIGH, limited scan)
> -------------------------------------
> nr_alloc_fail 276
> allocstall 54034
>
> slabs_scanned 24320
> kswapd_steal 4507482
> kswapd_inodesteal 262
> kswapd_low_wmark_hit_quickly 2638
> kswapd_high_wmark_hit_quickly 1710
> kswapd_skip_congestion_wait 0
> pageoutrun 32182
>
> CAL:       69      443      421      567      273      279      269      334   Function call interrupts
> LOC:   514736   511698   510993   514069   514185   512986   513838   511229   Local timer interrupts
> RES:     2153     1556     1126     1351     3047     1554     1131     1560   Rescheduling interrupts
> TLB:      209       26       20       15       71      315      117       71   TLB shootdowns
>
> CC: Mel Gorman <mel@xxxxxxxxxxxxxxxxxx>
> Signed-off-by: Wu Fengguang <fengguang.wu@xxxxxxxxx>
> ---
>  mm/page_alloc.c |   17 +++--------------
>  mm/vmscan.c     |    6 ++++++
>  2 files changed, 9 insertions(+), 14 deletions(-)
> --- linux-next.orig/mm/vmscan.c        2011-04-28 21:16:16.000000000 +0800
> +++ linux-next/mm/vmscan.c     2011-04-28 21:28:57.000000000 +0800
> @@ -1978,6 +1978,8 @@ static void shrink_zones(int priority, s
>                                continue;
>                        if (zone->all_unreclaimable && priority != DEF_PRIORITY)
>                                continue;       /* Let kswapd poll it */
> +                      sc->nr_to_reclaim = max(sc->nr_to_reclaim,
> +                                              zone->watermark[WMARK_HIGH]);
>                }
>
>                shrink_zone(priority, zone, sc);
> @@ -2034,6 +2036,7 @@ static unsigned long do_try_to_free_page
>        struct zoneref *z;
>        struct zone *zone;
>        unsigned long writeback_threshold;
> +      unsigned long min_reclaim = sc->nr_to_reclaim;
>
>        get_mems_allowed();
>        delayacct_freepages_start();
> @@ -2067,6 +2070,9 @@ static unsigned long do_try_to_free_page
>                        }
>                }
>                total_scanned += sc->nr_scanned;
> +              if (sc->nr_reclaimed >= min_reclaim &&
> +                  total_scanned > 2 * sc->nr_to_reclaim)
> +                      goto out;
>                if (sc->nr_reclaimed >= sc->nr_to_reclaim)
>                        goto out;
>
> --- linux-next.orig/mm/page_alloc.c    2011-04-28 21:16:16.000000000 +0800
> +++ linux-next/mm/page_alloc.c 2011-04-28 21:16:18.000000000 +0800
> @@ -1888,9 +1888,8 @@ __alloc_pages_direct_reclaim(gfp_t gfp_m
>        nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
>        int migratetype, unsigned long *did_some_progress)
>  {
> -      struct page *page = NULL;
> +      struct page *page;
>        struct reclaim_state reclaim_state;
> -      bool drained = false;
>
>        cond_resched();
>
> @@ -1912,22 +1911,12 @@ __alloc_pages_direct_reclaim(gfp_t gfp_m
>        if (unlikely(!(*did_some_progress)))
>                return NULL;
>
> -retry:
> +      alloc_flags |= ALLOC_HARDER;
> +
>        page = get_page_from_freelist(gfp_mask, nodemask, order,
>                                        zonelist, high_zoneidx,
>                                        alloc_flags, preferred_zone,
>                                        migratetype);
> -
> -      /*
> -       * If an allocation failed after direct reclaim, it could be because
> -       * pages are pinned on the per-cpu lists. Drain them and try again
> -       */
> -      if (!page && !drained) {
> -              drain_all_pages();
> -              drained = true;
> -              goto retry;
> -      }
> -
>        return page;
>  }
>

--
Regards
dave