Hi Wu,

Sorry for the slow response. I guess you know why I am slow. :)

On Mon, May 2, 2011 at 7:29 PM, Wu Fengguang <fengguang.wu@xxxxxxxxx> wrote:
> Hi Minchan,
>
> On Mon, May 02, 2011 at 12:35:42AM +0800, Minchan Kim wrote:
>> Hi Wu,
>>
>> On Sat, Apr 30, 2011 at 10:17:41PM +0800, Wu Fengguang wrote:
>> > On Fri, Apr 29, 2011 at 10:28:24AM +0800, Wu Fengguang wrote:
>> > > > Test results:
>> > > >
>> > > > - the failure rate is pretty sensitive to the page reclaim size,
>> > > >   from 282 (WMARK_HIGH) to 704 (WMARK_MIN) to 10496 (SWAP_CLUSTER_MAX)
>> > > >
>> > > > - the IPIs are reduced by over 100 times
>> > >
>> > > It's reduced by 500 times indeed.
>> > >
>> > > CAL:   220449   220246   220372   220558   220251   219740   220043   219968   Function call interrupts
>> > > CAL:       93      463      410      540      298      282      272      306   Function call interrupts
>> > >
>> > > > base kernel: vanilla 2.6.39-rc3 + __GFP_NORETRY readahead page allocation patch
>> > > > -------------------------------------------------------------------------------
>> > > > nr_alloc_fail 10496
>> > > > allocstall 1576602
>> > > >
>> > > > patched (WMARK_MIN)
>> > > > -------------------
>> > > > nr_alloc_fail 704
>> > > > allocstall 105551
>> > > >
>> > > > patched (WMARK_HIGH)
>> > > > --------------------
>> > > > nr_alloc_fail 282
>> > > > allocstall 53860
>> > > >
>> > > > this patch (WMARK_HIGH, limited scan)
>> > > > -------------------------------------
>> > > > nr_alloc_fail 276
>> > > > allocstall 54034
>> > >
>> > > There is a bad side effect though: the much reduced "allocstall" means
>> > > each direct reclaim will take much more time to complete. A simple
>> > > solution is to terminate direct reclaim after 10ms. I noticed that a
>> > > 100ms time threshold can reduce the reclaim latency from 621ms to 358ms.
>> > > Further lowering the time threshold to 20ms does not help reduce the
>> > > real latencies though.
>> >
>> > Experiments going on...
>> >
>> > I tried the more reasonable terminate condition: stop direct reclaim
>> > when the preferred zone is above its high watermark (see the chunk below).
>> >
>> > This helps reduce the average reclaim latency to under 100ms in the
>> > 1000-dd case.
>> >
>> > However nr_alloc_fail is around 5000 and not ideal. The interesting
>> > thing is, even if the zone watermark is high, the task still may fail
>> > to get a free page..
>> >
>> > @@ -2067,8 +2072,17 @@ static unsigned long do_try_to_free_page
>> >              }
>> >          }
>> >          total_scanned += sc->nr_scanned;
>> > -        if (sc->nr_reclaimed >= sc->nr_to_reclaim)
>> > -            goto out;
>> > +        if (sc->nr_reclaimed >= min_reclaim) {
>> > +            if (sc->nr_reclaimed >= sc->nr_to_reclaim)
>> > +                goto out;
>> > +            if (total_scanned > 2 * sc->nr_to_reclaim)
>> > +                goto out;
>> > +            if (preferred_zone &&
>> > +                zone_watermark_ok_safe(preferred_zone, sc->order,
>> > +                    high_wmark_pages(preferred_zone),
>> > +                    zone_idx(preferred_zone), 0))
>> > +                goto out;
>> > +        }
>> >
>> >          /*
>> >           * Try to write back as many pages as we just scanned.  This
>> >
>> > Thanks,
>> > Fengguang
>> > ---
>> > Subject: mm: cut down __GFP_NORETRY page allocation failures
>> > Date: Thu Apr 28 13:46:39 CST 2011
>> >
>> > Concurrent page allocations are suffering from high failure rates.
>> >
>> > On an 8p, 3GB ram test box, when reading 1000 sparse files of size 1GB,
>> > the page allocation failures are
>> >
>> > nr_alloc_fail 733    # interleaved reads by 1 single task
>> > nr_alloc_fail 11799  # concurrent reads by 1000 tasks
>> >
>> > The concurrent read test script is:
>> >
>> >     for i in `seq 1000`
>> >     do
>> >         truncate -s 1G /fs/sparse-$i
>> >         dd if=/fs/sparse-$i of=/dev/null &
>> >     done
>> >
>> > In order for get_page_from_freelist() to get a free page,
>> >
>> > (1) try_to_free_pages() should use a much higher .nr_to_reclaim than
>> >     the current SWAP_CLUSTER_MAX=32, in order to draw the zone out of
>> >     the possible low watermark state as well as fill the pcp with
>> >     enough free pages to overflow its high watermark.
>> >
>> > (2) the get_page_from_freelist() _after_ direct reclaim should use a
>> >     lower watermark than its normal invocations, so that it can
>> >     reasonably "reserve" some free pages for itself and prevent other
>> >     concurrent page allocators from stealing all its reclaimed pages.
>>
>> Did you see my old patch? The patch wasn't complete, but it's not bad
>> for showing the idea.
>> http://marc.info/?l=linux-mm&m=129187231129887&w=4
>> The idea is to keep at least one page for the direct reclaiming process.
>> Could it mitigate your problem, or could you enhance the idea?
>> I think it's a very simple and fair solution.
>
> No, it's not helping my problem; nr_alloc_fail and CAL are still high:

Unfortunately, my patch doesn't consider order-0 pages, as you mentioned
below. I read your mail which states it doesn't help even when it is made
to consider order-0 pages and drain. Actually, I tried to look into that,
but on my poor system (core2duo, 2GB ram), nr_alloc_fail never happens. :(
I will try it on another desktop, but I am not sure I can reproduce it.
>
> root@fat /home/wfg# ./test-dd-sparse.sh
> start time: 246
> total time: 531
> nr_alloc_fail 14097
> allocstall 1578332
> LOC:   542698   538947   536986   567118   552114   539605   541201   537623   Local timer interrupts
> RES:     3368     1908     1474     1476     2809     1602     1500     1509   Rescheduling interrupts
> CAL:   223844   224198   224268   224436   223952   224056   223700   223743   Function call interrupts
> TLB:      381       27       22       19       96      404      111       67   TLB shootdowns
>
> root@fat /home/wfg# getdelays -dip `pidof dd`
> print delayacct stats ON
> printing IO accounting
> PID     5202
>
> CPU         count     real total   virtual total     delay total
>              1132    3635447328      3627947550    276722091605
> IO          count    delay total   delay average
>                 2      187809974            62ms
> SWAP        count    delay total   delay average
>                 0              0             0ms
> RECLAIM     count    delay total   delay average
>              1334    35304580824           26ms
> dd: read=278528, write=0, cancelled_write=0
>
> I guess your patch is mainly fixing the high order allocations while
> my workload is mainly order-0 readahead page allocations. There are
> 1000 forks, however the "start time: 246" seems to indicate that the
> order-1 reclaim latency is not improved.

Maybe; 8K * 1000 isn't a big footprint, so I think reclaim doesn't happen.

>
> I'll try modifying your patch and see how it works out. The obvious
> change is to apply it to the order-0 case. Hope this won't create many
> more isolated pages.
>
> Attached is your patch rebased to 2.6.39-rc3, after resolving some
> merge conflicts and fixing a trivial NULL pointer bug.

Thanks! I would like to look into the details on my system if I can
reproduce the problem.

>
>> >
>> > Some notes:
>> >
>> > - commit 9ee493ce ("mm: page allocator: drain per-cpu lists after direct
>> >   reclaim allocation fails") has the same target, however is obviously
>> >   costly and less effective.
>> >   It seems cleaner to just remove the retry and drain code than to
>> >   retain it.
>>
>> Tend to agree.
>> My old patch can solve it, I think.
>
> Sadly nope. See above.
>
>> >
>> > - it's a bit hacky to reclaim more than the requested pages inside
>> >   do_try_to_free_page(), and it won't help cgroup for now
>> >
>> > - it only aims to reduce failures when there are plenty of reclaimable
>> >   pages, so it stops the opportunistic reclaim after scanning 2 times
>> >   that many pages
>> >
>> > Test results:
>> >
>> > - the failure rate is pretty sensitive to the page reclaim size,
>> >   from 282 (WMARK_HIGH) to 704 (WMARK_MIN) to 10496 (SWAP_CLUSTER_MAX)
>> >
>> > - the IPIs are reduced by over 100 times
>> >
>> > base kernel: vanilla 2.6.39-rc3 + __GFP_NORETRY readahead page allocation patch
>> > -------------------------------------------------------------------------------
>> > nr_alloc_fail 10496
>> > allocstall 1576602
>> >
>> > slabs_scanned 21632
>> > kswapd_steal 4393382
>> > kswapd_inodesteal 124
>> > kswapd_low_wmark_hit_quickly 885
>> > kswapd_high_wmark_hit_quickly 2321
>> > kswapd_skip_congestion_wait 0
>> > pageoutrun 29426
>> >
>> > CAL:   220449   220246   220372   220558   220251   219740   220043   219968   Function call interrupts
>> >
>> > LOC:   536274   532529   531734   536801   536510   533676   534853   532038   Local timer interrupts
>> > RES:     3032     2128     1792     1765     2184     1703     1754     1865   Rescheduling interrupts
>> > TLB:      189       15       13       17       64      294       97       63   TLB shootdowns
>> >
>> > patched (WMARK_MIN)
>> > -------------------
>> > nr_alloc_fail 704
>> > allocstall 105551
>> >
>> > slabs_scanned 33280
>> > kswapd_steal 4525537
>> > kswapd_inodesteal 187
>> > kswapd_low_wmark_hit_quickly 4980
>> > kswapd_high_wmark_hit_quickly 2573
>> > kswapd_skip_congestion_wait 0
>> > pageoutrun 35429
>> >
>> > CAL:       93      286      396      754      272      297      275      281   Function call interrupts
>> >
>> > LOC:   520550   517751   517043   522016   520302   518479   519329   517179   Local timer interrupts
>> > RES:     2131     1371     1376     1269     1390     1181     1409     1280   Rescheduling interrupts
>> > TLB:      280       26       27       30       65      305      134       75   TLB shootdowns
>> >
>> > patched (WMARK_HIGH)
>> > --------------------
>> > nr_alloc_fail 282
>> > allocstall 53860
>> >
>> > slabs_scanned 23936
>> > kswapd_steal 4561178
>> > kswapd_inodesteal 0
>> > kswapd_low_wmark_hit_quickly 2760
>> > kswapd_high_wmark_hit_quickly 1748
>> > kswapd_skip_congestion_wait 0
>> > pageoutrun 32639
>> >
>> > CAL:       93      463      410      540      298      282      272      306   Function call interrupts
>> >
>> > LOC:   513956   510749   509890   514897   514300   512392   512825   510574   Local timer interrupts
>> > RES:     1174     2081     1411     1320     1742     2683     1380     1230   Rescheduling interrupts
>> > TLB:      274       21       19       22       57      317      131       61   TLB shootdowns
>> >
>> > patched (WMARK_HIGH, limited scan)
>> > ----------------------------------
>> > nr_alloc_fail 276
>> > allocstall 54034
>> >
>> > slabs_scanned 24320
>> > kswapd_steal 4507482
>> > kswapd_inodesteal 262
>> > kswapd_low_wmark_hit_quickly 2638
>> > kswapd_high_wmark_hit_quickly 1710
>> > kswapd_skip_congestion_wait 0
>> > pageoutrun 32182
>> >
>> > CAL:       69      443      421      567      273      279      269      334   Function call interrupts
>>
>> Looks amazing.
>
> Yeah, I have strong feelings against drain_all_pages() in the direct
> reclaim path. The intuition is, once drain_all_pages() is called, the
> later direct reclaims will have less chance to fill the drained
> buffers and will therefore be forced into drain_all_pages() again and
> again.
>
> drain_all_pages() is probably overkill for preventing OOM.
> Generally speaking, it's questionable to "squeeze the last page before
> OOM".
>
> A typical desktop enters thrashing storms before OOM; as Hugh pointed
> out, this may well not be what end users want. I agree with him and
> personally prefer some applications to be OOM killed rather than have
> the whole system go unusable, thrashing like mad.

Tend to agree. The same rule applies to embedded systems, too.
Couldn't we mitigate the draining by doing it only for high-order pages?

>
>> > LOC:   514736   511698   510993   514069   514185   512986   513838   511229   Local timer interrupts
>> > RES:     2153     1556     1126     1351     3047     1554     1131     1560   Rescheduling interrupts
>> > TLB:      209       26       20       15       71      315      117       71   TLB shootdowns
>> >
>> > patched (WMARK_HIGH, limited scan, stop on watermark OK), 100 dd
>> > ----------------------------------------------------------------
>> >
>> > start time: 3
>> > total time: 50
>> > nr_alloc_fail 162
>> > allocstall 45523
>> >
>> > CPU         count     real total   virtual total     delay total
>> >               921    3024540200      3009244668     37123129525
>> > IO          count    delay total   delay average
>> >                 0              0             0ms
>> > SWAP        count    delay total   delay average
>> >                 0              0             0ms
>> > RECLAIM     count    delay total   delay average
>> >               357     4891766796            13ms
>> > dd: read=0, write=0, cancelled_write=0
>> >
>> > patched (WMARK_HIGH, limited scan, stop on watermark OK), 1000 dd
>> > -----------------------------------------------------------------
>> >
>> > start time: 272
>> > total time: 509
>> > nr_alloc_fail 3913
>> > allocstall 541789
>> >
>> > CPU         count     real total   virtual total     delay total
>> >              1044    3445476208      3437200482    229919915202
>> > IO          count    delay total   delay average
>> >                 0              0             0ms
>> > SWAP        count    delay total   delay average
>> >                 0              0             0ms
>> > RECLAIM     count    delay total   delay average
>> >               452    34691441605            76ms
>> > dd: read=0, write=0, cancelled_write=0
>> >
>> > patched (WMARK_HIGH, limited scan, stop on watermark OK, no time limit), 1000 dd
>> > --------------------------------------------------------------------------------
>> >
>> > start time: 278
>> > total time: 513
>> > nr_alloc_fail 4737
>> > allocstall 436392
>> >
>> > CPU         count     real total   virtual total     delay total
>> >              1024    3371487456      3359441487    225088210977
>> > IO          count    delay total   delay average
>> >                 1      160631171           160ms
>> > SWAP        count    delay total   delay average
>> >                 0              0             0ms
>> > RECLAIM     count    delay total   delay average
>> >               367    30809994722            83ms
>> > dd: read=20480, write=0, cancelled_write=0
>> >
>> > no cond_resched():
>>
>> What's this?
>
> I tried a modified patch that also removes the cond_resched() call in
> __alloc_pages_direct_reclaim(), between try_to_free_pages() and
> get_page_from_freelist(). It seems not to help noticeably.
>
> It looks safe to remove that cond_resched() as we already have such
> calls in shrink_page_list().

I tried a similar thing but Andrew had a concern about it:
https://lkml.org/lkml/2011/3/24/138

>
>> >
>> > start time: 263
>> > total time: 516
>> > nr_alloc_fail 5144
>> > allocstall 436787
>> >
>> > CPU         count     real total   virtual total     delay total
>> >              1018    3305497488      3283831119    241982934044
>> > IO          count    delay total   delay average
>> >                 0              0             0ms
>> > SWAP        count    delay total   delay average
>> >                 0              0             0ms
>> > RECLAIM     count    delay total   delay average
>> >               328    31398481378            95ms
>> > dd: read=0, write=0, cancelled_write=0
>> >
>> > zone_watermark_ok_safe():
>> >
>> > start time: 266
>> > total time: 513
>> > nr_alloc_fail 4526
>> > allocstall 440246
>> >
>> > CPU         count     real total   virtual total     delay total
>> >              1119    3640446568      3619184439    240945024724
>> > IO          count    delay total   delay average
>> >                 3      303620082           101ms
>> > SWAP        count    delay total   delay average
>> >                 0              0             0ms
>> > RECLAIM     count    delay total   delay average
>> >               372    27320731898            73ms
>> > dd: read=77824, write=0, cancelled_write=0
>> >
>
>> > start time: 275
>>
>> What's the meaning of "start time"?
>
> It's the time taken to start the 1000 dd's.
>
>> > total time: 517
>>
>> Is "total time" the elapsed time of your experiment?
>
> Yeah. They are generated with this script.
>
> $ cat ~/bin/test-dd-sparse.sh
>
> #!/bin/sh
>
> mount /dev/sda7 /fs
>
> tic=$(date +'%s')
>
> for i in `seq 1000`
> do
>     truncate -s 1G /fs/sparse-$i
>     dd if=/fs/sparse-$i of=/dev/null &>/dev/null &
> done
>
> tac=$(date +'%s')
> echo start time: $((tac-tic))
>
> wait
>
> tac=$(date +'%s')
> echo total time: $((tac-tic))
>
> egrep '(nr_alloc_fail|allocstall)' /proc/vmstat
> egrep '(CAL|RES|LOC|TLB)' /proc/interrupts
>
>> > nr_alloc_fail 4694
>> > allocstall 431021
>> >
>> > CPU         count     real total   virtual total     delay total
>> >              1073    3534462680      3512544928    234056498221
>>
>> What's the meaning of the CPU fields?
>
> It's the time "waiting for a CPU (while being runnable)" as described in
> Documentation/accounting/delay-accounting.txt.

Thanks.

>
>> > IO          count    delay total   delay average
>> >                 0              0             0ms
>> > SWAP        count    delay total   delay average
>> >                 0              0             0ms
>> > RECLAIM     count    delay total   delay average
>> >               386    34751778363            89ms
>> > dd: read=0, write=0, cancelled_write=0
>> >
>>
>> Where is the vanilla data for comparing latency?
>> Personally, it's hard to parse your data.
>
> Sorry, it's somehow too much data and too many kernel revisions..
> The base kernel's average latency is 29ms:
>
> base kernel: vanilla 2.6.39-rc3 + __GFP_NORETRY readahead page allocation patch
> -------------------------------------------------------------------------------
>
> CPU         count     real total   virtual total     delay total
>              1122    3676441096      3656793547    274182127286
> IO          count    delay total   delay average
>                 3      291765493            97ms
> SWAP        count    delay total   delay average
>                 0              0             0ms
> RECLAIM     count    delay total   delay average
>              1350    39229752193            29ms
> dd: read=45056, write=0, cancelled_write=0
>
> start time: 245
> total time: 526
> nr_alloc_fail 14586
> allocstall 1578343
> LOC:   533981   529210   528283   532346   533392   531314   531705   528983   Local timer interrupts
> RES:     3123     2177     1676     1580     2157     1974     1606     1696   Rescheduling interrupts
> CAL:   218392   218631   219167   219217   218840   218985   218429   218440   Function call interrupts
> TLB:      175       13       21       18       62      309      119       42   TLB shootdowns
>
>> >
>> > CC: Mel Gorman <mel@xxxxxxxxxxxxxxxxxx>
>> > Signed-off-by: Wu Fengguang <fengguang.wu@xxxxxxxxx>
>> > ---
>> >  fs/buffer.c          |    4 ++--
>> >  include/linux/swap.h |    3 ++-
>> >  mm/page_alloc.c      |   20 +++++---------------
>> >  mm/vmscan.c          |   31 +++++++++++++++++++++++--------
>> >  4 files changed, 32 insertions(+), 26 deletions(-)
>> > --- linux-next.orig/mm/vmscan.c    2011-04-29 10:42:14.000000000 +0800
>> > +++ linux-next/mm/vmscan.c    2011-04-30 21:59:33.000000000 +0800
>> > @@ -2025,8 +2025,9 @@ static bool all_unreclaimable(struct zon
>> >   * returns:  0, if no pages reclaimed
>> >   *           else, the number of pages reclaimed
>> >   */
>> > -static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>> > -                    struct scan_control *sc)
>> > +static unsigned long do_try_to_free_pages(struct zone *preferred_zone,
>> > +                    struct zonelist *zonelist,
>> > +                    struct scan_control *sc)
>> >  {
>> >      int priority;
>> >      unsigned long total_scanned = 0;
>> > @@ -2034,6 +2035,7 @@ static unsigned long do_try_to_free_page
>> >      struct zoneref *z;
>> >      struct zone *zone;
>> >      unsigned long writeback_threshold;
>> > +    unsigned long min_reclaim = sc->nr_to_reclaim;
>>
>> Hmm,
>>
>> >
>> >      get_mems_allowed();
>> >      delayacct_freepages_start();
>> > @@ -2041,6 +2043,9 @@ static unsigned long do_try_to_free_page
>> >      if (scanning_global_lru(sc))
>> >          count_vm_event(ALLOCSTALL);
>> >
>> > +    if (preferred_zone)
>> > +        sc->nr_to_reclaim += preferred_zone->watermark[WMARK_HIGH];
>> > +
>>
>> Hmm, I don't like this idea.
>> The goal of the direct reclaim path is to reclaim pages asap, I believe.
>> Many things should be achieved by background kswapd.
>> If the admin changes min_free_kbytes, it can affect the latency of
>> direct reclaim. It doesn't make sense to me.
>
> Yeah, it does increase delays.. in the 1000 dd case, roughly from 30ms
> to 90ms. This is a major drawback.

Yes.

>
>> >      for (priority = DEF_PRIORITY; priority >= 0; priority--) {
>> >          sc->nr_scanned = 0;
>> >          if (!priority)
>> > @@ -2067,8 +2072,17 @@ static unsigned long do_try_to_free_page
>> >              }
>> >          }
>> >          total_scanned += sc->nr_scanned;
>> > -        if (sc->nr_reclaimed >= sc->nr_to_reclaim)
>> > -            goto out;
>> > +        if (sc->nr_reclaimed >= min_reclaim) {
>> > +            if (sc->nr_reclaimed >= sc->nr_to_reclaim)
>> > +                goto out;
>>
>> I can't understand the logic.
>> If nr_reclaimed is bigger than min_reclaim, it's always greater than
>> nr_to_reclaim. What's the meaning of min_reclaim?
>
> In direct reclaim, min_reclaim will be the legacy SWAP_CLUSTER_MAX and
> sc->nr_to_reclaim will be increased to the zone's high watermark, so it
> is kind of a "max to reclaim".
>
>> > +            if (total_scanned > 2 * sc->nr_to_reclaim)
>> > +                goto out;
>>
>> What if there are lots of dirty pages in the LRU?
>> What if there are lots of unevictable pages in the LRU?
>> What if there are lots of mapped pages in the LRU but may_unmap = 0?
>> I mean, it's a rather risky early conclusion.
>
> That test means to avoid scanning too much on __GFP_NORETRY direct
> reclaims. My assumption for __GFP_NORETRY is, it should fail fast when
> the LRU pages seem hard to reclaim. And the problem in the 1000 dd
> case is, they are all easy-to-reclaim LRU pages but __GFP_NORETRY still
> fails from time to time, with lots of IPIs that may hurt large
> machines a lot.

I don't have enough time and an environment to test it, so I can't make
sure of it, but my concern is the latency.
If you solve the latency problem considering CPU scaling, I won't oppose
it. :)

--
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@xxxxxxxxxx. For more info on Linux MM, see:
http://www.linux-mm.org/
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/