Hi Wu,

Sorry for the slow response. I guess you know why I am slow. :)

On Mon, May 2, 2011 at 7:29 PM, Wu Fengguang <fengguang.wu@xxxxxxxxx> wrote:
> Hi Minchan,
>
> On Mon, May 02, 2011 at 12:35:42AM +0800, Minchan Kim wrote:
>> Hi Wu,
>>
>> On Sat, Apr 30, 2011 at 10:17:41PM +0800, Wu Fengguang wrote:
>> > On Fri, Apr 29, 2011 at 10:28:24AM +0800, Wu Fengguang wrote:
>> > > > Test results:
>> > > >
>> > > > - the failure rate is pretty sensitive to the page reclaim size,
>> > > >   from 282 (WMARK_HIGH) to 704 (WMARK_MIN) to 10496 (SWAP_CLUSTER_MAX)
>> > > >
>> > > > - the IPIs are reduced by over 100 times
>> > >
>> > > It's reduced by 500 times indeed.
>> > >
>> > > CAL:   220449   220246   220372   220558   220251   219740   220043   219968   Function call interrupts
>> > > CAL:       93      463      410      540      298      282      272      306   Function call interrupts
>> > >
>> > > > base kernel: vanilla 2.6.39-rc3 + __GFP_NORETRY readahead page allocation patch
>> > > > -------------------------------------------------------------------------------
>> > > > nr_alloc_fail 10496
>> > > > allocstall 1576602
>> > > >
>> > > > patched (WMARK_MIN)
>> > > > -------------------
>> > > > nr_alloc_fail 704
>> > > > allocstall 105551
>> > > >
>> > > > patched (WMARK_HIGH)
>> > > > --------------------
>> > > > nr_alloc_fail 282
>> > > > allocstall 53860
>> > > >
>> > > > this patch (WMARK_HIGH, limited scan)
>> > > > -------------------------------------
>> > > > nr_alloc_fail 276
>> > > > allocstall 54034
>> > >
>> > > There is a bad side effect though: the much reduced "allocstall" means
>> > > each direct reclaim will take much more time to complete. A simple
>> > > solution is to terminate direct reclaim after 10ms. I noticed that a
>> > > 100ms time threshold can reduce the reclaim latency from 621ms to 358ms.
>> > > Further lowering the time threshold to 20ms does not help reduce the
>> > > real latencies though.
>> >
>> > Experiments going on...
>> >
>> > I tried the more reasonable terminate condition: stop direct reclaim
>> > when the preferred zone is above its high watermark (see the chunk below).
>> >
>> > This helps reduce the average reclaim latency to under 100ms in the
>> > 1000-dd case.
>> >
>> > However nr_alloc_fail is around 5000 and not ideal. The interesting
>> > thing is, even if the zone watermark is high, the task still may fail
>> > to get a free page..
>> >
>> > @@ -2067,8 +2072,17 @@ static unsigned long do_try_to_free_page
>> >              }
>> >          }
>> >          total_scanned += sc->nr_scanned;
>> > -        if (sc->nr_reclaimed >= sc->nr_to_reclaim)
>> > -            goto out;
>> > +        if (sc->nr_reclaimed >= min_reclaim) {
>> > +            if (sc->nr_reclaimed >= sc->nr_to_reclaim)
>> > +                goto out;
>> > +            if (total_scanned > 2 * sc->nr_to_reclaim)
>> > +                goto out;
>> > +            if (preferred_zone &&
>> > +                zone_watermark_ok_safe(preferred_zone, sc->order,
>> > +                    high_wmark_pages(preferred_zone),
>> > +                    zone_idx(preferred_zone), 0))
>> > +                goto out;
>> > +        }
>> >
>> >          /*
>> >           * Try to write back as many pages as we just scanned.  This
>> >
>> > Thanks,
>> > Fengguang
>> > ---
>> > Subject: mm: cut down __GFP_NORETRY page allocation failures
>> > Date: Thu Apr 28 13:46:39 CST 2011
>> >
>> > Concurrent page allocations are suffering from high failure rates.
>> >
>> > On an 8p, 3GB ram test box, when reading 1000 sparse files of size 1GB,
>> > the page allocation failures are
>> >
>> > nr_alloc_fail 733    # interleaved reads by 1 single task
>> > nr_alloc_fail 11799  # concurrent reads by 1000 tasks
>> >
>> > The concurrent read test script is:
>> >
>> >     for i in `seq 1000`
>> >     do
>> >         truncate -s 1G /fs/sparse-$i
>> >         dd if=/fs/sparse-$i of=/dev/null &
>> >     done
>> >
>> > In order for get_page_from_freelist() to get a free page,
>> >
>> > (1) try_to_free_pages() should use a much higher .nr_to_reclaim than
>> >     the current SWAP_CLUSTER_MAX=32, in order to draw the zone out of
>> >     the possible low watermark state as well as fill the pcp with
>> >     enough free pages to overflow its high watermark.
>> >
>> > (2) the get_page_from_freelist() _after_ direct reclaim should use a
>> >     lower watermark than its normal invocations, so that it can
>> >     reasonably "reserve" some free pages for itself and prevent other
>> >     concurrent page allocators from stealing all its reclaimed pages.
>>
>> Did you see my old patch? The patch wasn't complete, but it's not bad
>> for showing the idea.
>> http://marc.info/?l=linux-mm&m=129187231129887&w=4
>> The idea is to keep at least one page for the direct reclaiming process.
>> Could it mitigate your problem, or could you enhance the idea?
>> I think it's a very simple and fair solution.
>
> No, it's not helping my problem; nr_alloc_fail and CAL are still high:

Unfortunately, my patch doesn't consider order-0 pages, as you mentioned
below. I read your mail which states it doesn't help even when it is made
to consider order-0 pages and drain. Actually, I tried to look into that,
but on my poor system (core2duo, 2GB ram), nr_alloc_fail never happens. :(
I will try it on another desktop, but I am not sure I can reproduce it.
>
> root@fat /home/wfg# ./test-dd-sparse.sh
> start time: 246
> total time: 531
> nr_alloc_fail 14097
> allocstall 1578332
> LOC:   542698   538947   536986   567118   552114   539605   541201   537623   Local timer interrupts
> RES:     3368     1908     1474     1476     2809     1602     1500     1509   Rescheduling interrupts
> CAL:   223844   224198   224268   224436   223952   224056   223700   223743   Function call interrupts
> TLB:      381       27       22       19       96      404      111       67   TLB shootdowns
>
> root@fat /home/wfg# getdelays -dip `pidof dd`
> print delayacct stats ON
> printing IO accounting
> PID     5202
>
> CPU         count     real total   virtual total     delay total
>              1132    3635447328      3627947550    276722091605
> IO          count    delay total   delay average
>                 2      187809974            62ms
> SWAP        count    delay total   delay average
>                 0              0             0ms
> RECLAIM     count    delay total   delay average
>              1334    35304580824           26ms
> dd: read=278528, write=0, cancelled_write=0
>
> I guess your patch is mainly fixing the high order allocations while
> my workload is mainly order-0 readahead page allocations. There are
> 1000 forks, however the "start time: 246" seems to indicate that the
> order-1 reclaim latency is not improved.

Maybe; 8K * 1000 isn't a big footprint, so I think reclaim doesn't happen.

>
> I'll try modifying your patch and see how it works out. The obvious
> change is to apply it to the order-0 case. Hope this won't create many
> more isolated pages.
>
> Attached is your patch rebased to 2.6.39-rc3, after resolving some
> merge conflicts and fixing a trivial NULL pointer bug.

Thanks! I would like to look into the details on my system if I can
reproduce the problem.

>
>> >
>> > Some notes:
>> >
>> > - commit 9ee493ce ("mm: page allocator: drain per-cpu lists after direct
>> >   reclaim allocation fails") has the same target, however is obviously
>> >   costly and less effective.
>> >   It seems cleaner to just remove the retry and drain code than to
>> >   retain it.
>>
>> Tend to agree.
>> My old patch can solve it, I think.
>
> Sadly nope. See above.
>
>> >
>> > - it's a bit hacky to reclaim more than the requested pages inside
>> >   do_try_to_free_page(), and it won't help cgroup for now
>> >
>> > - it only aims to reduce failures when there are plenty of reclaimable
>> >   pages, so it stops the opportunistic reclaim after scanning 2 times
>> >   that many pages
>> >
>> > Test results:
>> >
>> > - the failure rate is pretty sensitive to the page reclaim size,
>> >   from 282 (WMARK_HIGH) to 704 (WMARK_MIN) to 10496 (SWAP_CLUSTER_MAX)
>> >
>> > - the IPIs are reduced by over 100 times
>> >
>> > base kernel: vanilla 2.6.39-rc3 + __GFP_NORETRY readahead page allocation patch
>> > -------------------------------------------------------------------------------
>> > nr_alloc_fail 10496
>> > allocstall 1576602
>> >
>> > slabs_scanned 21632
>> > kswapd_steal 4393382
>> > kswapd_inodesteal 124
>> > kswapd_low_wmark_hit_quickly 885
>> > kswapd_high_wmark_hit_quickly 2321
>> > kswapd_skip_congestion_wait 0
>> > pageoutrun 29426
>> >
>> > CAL:   220449   220246   220372   220558   220251   219740   220043   219968   Function call interrupts
>> >
>> > LOC:   536274   532529   531734   536801   536510   533676   534853   532038   Local timer interrupts
>> > RES:     3032     2128     1792     1765     2184     1703     1754     1865   Rescheduling interrupts
>> > TLB:      189       15       13       17       64      294       97       63   TLB shootdowns
>> >
>> > patched (WMARK_MIN)
>> > -------------------
>> > nr_alloc_fail 704
>> > allocstall 105551
>> >
>> > slabs_scanned 33280
>> > kswapd_steal 4525537
>> > kswapd_inodesteal 187
>> > kswapd_low_wmark_hit_quickly 4980
>> > kswapd_high_wmark_hit_quickly 2573
>> > kswapd_skip_congestion_wait 0
>> > pageoutrun 35429
>> >
>> > CAL:       93      286      396      754      272      297      275      281   Function call interrupts
>> >
>> > LOC:   520550   517751   517043   522016   520302   518479   519329   517179   Local timer interrupts
>> > RES:     2131     1371     1376     1269     1390     1181     1409     1280   Rescheduling interrupts
>> > TLB:      280       26       27       30       65      305      134       75   TLB shootdowns
>> >
>> > patched (WMARK_HIGH)
>> > --------------------
>> > nr_alloc_fail 282
>> > allocstall 53860
>> >
>> > slabs_scanned 23936
>> > kswapd_steal 4561178
>> > kswapd_inodesteal 0
>> > kswapd_low_wmark_hit_quickly 2760
>> > kswapd_high_wmark_hit_quickly 1748
>> > kswapd_skip_congestion_wait 0
>> > pageoutrun 32639
>> >
>> > CAL:       93      463      410      540      298      282      272      306   Function call interrupts
>> >
>> > LOC:   513956   510749   509890   514897   514300   512392   512825   510574   Local timer interrupts
>> > RES:     1174     2081     1411     1320     1742     2683     1380     1230   Rescheduling interrupts
>> > TLB:      274       21       19       22       57      317      131       61   TLB shootdowns
>> >
>> > patched (WMARK_HIGH, limited scan)
>> > ----------------------------------
>> > nr_alloc_fail 276
>> > allocstall 54034
>> >
>> > slabs_scanned 24320
>> > kswapd_steal 4507482
>> > kswapd_inodesteal 262
>> > kswapd_low_wmark_hit_quickly 2638
>> > kswapd_high_wmark_hit_quickly 1710
>> > kswapd_skip_congestion_wait 0
>> > pageoutrun 32182
>> >
>> > CAL:       69      443      421      567      273      279      269      334   Function call interrupts
>>
>> Looks amazing.
>
> Yeah, I have strong feelings against drain_all_pages() in the direct
> reclaim path. The intuition is, once drain_all_pages() is called, the
> later direct reclaims will have less chance to fill the drained
> buffers and will therefore be forced into drain_all_pages() again and
> again.
>
> drain_all_pages() is probably overkill for preventing OOM.
> Generally speaking, it's questionable to "squeeze the last page before
> OOM".
>
> A typical desktop enters thrashing storms before OOM; as Hugh pointed
> out, this may well not be what end users want. I agree with him and
> personally prefer some applications to be OOM killed rather than have
> the whole system go unusable, thrashing like mad.

Tend to agree. The same rule applies to embedded systems, too.
Couldn't we mitigate the draining by doing it only for high-order pages?

>
>> > LOC:   514736   511698   510993   514069   514185   512986   513838   511229   Local timer interrupts
>> > RES:     2153     1556     1126     1351     3047     1554     1131     1560   Rescheduling interrupts
>> > TLB:      209       26       20       15       71      315      117       71   TLB shootdowns
>> >
>> > patched (WMARK_HIGH, limited scan, stop on watermark OK), 100 dd
>> > ----------------------------------------------------------------
>> >
>> > start time: 3
>> > total time: 50
>> > nr_alloc_fail 162
>> > allocstall 45523
>> >
>> > CPU         count     real total   virtual total     delay total
>> >               921    3024540200      3009244668     37123129525
>> > IO          count    delay total   delay average
>> >                 0              0             0ms
>> > SWAP        count    delay total   delay average
>> >                 0              0             0ms
>> > RECLAIM     count    delay total   delay average
>> >               357     4891766796            13ms
>> > dd: read=0, write=0, cancelled_write=0
>> >
>> > patched (WMARK_HIGH, limited scan, stop on watermark OK), 1000 dd
>> > -----------------------------------------------------------------
>> >
>> > start time: 272
>> > total time: 509
>> > nr_alloc_fail 3913
>> > allocstall 541789
>> >
>> > CPU         count     real total   virtual total     delay total
>> >              1044    3445476208      3437200482    229919915202
>> > IO          count    delay total   delay average
>> >                 0              0             0ms
>> > SWAP        count    delay total   delay average
>> >                 0              0             0ms
>> > RECLAIM     count    delay total   delay average
>> >               452    34691441605            76ms
>> > dd: read=0, write=0, cancelled_write=0
>> >
>> > patched (WMARK_HIGH, limited scan, stop on watermark OK, no time limit), 1000 dd
>> > --------------------------------------------------------------------------------
>> >
>> > start time: 278
>> > total time: 513
>> > nr_alloc_fail 4737
>> > allocstall 436392
>> >
>> > CPU         count     real total   virtual total     delay total
>> >              1024    3371487456      3359441487    225088210977
>> > IO          count    delay total   delay average
>> >                 1      160631171           160ms
>> > SWAP        count    delay total   delay average
>> >                 0              0             0ms
>> > RECLAIM     count    delay total   delay average
>> >               367    30809994722            83ms
>> > dd: read=20480, write=0, cancelled_write=0
>> >
>> > no cond_resched():
>>
>> What's this?
>
> I tried a modified patch that also removes the cond_resched() call in
> __alloc_pages_direct_reclaim(), between try_to_free_pages() and
> get_page_from_freelist(). It seems not to help noticeably.
>
> It looks safe to remove that cond_resched() as we already have such
> calls in shrink_page_list().

I tried a similar thing but Andrew had a concern about it:
https://lkml.org/lkml/2011/3/24/138

>
>> >
>> > start time: 263
>> > total time: 516
>> > nr_alloc_fail 5144
>> > allocstall 436787
>> >
>> > CPU         count     real total   virtual total     delay total
>> >              1018    3305497488      3283831119    241982934044
>> > IO          count    delay total   delay average
>> >                 0              0             0ms
>> > SWAP        count    delay total   delay average
>> >                 0              0             0ms
>> > RECLAIM     count    delay total   delay average
>> >               328    31398481378            95ms
>> > dd: read=0, write=0, cancelled_write=0
>> >
>> > zone_watermark_ok_safe():
>> >
>> > start time: 266
>> > total time: 513
>> > nr_alloc_fail 4526
>> > allocstall 440246
>> >
>> > CPU         count     real total   virtual total     delay total
>> >              1119    3640446568      3619184439    240945024724
>> > IO          count    delay total   delay average
>> >                 3      303620082           101ms
>> > SWAP        count    delay total   delay average
>> >                 0              0             0ms
>> > RECLAIM     count    delay total   delay average
>> >               372    27320731898            73ms
>> > dd: read=77824, write=0, cancelled_write=0
>> >
>
>> > start time: 275
>>
>> What's the meaning of "start time"?
>
> It's the time taken to start the 1000 dd's.
>
>> > total time: 517
>>
>> Is "total time" the elapsed time of your experiment?
>
> Yeah. They are generated with this script.
>
> $ cat ~/bin/test-dd-sparse.sh
>
> #!/bin/sh
>
> mount /dev/sda7 /fs
>
> tic=$(date +'%s')
>
> for i in `seq 1000`
> do
>     truncate -s 1G /fs/sparse-$i
>     dd if=/fs/sparse-$i of=/dev/null &>/dev/null &
> done
>
> tac=$(date +'%s')
> echo start time: $((tac-tic))
>
> wait
>
> tac=$(date +'%s')
> echo total time: $((tac-tic))
>
> egrep '(nr_alloc_fail|allocstall)' /proc/vmstat
> egrep '(CAL|RES|LOC|TLB)' /proc/interrupts
>
>> > nr_alloc_fail 4694
>> > allocstall 431021
>> >
>> > CPU         count     real total   virtual total     delay total
>> >              1073    3534462680      3512544928    234056498221
>>
>> What's the meaning of the CPU fields?
>
> It's the time "waiting for a CPU (while being runnable)" as described in
> Documentation/accounting/delay-accounting.txt.

Thanks.

>
>> > IO          count    delay total   delay average
>> >                 0              0             0ms
>> > SWAP        count    delay total   delay average
>> >                 0              0             0ms
>> > RECLAIM     count    delay total   delay average
>> >               386    34751778363            89ms
>> > dd: read=0, write=0, cancelled_write=0
>> >
>>
>> Where is the vanilla data for comparing latency?
>> Personally, it's hard to parse your data.
>
> Sorry, it's somehow too much data and too many kernel revisions..
> The base kernel's average latency is 29ms:
>
> base kernel: vanilla 2.6.39-rc3 + __GFP_NORETRY readahead page allocation patch
> -------------------------------------------------------------------------------
>
> CPU         count     real total   virtual total     delay total
>              1122    3676441096      3656793547    274182127286
> IO          count    delay total   delay average
>                 3      291765493            97ms
> SWAP        count    delay total   delay average
>                 0              0             0ms
> RECLAIM     count    delay total   delay average
>              1350    39229752193            29ms
> dd: read=45056, write=0, cancelled_write=0
>
> start time: 245
> total time: 526
> nr_alloc_fail 14586
> allocstall 1578343
> LOC:   533981   529210   528283   532346   533392   531314   531705   528983   Local timer interrupts
> RES:     3123     2177     1676     1580     2157     1974     1606     1696   Rescheduling interrupts
> CAL:   218392   218631   219167   219217   218840   218985   218429   218440   Function call interrupts
> TLB:      175       13       21       18       62      309      119       42   TLB shootdowns
>
>> >
>> > CC: Mel Gorman <mel@xxxxxxxxxxxxxxxxxx>
>> > Signed-off-by: Wu Fengguang <fengguang.wu@xxxxxxxxx>
>> > ---
>> >  fs/buffer.c          |    4 ++--
>> >  include/linux/swap.h |    3 ++-
>> >  mm/page_alloc.c      |   20 +++++---------------
>> >  mm/vmscan.c          |   31 +++++++++++++++++++++++--------
>> >  4 files changed, 32 insertions(+), 26 deletions(-)
>> > --- linux-next.orig/mm/vmscan.c    2011-04-29 10:42:14.000000000 +0800
>> > +++ linux-next/mm/vmscan.c    2011-04-30 21:59:33.000000000 +0800
>> > @@ -2025,8 +2025,9 @@ static bool all_unreclaimable(struct zon
>> >   * returns:  0, if no pages reclaimed
>> >   *           else, the number of pages reclaimed
>> >   */
>> > -static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>> > -                    struct scan_control *sc)
>> > +static unsigned long do_try_to_free_pages(struct zone *preferred_zone,
>> > +                    struct zonelist *zonelist,
>> > +                    struct scan_control *sc)
>> >  {
>> >      int priority;
>> >      unsigned long total_scanned = 0;
>> > @@ -2034,6 +2035,7 @@ static unsigned long do_try_to_free_page
>> >      struct zoneref *z;
>> >      struct zone *zone;
>> >      unsigned long writeback_threshold;
>> > +    unsigned long min_reclaim = sc->nr_to_reclaim;
>>
>> Hmm,
>>
>> >
>> >      get_mems_allowed();
>> >      delayacct_freepages_start();
>> > @@ -2041,6 +2043,9 @@ static unsigned long do_try_to_free_page
>> >      if (scanning_global_lru(sc))
>> >          count_vm_event(ALLOCSTALL);
>> >
>> > +    if (preferred_zone)
>> > +        sc->nr_to_reclaim += preferred_zone->watermark[WMARK_HIGH];
>> > +
>>
>> Hmm, I don't like this idea.
>> The goal of the direct reclaim path is to reclaim pages asap, I believe.
>> Many things should be achieved by background kswapd.
>> If the admin changes min_free_kbytes, it can affect the latency of
>> direct reclaim. It doesn't make sense to me.
>
> Yeah, it does increase delays.. in the 1000 dd case, roughly from 30ms
> to 90ms. This is a major drawback.

Yes.

>
>> >      for (priority = DEF_PRIORITY; priority >= 0; priority--) {
>> >          sc->nr_scanned = 0;
>> >          if (!priority)
>> > @@ -2067,8 +2072,17 @@ static unsigned long do_try_to_free_page
>> >              }
>> >          }
>> >          total_scanned += sc->nr_scanned;
>> > -        if (sc->nr_reclaimed >= sc->nr_to_reclaim)
>> > -            goto out;
>> > +        if (sc->nr_reclaimed >= min_reclaim) {
>> > +            if (sc->nr_reclaimed >= sc->nr_to_reclaim)
>> > +                goto out;
>>
>> I can't understand the logic.
>> If nr_reclaimed is bigger than min_reclaim, it's always greater than
>> nr_to_reclaim. What's the meaning of min_reclaim?
>
> In direct reclaim, min_reclaim will be the legacy SWAP_CLUSTER_MAX and
> sc->nr_to_reclaim will be increased to the zone's high watermark, so it
> is kind of a "max to reclaim".
>
>> > +            if (total_scanned > 2 * sc->nr_to_reclaim)
>> > +                goto out;
>>
>> What if there are lots of dirty pages in the LRU?
>> What if there are lots of unevictable pages in the LRU?
>> What if there are lots of mapped pages in the LRU but may_unmap = 0?
>> I mean, it's a rather risky early conclusion.
>
> That test means to avoid scanning too much on __GFP_NORETRY direct
> reclaims. My assumption for __GFP_NORETRY is, it should fail fast when
> the LRU pages seem hard to reclaim. And the problem in the 1000 dd
> case is, they are all easy-to-reclaim LRU pages but __GFP_NORETRY still
> fails from time to time, with lots of IPIs that may hurt large
> machines a lot.

I don't have enough time and an environment to test it, so I can't make
sure of it, but my concern is the latency.
If you solve the latency problem considering CPU scaling, I won't oppose
it. :)

--
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@xxxxxxxxxx. For more info on Linux MM, see:
http://www.linux-mm.org/
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/