Johannes Weiner <hannes@xxxxxxxxxxx> writes:

> On Wed, Sep 27, 2023 at 01:42:25PM +0800, Huang, Ying wrote:
>> Johannes Weiner <hannes@xxxxxxxxxxx> writes:
>>
>> > The idea behind the cache is to save get_pageblock_migratetype()
>> > lookups during bulk freeing. A microbenchmark suggests this isn't
>> > helping, though. The pcp migratetype can get stale, which means that
>> > bulk freeing has an extra branch to check if the pageblock was
>> > isolated while on the pcp.
>> >
>> > While the variance overlaps, the cache write and the branch seem to
>> > make this a net negative. The following test allocates and frees
>> > batches of 10,000 pages (~3x the pcp high marks to trigger flushing):
>> >
>> > Before:
>> >           8,668.48 msec task-clock       #   99.735 CPUs utilized     ( +- 2.90% )
>> >                 19      context-switches #    4.341 /sec              ( +- 3.24% )
>> >                  0      cpu-migrations   #    0.000 /sec
>> >             17,440      page-faults      #    3.984 K/sec             ( +- 2.90% )
>> >     41,758,692,473      cycles           #    9.541 GHz               ( +- 2.90% )
>> >    126,201,294,231      instructions     #    5.98  insn per cycle    ( +- 2.90% )
>> >     25,348,098,335      branches         #    5.791 G/sec             ( +- 2.90% )
>> >         33,436,921      branch-misses    #    0.26% of all branches   ( +- 2.90% )
>> >
>> >          0.0869148 +- 0.0000302 seconds time elapsed  ( +- 0.03% )
>> >
>> > After:
>> >           8,444.81 msec task-clock       #   99.726 CPUs utilized     ( +- 2.90% )
>> >                 22      context-switches #    5.160 /sec              ( +- 3.23% )
>> >                  0      cpu-migrations   #    0.000 /sec
>> >             17,443      page-faults      #    4.091 K/sec             ( +- 2.90% )
>> >     40,616,738,355      cycles           #    9.527 GHz               ( +- 2.90% )
>> >    126,383,351,792      instructions     #    6.16  insn per cycle    ( +- 2.90% )
>> >     25,224,985,153      branches         #    5.917 G/sec             ( +- 2.90% )
>> >         32,236,793      branch-misses    #    0.25% of all branches   ( +- 2.90% )
>> >
>> >          0.0846799 +- 0.0000412 seconds time elapsed  ( +- 0.05% )
>> >
>> > A side effect is that this also ensures that pages whose pageblock
>> > gets stolen while on the pcplist end up on the right freelist and we
>> > don't perform potentially type-incompatible buddy merges (or skip
>> > merges when we shouldn't), which is likely beneficial to long-term
>> > fragmentation management, although the effects would be harder to
>> > measure. Settle for simpler and faster code as justification here.
>>
>> I suspected the PCP allocating/freeing path may be influenced (that is,
>> allocating/freeing batch is less than PCP high). So I tested
>> one-process will-it-scale/page_fault1 with sysctl
>> percpu_pagelist_high_fraction=8. So pages will be allocated/freed
>> from/to PCP only. The test results are as follows,
>>
>> Before:
>> will-it-scale.1.processes                        618364.3 (+- 0.075%)
>> perf-profile.children.get_pfnblock_flags_mask        0.13 (+- 9.350%)
>>
>> After:
>> will-it-scale.1.processes                        616512.0 (+- 0.057%)
>> perf-profile.children.get_pfnblock_flags_mask        0.41 (+- 22.44%)
>>
>> The change isn't large: -0.3%. Perf profiling shows the cycles% of
>> get_pfnblock_flags_mask() increases.
>
> Ah, this is going through the free_unref_page_list() path that
> Vlastimil had pointed out as well. I made another change on top that
> eliminates the second lookup. After that, both pcp fast paths have the
> same number of lookups as before: 1. This fixes the regression for me.
>
> Would you mind confirming this as well?

I have run more tests on the series and the addon patches.
The test results are as follows,

base
  perf-profile.children.get_pfnblock_flags_mask        0.15 (+- 32.62%)
  will-it-scale.1.processes                        618621.7 (+- 0.18%)

mm: page_alloc: remove pcppage migratetype caching
  perf-profile.children.get_pfnblock_flags_mask        0.40 (+- 21.55%)
  will-it-scale.1.processes                        616350.3 (+- 0.27%)

mm: page_alloc: fix up block types when merging compatible blocks
  perf-profile.children.get_pfnblock_flags_mask        0.36 (+- 8.36%)
  will-it-scale.1.processes                        617121.0 (+- 0.17%)

mm: page_alloc: move free pages when converting block during isolation
  perf-profile.children.get_pfnblock_flags_mask        0.36 (+- 15.10%)
  will-it-scale.1.processes                        615578.0 (+- 0.18%)

mm: page_alloc: fix move_freepages_block() range error
  perf-profile.children.get_pfnblock_flags_mask        0.36 (+- 12.78%)
  will-it-scale.1.processes                        615364.7 (+- 0.27%)

mm: page_alloc: fix freelist movement during block conversion
  perf-profile.children.get_pfnblock_flags_mask        0.36 (+- 10.52%)
  will-it-scale.1.processes                        617834.8 (+- 0.52%)

mm: page_alloc: consolidate free page accounting
  perf-profile.children.get_pfnblock_flags_mask        0.39 (+- 8.27%)
  will-it-scale.1.processes                        621000.0 (+- 0.13%)

mm: page_alloc: close migratetype race between freeing and stealing
  perf-profile.children.get_pfnblock_flags_mask        0.37 (+- 5.87%)
  will-it-scale.1.processes                        618378.8 (+- 0.17%)

mm: page_alloc: optimize free_unref_page_list()
  perf-profile.children.get_pfnblock_flags_mask        0.20 (+- 14.96%)
  will-it-scale.1.processes                        618136.3 (+- 0.16%)

It seems that the will-it-scale score is influenced by some other
factors too. But anyway, the series + addon patches restore the
will-it-scale score. And the cycles% of get_pfnblock_flags_mask() is
almost restored by the final patch (mm: page_alloc: optimize
free_unref_page_list()).

Feel free to add my "Tested-by" for these patches.

--
Best Regards,
Huang, Ying
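
For readers who want to check the percentages, here is a minimal Python sketch that computes the relative change of each will-it-scale score against the base run. The scores are copied from the results in this message; the helper names (relative_change, summarize) are illustrative only and are not part of any test tooling used in the thread.

```python
# Relative change of each will-it-scale score against the "base" run.
# Scores are copied from the results in this message; the helper names
# are hypothetical and exist only for this illustration.

def relative_change(base: float, value: float) -> float:
    """Return the change from base to value, in percent."""
    return (value - base) / base * 100.0


BASE = 618621.7

# (patch subject, will-it-scale.1.processes score), in series order.
SCORES = [
    ("remove pcppage migratetype caching", 616350.3),
    ("fix up block types when merging compatible blocks", 617121.0),
    ("move free pages when converting block during isolation", 615578.0),
    ("fix move_freepages_block() range error", 615364.7),
    ("fix freelist movement during block conversion", 617834.8),
    ("consolidate free page accounting", 621000.0),
    ("close migratetype race between freeing and stealing", 618378.8),
    ("optimize free_unref_page_list()", 618136.3),
]


def summarize(base, scores):
    """Pair each patch subject with its percent change against base."""
    return [(name, relative_change(base, score)) for name, score in scores]


if __name__ == "__main__":
    for name, pct in summarize(BASE, SCORES):
        print(f"{pct:+.2f}%  {name}")
```

Applied to the earlier Before/After pair (618364.3 and 616512.0), the same helper gives roughly -0.3%, matching the figure quoted above. The per-patch changes are of the same order as the reported run-to-run variation (0.13% to 0.52%), which fits the remark that other factors also influence the score.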