Hi, Nathan,

Thanks for your information!  That's valuable.

Nathan Chancellor <nathan@xxxxxxxxxx> writes:

> Hi Ying,
>
> On Wed, Oct 19, 2022 at 10:05:50AM +0800, Huang, Ying wrote:
>> Hi, Yujie,
>>
>> >      32528 ± 48%    +147.6%      80547 ± 38%  numa-meminfo.node0.AnonHugePages
>> >      92821 ± 23%     +59.3%     147839 ± 28%  numa-meminfo.node0.AnonPages
>>
>> Many more anonymous pages are allocated than with the parent commit.
>> This is expected, because THP instead of normal pages will be
>> allocated for the aligned memory areas.
>>
>> >      95.23           -79.8       15.41 ±  6%  perf-profile.calltrace.cycles-pp.__munmap
>> >      95.08           -79.7       15.40 ±  6%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__munmap
>> >      95.02           -79.6       15.39 ±  6%  perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__munmap
>> >      94.96           -79.6       15.37 ±  6%  perf-profile.calltrace.cycles-pp.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe.__munmap
>> >      94.95           -79.6       15.37 ±  6%  perf-profile.calltrace.cycles-pp.__vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe.__munmap
>> >      94.86           -79.5       15.35 ±  6%  perf-profile.calltrace.cycles-pp.__do_munmap.__vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe
>> >      94.38           -79.2       15.22 ±  6%  perf-profile.calltrace.cycles-pp.unmap_region.__do_munmap.__vm_munmap.__x64_sys_munmap.do_syscall_64
>> >      42.74           -42.7        0.00        perf-profile.calltrace.cycles-pp.lru_add_drain.unmap_region.__do_munmap.__vm_munmap.__x64_sys_munmap
>> >      42.74           -42.7        0.00        perf-profile.calltrace.cycles-pp.lru_add_drain_cpu.lru_add_drain.unmap_region.__do_munmap.__vm_munmap
>> >      42.72           -42.7        0.00        perf-profile.calltrace.cycles-pp.folio_batch_move_lru.lru_add_drain_cpu.lru_add_drain.unmap_region.__do_munmap
>> >      41.84           -41.8        0.00        perf-profile.calltrace.cycles-pp.folio_lruvec_lock_irqsave.folio_batch_move_lru.lru_add_drain_cpu.lru_add_drain.unmap_region
>> >      41.70           -41.7        0.00        perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.lru_add_drain_cpu.lru_add_drain
>> >      41.62           -41.6        0.00        perf-profile.calltrace.cycles-pp.folio_lruvec_lock_irqsave.release_pages.tlb_batch_pages_flush.tlb_finish_mmu.unmap_region
>> >      41.55           -41.6        0.00        perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.lru_add_drain_cpu
>> >      41.52           -41.5        0.00        perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.release_pages.tlb_batch_pages_flush.tlb_finish_mmu
>> >      41.28           -41.3        0.00        perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.release_pages.tlb_batch_pages_flush
>>
>> In the parent commit, most CPU cycles are spent contending on the LRU
>> lock.
>>
>> >       0.00            +4.8        4.82 ±  7%  perf-profile.calltrace.cycles-pp.do_huge_pmd_anonymous_page.__handle_mm_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault
>> >       0.00            +4.9        4.88 ±  7%  perf-profile.calltrace.cycles-pp.zap_huge_pmd.zap_pmd_range.unmap_page_range.unmap_vmas.unmap_region
>> >       0.00            +8.2        8.22 ±  8%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.rmqueue_bulk.rmqueue.get_page_from_freelist
>> >       0.00            +8.2        8.23 ±  8%  perf-profile.calltrace.cycles-pp._raw_spin_lock.rmqueue_bulk.rmqueue.get_page_from_freelist.__alloc_pages
>> >       0.00            +8.3        8.35 ±  8%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.free_pcppages_bulk.free_unref_page.release_pages
>> >       0.00            +8.3        8.35 ±  8%  perf-profile.calltrace.cycles-pp._raw_spin_lock.free_pcppages_bulk.free_unref_page.release_pages.tlb_batch_pages_flush
>> >       0.00            +8.4        8.37 ±  8%  perf-profile.calltrace.cycles-pp.free_pcppages_bulk.free_unref_page.release_pages.tlb_batch_pages_flush.tlb_finish_mmu
>> >       0.00            +9.6        9.60 ±  6%  perf-profile.calltrace.cycles-pp.free_unref_page.release_pages.tlb_batch_pages_flush.tlb_finish_mmu.unmap_region
>> >       0.00           +65.5       65.48 ±  2%  perf-profile.calltrace.cycles-pp.clear_page_erms.clear_huge_page.__do_huge_pmd_anonymous_page.__handle_mm_fault.handle_mm_fault
>> >       0.00           +72.5       72.51 ±  2%  perf-profile.calltrace.cycles-pp.clear_huge_page.__do_huge_pmd_anonymous_page.__handle_mm_fault.handle_mm_fault.do_user_addr_fault
>>
>> With the commit, most CPU cycles are consumed by clearing huge pages.
>> This is expected: we allocate more pages, so we need more cycles to
>> clear them.
>>
>> Checking the source code of the test case (will-it-scale/malloc1), I
>> found that it allocates some memory with malloc(), then frees it.
>>
>> In the parent commit, because the virtual memory address isn't aligned
>> to 2M, normal pages are allocated. With the commit, THP is allocated
>> instead, so there is more page clearing and less LRU lock contention.
>> I think this is the expected behavior of the commit. And the test
>> case's access pattern isn't very common (malloc() then free() without
>> ever accessing the allocated memory), so this regression isn't
>> important. We can just ignore it.
>
> For what it's worth, I just bisected a massive and visible performance
> regression on my Threadripper 3990X workstation to commit f35b5d7d676e
> ("mm: align larger anonymous mappings on THP boundaries"), which seems
> directly related to this report/analysis. I initially noticed this
> because my full set of kernel builds against mainline went from 2 hours
> and 20 minutes or so to over 3 hours.
> Zeroing in on x86_64 allmodconfig, which I used for the bisect:
>
> @ 7b5a0b664ebe ("mm/page_ext: remove unused variable in offline_page_ext"):
>
> Benchmark 1: make -skj128 LLVM=1 allmodconfig all
>   Time (mean ± σ):     318.172 s ±  0.730 s    [User: 31750.902 s, System: 4564.246 s]
>   Range (min … max):   317.332 s … 318.662 s    3 runs
>
> @ f35b5d7d676e ("mm: align larger anonymous mappings on THP boundaries"):
>
> Benchmark 1: make -skj128 LLVM=1 allmodconfig all
>   Time (mean ± σ):     406.688 s ±  0.676 s    [User: 31819.526 s, System: 16327.022 s]
>   Range (min … max):   405.954 s … 407.284 s    3 runs

Have you tried to build with gcc?  I want to check whether this is a
clang-specific issue or not.

Best Regards,
Huang, Ying

> That is a pretty big difference (27%), which is visible while doing a
> lot of builds, only because of the extra system time. If there is any
> way to improve this, it should certainly be considered.
>
> For now, I'll just revert it locally.
>
> Cheers,
> Nathan
>
> # bad: [aae703b02f92bde9264366c545e87cec451de471] Merge tag 'for-6.1-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
> # good: [4fe89d07dcc2804c8b562f6c7896a45643d34b2f] Linux 6.0
> git bisect start 'aae703b02f92bde9264366c545e87cec451de471' 'v6.0'
> # good: [18fd049731e67651009f316195da9281b756f2cf] Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
> git bisect good 18fd049731e67651009f316195da9281b756f2cf
> # good: [ab0c23b535f3f9d8345d8ad4c18c0a8594459d55] MAINTAINERS: add RISC-V's patchwork
> git bisect good ab0c23b535f3f9d8345d8ad4c18c0a8594459d55
> # bad: [f721d24e5dae8358b49b24399d27ba5d12a7e049] Merge tag 'pull-tmpfile' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
> git bisect bad f721d24e5dae8358b49b24399d27ba5d12a7e049
> # good: [ada3bfb6492a6d0d3eca50f3b61315fe032efc72] Merge tag 'tpmdd-next-v6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd
> git bisect good ada3bfb6492a6d0d3eca50f3b61315fe032efc72
> # bad: [4e07acdda7fc23f5c4666e54961ef972a1195ffd] mm/hwpoison: add __init/__exit annotations to module init/exit funcs
> git bisect bad 4e07acdda7fc23f5c4666e54961ef972a1195ffd
> # bad: [000a449345bbb4ffbd880f7143b5fb4acac34121] radix tree test suite: add allocation counts and size to kmem_cache
> git bisect bad 000a449345bbb4ffbd880f7143b5fb4acac34121
> # bad: [47d55419951312d723de1b6ad53ee92948b8eace] btrfs: convert process_page_range() to use filemap_get_folios_contig()
> git bisect bad 47d55419951312d723de1b6ad53ee92948b8eace
> # bad: [4d86d4f7227c6f2acfbbbe0623d49865aa71b756] mm: add more BUILD_BUG_ONs to gfp_migratetype()
> git bisect bad 4d86d4f7227c6f2acfbbbe0623d49865aa71b756
> # bad: [816284a3d0e27828b5cc35f3cf539b0711939ce3] userfaultfd: update documentation to describe /dev/userfaultfd
> git bisect bad 816284a3d0e27828b5cc35f3cf539b0711939ce3
> # good: [be6667b0db97e10b2a6d57a906c2c8fd2b985e5e] selftests/vm: dedup hugepage allocation logic
> git bisect good be6667b0db97e10b2a6d57a906c2c8fd2b985e5e
> # bad: [2ace36f0f55777be8a871c370832527e1cd54b15] mm: memory-failure: cleanup try_to_split_thp_page()
> git bisect bad 2ace36f0f55777be8a871c370832527e1cd54b15
> # good: [9d0d946840075e0268f4f77fe39ba0f53e84c7c4] selftests/vm: add selftest to verify multi THP collapse
> git bisect good 9d0d946840075e0268f4f77fe39ba0f53e84c7c4
> # bad: [f35b5d7d676e59e401690b678cd3cfec5e785c23] mm: align larger anonymous mappings on THP boundaries
> git bisect bad f35b5d7d676e59e401690b678cd3cfec5e785c23
> # good: [7b5a0b664ebe2625965a0fdba2614c88c4b9bbc6] mm/page_ext: remove unused variable in offline_page_ext
> git bisect good 7b5a0b664ebe2625965a0fdba2614c88c4b9bbc6
> # first bad commit: [f35b5d7d676e59e401690b678cd3cfec5e785c23] mm: align larger anonymous mappings on THP boundaries