Hi, Nathan,

Thanks for your information!  That's valuable.

Nathan Chancellor <nathan@xxxxxxxxxx> writes:

> Hi Ying,
>
> On Wed, Oct 19, 2022 at 10:05:50AM +0800, Huang, Ying wrote:
>> Hi, Yujie,
>>
>> >      32528 ± 48%    +147.6%      80547 ± 38%  numa-meminfo.node0.AnonHugePages
>> >      92821 ± 23%     +59.3%     147839 ± 28%  numa-meminfo.node0.AnonPages
>>
>> Many more anonymous pages are allocated than with the parent commit.
>> This is expected, because THP instead of normal pages will be
>> allocated for the aligned memory areas.
>>
>> >      95.23           -79.8       15.41 ±  6%  perf-profile.calltrace.cycles-pp.__munmap
>> >      95.08           -79.7       15.40 ±  6%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__munmap
>> >      95.02           -79.6       15.39 ±  6%  perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__munmap
>> >      94.96           -79.6       15.37 ±  6%  perf-profile.calltrace.cycles-pp.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe.__munmap
>> >      94.95           -79.6       15.37 ±  6%  perf-profile.calltrace.cycles-pp.__vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe.__munmap
>> >      94.86           -79.5       15.35 ±  6%  perf-profile.calltrace.cycles-pp.__do_munmap.__vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe
>> >      94.38           -79.2       15.22 ±  6%  perf-profile.calltrace.cycles-pp.unmap_region.__do_munmap.__vm_munmap.__x64_sys_munmap.do_syscall_64
>> >      42.74           -42.7        0.00        perf-profile.calltrace.cycles-pp.lru_add_drain.unmap_region.__do_munmap.__vm_munmap.__x64_sys_munmap
>> >      42.74           -42.7        0.00        perf-profile.calltrace.cycles-pp.lru_add_drain_cpu.lru_add_drain.unmap_region.__do_munmap.__vm_munmap
>> >      42.72           -42.7        0.00        perf-profile.calltrace.cycles-pp.folio_batch_move_lru.lru_add_drain_cpu.lru_add_drain.unmap_region.__do_munmap
>> >      41.84           -41.8        0.00        perf-profile.calltrace.cycles-pp.folio_lruvec_lock_irqsave.folio_batch_move_lru.lru_add_drain_cpu.lru_add_drain.unmap_region
>> >      41.70           -41.7        0.00        perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.lru_add_drain_cpu.lru_add_drain
>> >      41.62           -41.6        0.00        perf-profile.calltrace.cycles-pp.folio_lruvec_lock_irqsave.release_pages.tlb_batch_pages_flush.tlb_finish_mmu.unmap_region
>> >      41.55           -41.6        0.00        perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.lru_add_drain_cpu
>> >      41.52           -41.5        0.00        perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.release_pages.tlb_batch_pages_flush.tlb_finish_mmu
>> >      41.28           -41.3        0.00        perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.release_pages.tlb_batch_pages_flush
>>
>> In the parent commit, most CPU cycles are spent contending on the LRU
>> lock.
>>
>> >       0.00            +4.8        4.82 ±  7%  perf-profile.calltrace.cycles-pp.do_huge_pmd_anonymous_page.__handle_mm_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault
>> >       0.00            +4.9        4.88 ±  7%  perf-profile.calltrace.cycles-pp.zap_huge_pmd.zap_pmd_range.unmap_page_range.unmap_vmas.unmap_region
>> >       0.00            +8.2        8.22 ±  8%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.rmqueue_bulk.rmqueue.get_page_from_freelist
>> >       0.00            +8.2        8.23 ±  8%  perf-profile.calltrace.cycles-pp._raw_spin_lock.rmqueue_bulk.rmqueue.get_page_from_freelist.__alloc_pages
>> >       0.00            +8.3        8.35 ±  8%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.free_pcppages_bulk.free_unref_page.release_pages
>> >       0.00            +8.3        8.35 ±  8%  perf-profile.calltrace.cycles-pp._raw_spin_lock.free_pcppages_bulk.free_unref_page.release_pages.tlb_batch_pages_flush
>> >       0.00            +8.4        8.37 ±  8%  perf-profile.calltrace.cycles-pp.free_pcppages_bulk.free_unref_page.release_pages.tlb_batch_pages_flush.tlb_finish_mmu
>> >       0.00            +9.6        9.60 ±  6%  perf-profile.calltrace.cycles-pp.free_unref_page.release_pages.tlb_batch_pages_flush.tlb_finish_mmu.unmap_region
>> >       0.00           +65.5       65.48 ±  2%  perf-profile.calltrace.cycles-pp.clear_page_erms.clear_huge_page.__do_huge_pmd_anonymous_page.__handle_mm_fault.handle_mm_fault
>> >       0.00           +72.5       72.51 ±  2%  perf-profile.calltrace.cycles-pp.clear_huge_page.__do_huge_pmd_anonymous_page.__handle_mm_fault.handle_mm_fault.do_user_addr_fault
>>
>> With the commit, most CPU cycles are consumed by clearing huge pages.
>> This is expected: we allocate more pages, so we need more cycles to
>> clear them.
>>
>> Checking the source code of the test case (will-it-scale/malloc1), I
>> found that it allocates some memory with malloc(), then frees it.
>>
>> In the parent commit, because the virtual memory address isn't aligned
>> to 2M, normal pages are allocated. With the commit, THP is allocated
>> instead, so there is more page clearing and less LRU lock contention.
>> I think this is the expected behavior of the commit. And the test
>> case's access pattern isn't very common (malloc() then free() without
>> ever accessing the allocated memory), so this regression isn't
>> important. We can just ignore it.
>
> For what it's worth, I just bisected a massive and visible performance
> regression on my Threadripper 3990X workstation to commit f35b5d7d676e
> ("mm: align larger anonymous mappings on THP boundaries"), which seems
> directly related to this report/analysis. I initially noticed this
> because my full set of kernel builds against mainline went from 2 hours
> and 20 minutes or so to over 3 hours.
> Zeroing in on x86_64 allmodconfig, which I used for the bisect:
>
> @ 7b5a0b664ebe ("mm/page_ext: remove unused variable in offline_page_ext"):
>
> Benchmark 1: make -skj128 LLVM=1 allmodconfig all
>   Time (mean ± σ):     318.172 s ±  0.730 s    [User: 31750.902 s, System: 4564.246 s]
>   Range (min … max):   317.332 s … 318.662 s    3 runs
>
> @ f35b5d7d676e ("mm: align larger anonymous mappings on THP boundaries"):
>
> Benchmark 1: make -skj128 LLVM=1 allmodconfig all
>   Time (mean ± σ):     406.688 s ±  0.676 s    [User: 31819.526 s, System: 16327.022 s]
>   Range (min … max):   405.954 s … 407.284 s    3 runs

Have you tried to build with gcc?  I want to check whether this is a
clang-specific issue or not.

Best Regards,
Huang, Ying

> That is a pretty big difference (27%), which is visible while doing a
> lot of builds, only because of the extra system time. If there is any
> way to improve this, it should certainly be considered.
>
> For now, I'll just revert it locally.
>
> Cheers,
> Nathan
>
> # bad: [aae703b02f92bde9264366c545e87cec451de471] Merge tag 'for-6.1-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
> # good: [4fe89d07dcc2804c8b562f6c7896a45643d34b2f] Linux 6.0
> git bisect start 'aae703b02f92bde9264366c545e87cec451de471' 'v6.0'
> # good: [18fd049731e67651009f316195da9281b756f2cf] Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
> git bisect good 18fd049731e67651009f316195da9281b756f2cf
> # good: [ab0c23b535f3f9d8345d8ad4c18c0a8594459d55] MAINTAINERS: add RISC-V's patchwork
> git bisect good ab0c23b535f3f9d8345d8ad4c18c0a8594459d55
> # bad: [f721d24e5dae8358b49b24399d27ba5d12a7e049] Merge tag 'pull-tmpfile' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
> git bisect bad f721d24e5dae8358b49b24399d27ba5d12a7e049
> # good: [ada3bfb6492a6d0d3eca50f3b61315fe032efc72] Merge tag 'tpmdd-next-v6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd
> git bisect good ada3bfb6492a6d0d3eca50f3b61315fe032efc72
> # bad: [4e07acdda7fc23f5c4666e54961ef972a1195ffd] mm/hwpoison: add __init/__exit annotations to module init/exit funcs
> git bisect bad 4e07acdda7fc23f5c4666e54961ef972a1195ffd
> # bad: [000a449345bbb4ffbd880f7143b5fb4acac34121] radix tree test suite: add allocation counts and size to kmem_cache
> git bisect bad 000a449345bbb4ffbd880f7143b5fb4acac34121
> # bad: [47d55419951312d723de1b6ad53ee92948b8eace] btrfs: convert process_page_range() to use filemap_get_folios_contig()
> git bisect bad 47d55419951312d723de1b6ad53ee92948b8eace
> # bad: [4d86d4f7227c6f2acfbbbe0623d49865aa71b756] mm: add more BUILD_BUG_ONs to gfp_migratetype()
> git bisect bad 4d86d4f7227c6f2acfbbbe0623d49865aa71b756
> # bad: [816284a3d0e27828b5cc35f3cf539b0711939ce3] userfaultfd: update documentation to describe /dev/userfaultfd
> git bisect bad 816284a3d0e27828b5cc35f3cf539b0711939ce3
> # good: [be6667b0db97e10b2a6d57a906c2c8fd2b985e5e] selftests/vm: dedup hugepage allocation logic
> git bisect good be6667b0db97e10b2a6d57a906c2c8fd2b985e5e
> # bad: [2ace36f0f55777be8a871c370832527e1cd54b15] mm: memory-failure: cleanup try_to_split_thp_page()
> git bisect bad 2ace36f0f55777be8a871c370832527e1cd54b15
> # good: [9d0d946840075e0268f4f77fe39ba0f53e84c7c4] selftests/vm: add selftest to verify multi THP collapse
> git bisect good 9d0d946840075e0268f4f77fe39ba0f53e84c7c4
> # bad: [f35b5d7d676e59e401690b678cd3cfec5e785c23] mm: align larger anonymous mappings on THP boundaries
> git bisect bad f35b5d7d676e59e401690b678cd3cfec5e785c23
> # good: [7b5a0b664ebe2625965a0fdba2614c88c4b9bbc6] mm/page_ext: remove unused variable in offline_page_ext
> git bisect good 7b5a0b664ebe2625965a0fdba2614c88c4b9bbc6
> # first bad commit: [f35b5d7d676e59e401690b678cd3cfec5e785c23] mm: align larger anonymous mappings on THP boundaries