On 11/7/24 15:10, kernel test robot wrote:
>
>
> Hello,
>
> kernel test robot noticed a 3888.9% improvement of will-it-scale.per_process_ops on:
>
>
> commit: d4148aeab412432bf928f311eca8a2ba52bb05df ("mm, mmap: limit THP alignment of anonymous mappings to PMD-aligned sizes")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>
> [test failed on linux-next/master 5b913f5d7d7fe0f567dea8605f21da6eaa1735fb]
>
> testcase: will-it-scale
> config: x86_64-rhel-8.3
> compiler: gcc-12
> test machine: 224 threads 4 sockets Intel(R) Xeon(R) Platinum 8380H CPU @ 2.90GHz (Cooper Lake) with 192G memory
> parameters:
>
> 	nr_task: 100%
> 	mode: process
> 	test: malloc1
> 	cpufreq_governor: performance

Since this report is now linked from YouTube videos with 200k views, maybe
someone will appreciate reading about what's actually happening here, in more
detail than would normally be sufficient for linux-mm. Thanks to Pedro, who
initially brought up on IRC what malloc1 does.

The test [1] just repeatedly performs a malloc()/free() of a 128MB buffer:

#define SIZE (128UL * 1024 * 1024)

	while (1) {
		void *addr = malloc(SIZE);
		assert(addr != NULL);
		free(addr);
	}

We can compile the same code outside of the test, without the iterations, but
printing the addr and pausing the process so we can look at /proc/$pid/smaps.

Before the commit was applied I get:

addr: 0x7f0f62a00010

And the corresponding area in smaps is:

7f0f62a00000-7f0f6aa01000 rw-p 00000000 00:00 0
Size:             131076 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                2048 kB
Pss:                2048 kB
Pss_Dirty:          2048 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:      2048 kB
Referenced:         2048 kB
Anonymous:          2048 kB
KSM:                   0 kB
LazyFree:              0 kB
AnonHugePages:      2048 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:           1

We can notice the 128MB became 128MB+4kB, and the address returned by
malloc() is offset by 16 bytes (0x10) from the start of the mmapped area.
So we have:

7f0f62a00000 - start of mmapped area
7f0f62a00010 - start of malloc()-returned buffer
7f0f6aa00010 - end of malloc()'d buffer (128MB)
7f0f6aa01000 - end of mmapped area

malloc() AFAIK would normally manage some large-ish arena allocated by a
single mmap() and hand out addresses for smaller allocations from that, but
any request as large as 128MB is turned directly into its own mmap(). However,
malloc() has to remember how large the allocation was in order to turn
free(addr) into an appropriate munmap(addr, size), so it prepends a 16-byte
header to store the size. And since any mmap() has to be rounded up to page
size, these 16 bytes become 4kB, so the mmap() is not exactly 128MB but 4kB
larger.

Before commit d4148aeab412, due to commit efa7df3e3bb5, any anonymous mmap()
of size >= 2MB would have the start of the area aligned to 2MB, which we can
see from the address 7f0f62a00000 being divisible by 2MB. The fields
"Rss: 2048 kB" and "AnonHugePages: 2048 kB" above also reveal that when
malloc() wrote its 16-byte header, a transparent hugepage (THP) was allocated
for the first 2MB of the area.
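In case anyone wants to try this at home, a minimal standalone version of the
above experiment could look like this. It's my own sketch, not the actual
will-it-scale source; the printf()/pause() parts are only there so
/proc/$pid/smaps can be inspected while the process sleeps:

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define SIZE (128UL * 1024 * 1024)

int main(void)
{
	/* Large enough that glibc turns it into its own mmap(). */
	void *addr = malloc(SIZE);

	assert(addr != NULL);
	/*
	 * We never touch the buffer ourselves; only the page holding
	 * glibc's 16-byte size header gets faulted in.
	 */
	printf("addr: %p\n", addr);
	printf("inspect /proc/%d/smaps\n", (int)getpid());
	pause();	/* keep the mapping alive for inspection */
	free(addr);
	return 0;
}

Run it, note the printed address, and compare with the corresponding rw-p
anonymous area in smaps, as shown above.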
After commit d4148aeab412 we get:

addr: 0x7fd7b9998010

7fd7b9998000-7fd7c199c000 rw-p 00000000 00:00 0
Size:             131088 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                  12 kB
Pss:                  12 kB
Pss_Dirty:            12 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:        12 kB
Referenced:           12 kB
Anonymous:            12 kB
KSM:                   0 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:           1

Again the mmap() includes extra room for the 16-byte header, but the area is
no longer aligned to 2MB, because the kernel no longer applies the THP
alignment when the size of the area is not a multiple of 2MB. Thus the header
cannot be backed by a 2MB THP, only by a 4kB base page.

And that's the whole difference. The test allocates 128MB via malloc() but
doesn't actually touch the memory at all. So only the page with the header is
faulted in, and obviously it's much faster to allocate and clear (write with
zeroes, as the kernel has to do for userspace pages) a 4kB page than a 2MB
one. But that's an artifact of the benchmark - we can assume a program that
allocates memory with malloc() would also use it, and then most of the
128MB + 4kB would get backed with THPs anyway, and the difference would
largely disappear.

You might also be wondering - if an improvement was reported for malloc1 due
to d4148aeab412432, shouldn't a regression have been reported in the past for
efa7df3e3bb5? In fact it was - multiple regressions, actually, for various
prior versions of the patch that eventually became commit efa7df3e3bb5 - but
this particular one in malloc1 was (rightfully) dismissed as not important
enough [2].

[1] https://github.com/antonblanchard/will-it-scale/blob/master/tests/malloc1.c
[2] https://lore.kernel.org/all/87edv4r2ip.fsf@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/

> In addition to that, the commit also has significant impact on the following tests:
>
> +------------------+---------------------------------------------------------------------------------------------+
> | testcase: change | stress-ng: stress-ng.radixsort.ops_per_sec 9.2% regression                                  |

BTW, nobody seems to have noticed this part, which looks like a regression
report :)

> | test machine     | 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480CTDX (Sapphire Rapids) with 256G memory |
> | test parameters  | cpufreq_governor=performance                                                                 |
> |                  | nr_threads=100%                                                                              |
> |                  | test=radixsort                                                                               |
> |                  | testtime=60s                                                                                 |
> +------------------+---------------------------------------------------------------------------------------------+
>
>
>
>
> Details are as below:
> -------------------------------------------------------------------------------------------------->
>
>
> The kernel config and materials to reproduce are available at:
> https://download.01.org/0day-ci/archive/20241107/202411072132.a8d2cf0f-oliver.sang@xxxxxxxxx
>
> =========================================================================================
> compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
>   gcc-12/performance/x86_64-rhel-8.3/process/100%/debian-12-x86_64-20240206.cgz/lkp-cpl-4sp2/malloc1/will-it-scale
>
> commit:
>   15e8156713 ("mm: shrinker: avoid memleak in alloc_shrinker_info")
>   d4148aeab4 ("mm, mmap: limit THP alignment of anonymous mappings to PMD-aligned sizes")
>
> 15e8156713cc3803 d4148aeab412432bf928f311eca
> ---------------- ---------------------------
>          %stddev     %change         %stddev
>              \          |                \

<snip>

>      89.46           -89.3        0.20 ± 22%  perf-profile.children.cycles-pp.clear_page_erms

This confirms we are clearing much less memory (4kB instead of 2MB), because
we spend way less time in the clearing function.

> ***************************************************************************************************
> lkp-spr-r02: 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480CTDX (Sapphire Rapids) with 256G memory
> =========================================================================================
> compiler/cpufreq_governor/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
>   gcc-12/performance/x86_64-rhel-8.3/100%/debian-12-x86_64-20240206.cgz/lkp-spr-r02/radixsort/stress-ng/60s

And this is the regression report of another benchmark, which nobody noticed.

> commit:
>   15e8156713 ("mm: shrinker: avoid memleak in alloc_shrinker_info")
>   d4148aeab4 ("mm, mmap: limit THP alignment of anonymous mappings to PMD-aligned sizes")
>
> 15e8156713cc3803 d4148aeab412432bf928f311eca
> ---------------- ---------------------------
>          %stddev     %change         %stddev
>              \          |                \
>      18.98 ± 5%      +30.5%      24.78 ± 14%  sched_debug.cpu.clock.stddev
>       0.05 ± 7%     +886.3%       0.50 ± 86%  perf-sched.sch_delay.avg.ms.__cond_resched.__wait_for_common.affine_move_task.__set_cpus_allowed_ptr.__sched_setaffinity
>       1.14 ± 74%     -55.5%       0.51 ± 37%  perf-sched.sch_delay.avg.ms.__x64_sys_pause.do_syscall_64.entry_SYSCALL_64_after_hwframe.[unknown]
>     265008 ± 13%     -37.2%     166403 ± 22%  meminfo.Active
>     264976 ± 13%     -37.2%     166371 ± 22%  meminfo.Active(anon)
>    1315990           -63.8%     475807 ± 3%   meminfo.AnonHugePages

So we get fewer THPs, but unlike the malloc1 test, this one seems to be
benefiting from them, so having fewer of them causes a regression. Maybe it
does mmap()'s of some unfortunate size like 3MB? Before commit d4148aeab412,
such a mmap() would be guaranteed a THP for the first 2MB and no THPs for the
remaining 1MB, i.e.:

| [ 2 MB | 1 MB ]

But now the result can be [ 1.5MB | 1.5 MB ] (where | is a 2MB boundary),
hence no THP backing at all. This probably just shows that no heuristic can
be optimal for every possible use case.
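The arithmetic behind that picture can be spelled out in a few lines. This is
just a sketch of the heuristic's effect, not kernel code, and thp_capacity()
is a made-up helper: an anonymous range can only be backed by PMD-sized (2MB)
THPs where a naturally 2MB-aligned 2MB block fits entirely inside it.

#include <stdio.h>

#define PMD_SIZE (2UL * 1024 * 1024)

/* How many naturally aligned 2MB blocks fit inside [start, start + len)? */
static unsigned long thp_capacity(unsigned long start, unsigned long len)
{
	unsigned long first = (start + PMD_SIZE - 1) & ~(PMD_SIZE - 1);
	unsigned long last = (start + len) & ~(PMD_SIZE - 1);

	return last > first ? (last - first) / PMD_SIZE : 0;
}

int main(void)
{
	unsigned long len = 3UL * 1024 * 1024;	/* the unfortunate 3MB mmap() */
	unsigned long base = 0x7f0000000000UL;	/* arbitrary 2MB-aligned address */

	/* Old behavior (efa7df3e3bb5): start aligned to 2MB -> 1 THP. */
	printf("aligned:   %lu THPs\n", thp_capacity(base, len));
	/*
	 * New behavior: the start may land e.g. 0.5MB before a 2MB boundary,
	 * giving the [ 1.5MB | 1.5MB ] split from above -> 0 THPs.
	 */
	printf("unaligned: %lu THPs\n", thp_capacity(base + 0x80000UL, len));
	return 0;
}

So for a 3MB mapping the placement can mean the difference between one THP
and none, which would be consistent with the AnonHugePages drop above.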
>     164713            -9.2%     149574        stress-ng.radixsort.ops
>       2745            -9.2%       2492        stress-ng.radixsort.ops_per_sec
>      78925          +289.9%     307739        stress-ng.time.minor_page_faults
>     327.88 ± 3%      -67.1%     107.89 ± 3%   numa-vmstat.node0.nr_anon_transparent_hugepages
>      63711 ± 11%     -40.7%      37791 ± 25%  numa-vmstat.node1.nr_active_anon
>     317.83 ± 3%      -60.5%     125.70 ± 8%   numa-vmstat.node1.nr_anon_transparent_hugepages
>     293623 ± 18%     +35.1%     396637 ± 14%  numa-vmstat.node1.nr_inactive_anon
>      63710 ± 11%     -40.7%      37791 ± 25%  numa-vmstat.node1.nr_zone_active_anon
>     293622 ± 18%     +35.1%     396635 ± 14%  numa-vmstat.node1.nr_zone_inactive_anon
>     668445 ± 3%      -67.1%     220222 ± 3%   numa-meminfo.node0.AnonHugePages
>    1317034 ± 13%     -23.2%    1010970 ± 16%  numa-meminfo.node0.AnonPages.max
>     254241 ± 11%     -40.5%     151233 ± 26%  numa-meminfo.node1.Active
>     254219 ± 11%     -40.5%     151227 ± 26%  numa-meminfo.node1.Active(anon)
>     647491 ± 3%      -60.4%     256719 ± 8%   numa-meminfo.node1.AnonHugePages
>    1170941 ± 18%     +35.3%    1583801 ± 14%  numa-meminfo.node1.Inactive
>    1170783 ± 18%     +35.3%    1583760 ± 14%  numa-meminfo.node1.Inactive(anon)
>      66923 ± 13%     -36.6%      42442 ± 22%  proc-vmstat.nr_active_anon
>     642.60           -63.8%     232.40 ± 3%   proc-vmstat.nr_anon_transparent_hugepages
>     597472 ± 2%       +5.6%     630776        proc-vmstat.nr_inactive_anon
>      66923 ± 13%     -36.6%      42442 ± 22%  proc-vmstat.nr_zone_active_anon
>     597472 ± 2%       +5.6%     630776        proc-vmstat.nr_zone_inactive_anon
>    1188533           +19.2%    1416356        proc-vmstat.numa_hit
>       2502           -63.5%     913.67        proc-vmstat.numa_huge_pte_updates
>     956729           +23.2%    1178316        proc-vmstat.numa_local
>    1495630 ± 2%      -54.0%     687360 ± 6%   proc-vmstat.numa_pte_updates
>     820025           +26.8%    1040066        proc-vmstat.pgfault
>       6.20            +9.6%       6.79        perf-stat.i.MPKI
>  4.203e+10            -6.8%  3.916e+10        perf-stat.i.branch-instructions
>       6.15            +0.2        6.30        perf-stat.i.branch-miss-rate%
>  2.636e+09            -6.2%  2.474e+09        perf-stat.i.branch-misses
>       2.56 ± 2%       +8.5%       2.77        perf-stat.i.cpi
>  2.444e+11            -6.9%  2.276e+11        perf-stat.i.instructions
>       0.40            -9.1%       0.36        perf-stat.i.ipc
>       2.20 ± 2%      +90.0%       4.18 ± 2%   perf-stat.i.metric.K/sec
>      10808           +45.4%      15713 ± 2%   perf-stat.i.minor-faults
>      10809           +45.4%      15713 ± 2%   perf-stat.i.page-faults
>       6.34            +7.9%       6.84        perf-stat.overall.MPKI
>      37.76            -0.6       37.12        perf-stat.overall.cache-miss-rate%
>       2.54            +9.9%       2.80        perf-stat.overall.cpi
>     401.20            +1.8%     408.57        perf-stat.overall.cycles-between-cache-misses
>       0.39            -9.0%       0.36        perf-stat.overall.ipc
>  4.134e+10 ± 2%       -7.4%  3.829e+10        perf-stat.ps.branch-instructions
>  2.592e+09            -6.7%  2.419e+09        perf-stat.ps.branch-misses
>  2.404e+11 ± 2%       -7.4%  2.226e+11        perf-stat.ps.instructions
>      10358           +36.2%      14109 ± 2%   perf-stat.ps.minor-faults
>      10358           +36.2%      14109 ± 2%   perf-stat.ps.page-faults
>  1.525e+13            -9.9%  1.374e+13        perf-stat.total.instructions
>       0.60 ± 10%      -0.3        0.28 ±100%  perf-profile.calltrace.cycles-pp.__perf_mmap__read_init.perf_mmap__read_init.perf_mmap__push.record__mmap_read_evlist.__cmd_record
>       0.71 ± 4%       +0.1        0.77 ± 3%   perf-profile.calltrace.cycles-pp.update_load_avg.task_tick_fair.sched_tick.update_process_times.tick_nohz_handler
>       1.25 ± 9%       +0.3        1.54 ± 4%   perf-profile.calltrace.cycles-pp.update_curr.task_tick_fair.sched_tick.update_process_times.tick_nohz_handler
>       0.26 ±100%      +0.3        0.57 ± 3%   perf-profile.calltrace.cycles-pp.__irq_exit_rcu.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt
>       2.64 ± 6%       +0.4        3.09 ± 3%   perf-profile.calltrace.cycles-pp.task_tick_fair.sched_tick.update_process_times.tick_nohz_handler.__hrtimer_run_queues
>       0.08 ±223%      +0.5        0.55 ± 4%   perf-profile.calltrace.cycles-pp.account_user_time.update_process_times.tick_nohz_handler.__hrtimer_run_queues.hrtimer_interrupt
>       5.00 ± 6%       +0.7        5.75 ± 3%   perf-profile.calltrace.cycles-pp.sched_tick.update_process_times.tick_nohz_handler.__hrtimer_run_queues.hrtimer_interrupt
>       7.66 ± 5%       +1.1        8.72 ± 3%   perf-profile.calltrace.cycles-pp.update_process_times.tick_nohz_handler.__hrtimer_run_queues.hrtimer_interrupt.__sysvec_apic_timer_interrupt
>       8.10 ± 5%       +1.1        9.18 ± 3%   perf-profile.calltrace.cycles-pp.tick_nohz_handler.__hrtimer_run_queues.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt
>       8.38 ± 5%       +1.1        9.51 ± 3%   perf-profile.calltrace.cycles-pp.__hrtimer_run_queues.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt
>      10.78 ± 5%       +1.3       12.08 ± 3%   perf-profile.calltrace.cycles-pp.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt
>      11.04 ± 5%       +1.3       12.36 ± 3%   perf-profile.calltrace.cycles-pp.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt
>      11.77 ± 5%       +1.4       13.19 ± 2%   perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt
>      12.68 ± 5%       +1.6       14.23 ± 2%   perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt
>      12.84            -0.4       12.47 ± 2%   perf-profile.children.cycles-pp.strcmp@plt
>       0.65 ± 10%      -0.2        0.50 ± 30%  perf-profile.children.cycles-pp.__perf_mmap__read_init
>       0.12 ± 3%       +0.0        0.14 ± 4%   perf-profile.children.cycles-pp.sched_clock
>       0.05 ± 8%       +0.0        0.08 ± 8%   perf-profile.children.cycles-pp.rb_next
>       0.09 ± 7%       +0.0        0.13 ± 10%  perf-profile.children.cycles-pp.timerqueue_del
>       0.20 ± 4%       +0.0        0.23 ± 4%   perf-profile.children.cycles-pp.lapic_next_deadline
>       0.40 ± 7%       +0.0        0.44 ± 5%   perf-profile.children.cycles-pp.handle_softirqs
>       0.42 ± 4%       +0.0        0.47 ± 5%   perf-profile.children.cycles-pp.native_irq_return_iret
>       0.20 ± 9%       +0.1        0.26 ± 6%   perf-profile.children.cycles-pp.__cgroup_account_cputime_field
>       0.12 ± 4%       +0.1        0.19 ± 10%  perf-profile.children.cycles-pp._raw_spin_lock
>       0.75 ± 4%       +0.1        0.82 ± 3%   perf-profile.children.cycles-pp.update_load_avg
>       0.40 ± 4%       +0.1        0.47 ± 2%   perf-profile.children.cycles-pp.hrtimer_active
>       0.34 ± 9%       +0.1        0.41 ± 5%   perf-profile.children.cycles-pp.irqtime_account_irq
>       0.52 ± 5%       +0.1        0.60 ± 3%   perf-profile.children.cycles-pp.__irq_exit_rcu
>       0.12 ± 9%       +0.1        0.20 ± 6%   perf-profile.children.cycles-pp.perf_trace_sched_stat_runtime
>       0.36 ± 2%       +0.1        0.44 ± 3%   perf-profile.children.cycles-pp.task_mm_cid_work
>       0.50 ± 6%       +0.1        0.58 ± 4%   perf-profile.children.cycles-pp.account_user_time
>       0.38 ± 2%       +0.1        0.48 ± 3%   perf-profile.children.cycles-pp.task_work_run
>       0.46 ± 2%       +0.1        0.56 ± 3%   perf-profile.children.cycles-pp.irqentry_exit_to_user_mode
>       1.32 ± 9%       +0.3        1.64 ± 4%   perf-profile.children.cycles-pp.update_curr
>       2.76 ± 6%       +0.5        3.25 ± 3%   perf-profile.children.cycles-pp.task_tick_fair
>       5.24 ± 6%       +0.8        6.04 ± 3%   perf-profile.children.cycles-pp.sched_tick
>       7.99 ± 5%       +1.2        9.14 ± 3%   perf-profile.children.cycles-pp.update_process_times
>       8.45 ± 5%       +1.2        9.63 ± 3%   perf-profile.children.cycles-pp.tick_nohz_handler
>       8.75 ± 5%       +1.2        9.99 ± 3%   perf-profile.children.cycles-pp.__hrtimer_run_queues
>      11.23 ± 5%       +1.4       12.66 ± 3%   perf-profile.children.cycles-pp.hrtimer_interrupt
>      11.50 ± 5%       +1.4       12.94 ± 3%   perf-profile.children.cycles-pp.__sysvec_apic_timer_interrupt
>      12.26 ± 5%       +1.6       13.82 ± 3%   perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
>      13.22 ± 5%       +1.7       14.92 ± 2%   perf-profile.children.cycles-pp.asm_sysvec_apic_timer_interrupt
>       6.12            -0.2        5.92 ± 2%   perf-profile.self.cycles-pp.strcmp@plt
>       0.09 ± 11%      +0.0        0.11 ± 6%   perf-profile.self.cycles-pp.__hrtimer_run_queues
>       0.10 ± 9%       +0.0        0.12 ± 5%   perf-profile.self.cycles-pp.task_tick_fair
>       0.05 ± 8%       +0.0        0.08 ± 10%  perf-profile.self.cycles-pp.rb_next
>       0.20 ± 4%       +0.0        0.23 ± 3%   perf-profile.self.cycles-pp.lapic_next_deadline
>       0.42 ± 4%       +0.0        0.47 ± 5%   perf-profile.self.cycles-pp.native_irq_return_iret
>       0.16 ± 14%      +0.1        0.21 ± 4%   perf-profile.self.cycles-pp.__cgroup_account_cputime_field
>       0.12 ± 4%       +0.1        0.18 ± 10%  perf-profile.self.cycles-pp._raw_spin_lock
>       0.24 ± 9%       +0.1        0.30 ± 7%   perf-profile.self.cycles-pp.irqtime_account_irq
>       0.40 ± 4%       +0.1        0.46 ± 3%   perf-profile.self.cycles-pp.hrtimer_active
>       0.10 ± 3%       +0.1        0.18 ± 8%   perf-profile.self.cycles-pp.hrtimer_interrupt
>       0.11 ± 11%      +0.1        0.20 ± 6%   perf-profile.self.cycles-pp.perf_trace_sched_stat_runtime
>       0.33 ± 3%       +0.1        0.42 ± 4%   perf-profile.self.cycles-pp.task_mm_cid_work
>       0.88 ± 11%      +0.2        1.05 ± 5%   perf-profile.self.cycles-pp.update_curr
>
>
>
>
>
> Disclaimer:
> Results have been estimated based on internal Intel analysis and are provided
> for informational purposes only. Any difference in system hardware or software
> design or configuration may affect actual performance.
>
>