On 05/12/2023 14:19, Kefeng Wang wrote:
>
> On 2023/12/4 18:20, Ryan Roberts wrote:
>> Hi All,
>>
>> A new week, a new version, a new name... This is v8 of a series to implement
>> multi-size THP (mTHP) for anonymous memory (previously called "small-sized THP"
>> and "large anonymous folios"). Matthew objected to "small huge" so hopefully
>> this fares better.
>>
>> The objective of this is to improve performance by allocating larger chunks of
>> memory during anonymous page faults:
>>
>> 1) Since SW (the kernel) is dealing with larger chunks of memory than base
>>    pages, there are efficiency savings to be had: fewer page faults, batched PTE
>>    and RMAP manipulation, reduced lru list, etc. In short, we reduce kernel
>>    overhead. This should benefit all architectures.
>> 2) Since we are now mapping physically contiguous chunks of memory, we can take
>>    advantage of HW TLB compression techniques. A reduction in TLB pressure
>>    speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
>>    TLB entries: "the contiguous bit" (architectural) and HPA (uarch).
>>
>> This version changes the name and tidies up some of the kernel code and test
>> code, based on feedback against v7 (see change log for details).
>>
>> By default, the existing behaviour (and performance) is maintained. The user
>> must explicitly enable multi-size THP to see the performance benefit. This is
>> done via a new sysfs interface (as recommended by David Hildenbrand - thanks to
>> David for the suggestion)! This interface is inspired by the existing
>> per-hugepage-size sysfs interface used by hugetlb, provides full backwards
>> compatibility with the existing PMD-size THP interface, and provides a base for
>> future extensibility. See [8] for detailed discussion of the interface.
>>
>> This series is based on mm-unstable (715b67adf4c8).
>>
>>
>> Prerequisites
>> =============
>>
>> Some work items identified as being prerequisites are listed on page 3 at [9].
>> The summary is:
>>
>> | item                          | status                  |
>> |:------------------------------|:------------------------|
>> | mlock                         | In mainline (v6.7)      |
>> | madvise                       | In mainline (v6.6)      |
>> | compaction                    | v1 posted [10]          |
>> | numa balancing                | Investigated: see below |
>> | user-triggered page migration | In mainline (v6.7)      |
>> | khugepaged collapse           | In mainline (NOP)       |
>>
>> On NUMA balancing, which currently ignores any PTE-mapped THPs it encounters,
>> John Hubbard has investigated this and concluded that A) it is not clear at the
>> moment what a better policy might be for PTE-mapped THP, and B) he questions
>> whether this should really be considered a prerequisite, given that no
>> regression is caused for the default "multi-size THP disabled" case, and there
>> is no correctness issue when it is enabled - it's just a potential for
>> non-optimal performance.
>>
>> If there are no disagreements about removing numa balancing from the list (none
>> were raised when I first posted this comment against v7), then that just leaves
>> compaction, which is in review on list at the moment.
>>
>> I really would like to get this series (and its remaining compaction
>> prerequisite) in for v6.8. I accept that it may be a bit optimistic at this
>> point, but let's see where we get to with review?
>>
>>
>> Testing
>> =======
>>
>> The series includes patches for mm selftests to enlighten the cow and khugepaged
>> tests to explicitly test with multi-size THP, in the same way that PMD-sized
>> THP is tested.
>> The new tests all pass, and no regressions are observed in the mm
>> selftest suite. I've also run my usual kernel compilation and JavaScript
>> benchmarks without any issues.
>>
>> Refer to my performance numbers posted with v6 [6]. (These are for multi-size
>> THP only - they do not include the arm64 contpte follow-on series).
>>
>> John Hubbard at Nvidia has indicated dramatic 10x performance improvements for
>> some workloads at [11]. (Observed using v6 of this series as well as the arm64
>> contpte series).
>>
>> Kefeng Wang at Huawei has also indicated he sees improvements at [12], although
>> there are some latency regressions also.
>
> Hi Ryan,
>
> Here are some test results based on v6.7-rc1 +
> [PATCH v7 00/10] Small-sized THP for anonymous memory +
> [PATCH v2 00/14] Transparent Contiguous PTEs for User Mappings
>
> case1: basepage 64K
> case2: basepage 4K + thp=64k + PAGE_ALLOC_COSTLY_ORDER = 3
> case3: basepage 4K + thp=64k + PAGE_ALLOC_COSTLY_ORDER = 4

Thanks for sharing these results. With the exception of a few outliers, it
looks like the rough conclusion is that bandwidth improves, but not as much as
with 64K base pages, and latency regresses, but also not as much as with 64K
base pages? I expect that over time, as we add more optimizations, we will get
bandwidth closer to 64K base pages; one crucial one is getting executable
file-backed memory into contpte mappings, for example. It's probably not time
to switch PAGE_ALLOC_COSTLY_ORDER quite yet, but it is something to keep an eye
on and consider down the road?

Thanks,
Ryan

>
> The results are compared with basepage 4K on Kunpeng920.
>
> Note,
> - The test is based on the ext4 filesystem, and THP=2M is disabled.
> - The results were not analyzed; they are for reference only,
>   as some values of the test items are not consistent.
>
> 1) Unixbench 1core
> Index_Values_1core                     case1      case2      case3
> Dhrystone_2_using_register_variables    0.28%      0.39%      0.17%
> Double-Precision_Whetstone             -0.01%      0.00%      0.00%
> Execl_Throughput                      *21.13%*     2.16%      3.01%
> File_Copy_1024_bufsize_2000_maxblocks  -0.51%     *8.33%*    *8.76%*
> File_Copy_256_bufsize_500_maxblocks     0.78%    *11.89%*   *10.85%*
> File_Copy_4096_bufsize_8000_maxblocks   7.42%      7.27%    *10.66%*
> Pipe_Throughput                        -0.24%     *6.82%*    *5.08%*
> Pipe-based_Context_Switching            1.38%    *13.49%*    *9.91%*
> Process_Creation                      *32.46%*     4.30%     *8.54%*
> Shell_Scripts_(1_concurrent)          *31.67%*     1.92%      2.60%
> Shell_Scripts_(8_concurrent)          *40.59%*     1.30%     *5.29%*
> System_Call_Overhead                    3.92%     *8.13%*     2.96%
>
> System_Benchmarks_Index_Score          10.66%      5.39%      5.58%
>
> For 1core,
> - case1 wins on Execl_Throughput/Process_Creation/Shell_Scripts
>   by a lot, and scores 10.66% higher vs basepage 4K.
> - case2/3 win on File_Copy/Pipe and score 5%+ higher than basepage 4K;
>   case3 also looks better on Shell_Scripts_(8_concurrent) than case2.
>
> 2) Unixbench 128core
> Index_Values_128core                   case1      case2      case3
> Dhrystone_2_using_register_variables    2.07%     -0.03%     -0.11%
> Double-Precision_Whetstone             -0.03%      0.00%      0.00%
> Execl_Throughput                      *39.28%*    -4.23%      1.93%
> File_Copy_1024_bufsize_2000_maxblocks   5.46%      1.30%      4.20%
> File_Copy_256_bufsize_500_maxblocks    -8.89%     *6.56%*    *5.02%*
> File_Copy_4096_bufsize_8000_maxblocks   3.43%    *-5.46%*     0.56%
> Pipe_Throughput                         3.80%     *7.69%*    *7.80%*
> Pipe-based_Context_Switching           *7.62%*     0.95%      4.69%
> Process_Creation                      *28.11%*    -2.79%      2.40%
> Shell_Scripts_(1_concurrent)          *39.68%*     1.86%     *5.30%*
> Shell_Scripts_(8_concurrent)          *41.35%*     2.49%     *7.16%*
> System_Call_Overhead                   -1.55%     -0.04%     *8.23%*
>
> System_Benchmarks_Index_Score          12.08%      0.63%      3.88%
>
> For 128core,
> - case1 wins on Execl_Throughput/Process_Creation/Shell_Scripts
>   by a lot, is also good at Pipe-based_Context_Switching, and scores
>   12.08% higher vs basepage 4K.
> - case2/case3 win on File_Copy_256/Pipe_Throughput, but case2 is
>   not better than basepage 4K overall; case3 wins by 3.88%.
>
> 3) Lmbench Processor_processes
> Processor_Processes    case1        case2       case3
> null_call               1.76%        0.40%       0.65%
> null_io                -0.76%       -0.38%      -0.23%
> stat                 *-16.09%*    *-12.49%*      4.22%
> open_close              -2.69%       4.51%       3.21%
> slct_TCP                -0.56%       0.00%      -0.44%
> sig_inst                -1.54%       0.73%       0.70%
> sig_hndl                -2.85%       0.01%       1.85%
> fork_proc             *23.31%*       8.77%      -5.42%
> exec_proc             *13.22%*      -0.30%       1.09%
> sh_proc               *14.04%*      -0.10%       1.09%
>
> - case1 is much better than basepage 4K, same as in the Unixbench test;
>   case2 is better on fork_proc, but case3 is worse
> - note: the variance of fork/exec/sh is bigger than for the others
>
> 4) Lmbench Context_switching_ctxsw
> Context_switching_ctxsw    case1      case2      case3
> 2p/0K                    -12.16%     -5.29%     -1.86%
> 2p/16K                   -11.26%     -3.71%     -4.53%
> 2p/64K                    -2.60%      3.84%     -1.98%
> 8p/16K                    -7.56%     -1.21%     -0.88%
> 8p/64K                     5.10%      4.88%      1.19%
> 16p/16K                   -5.81%     -2.44%     -3.84%
> 16p/64K                    4.29%     -1.94%     -2.50%
> - case1/2/3 are worse than basepage 4K, and case1 is the worst.
>
> 5) Lmbench Local_latencies
> Local_latencies     case1        case2        case3
> Pipe                -9.23%       0.58%       -4.34%
> AF_UNIX             -5.34%      -1.76%        3.03%
> UDP                 -6.70%      -5.96%       -9.81%
> TCP                 -7.95%      -7.58%       -5.63%
> TCP_conn          -213.99%    -227.78%     -659.67%
> - TCP_conn is very unreliable; ignore it
> - case1/2/3 are slower than basepage 4K
>
> 6) Lmbench File_&_VM_latencies
> File_&_VM_latencies     case1        case2       case3
> 10K_File_Create          2.60%       -0.52%       2.66%
> 10K_File_Delete         -2.91%       -5.20%      -2.11%
> 10K_File_Create         10.23%        1.18%       0.12%
> 10K_File_Delete        -17.76%       -2.97%      -1.49%
> Mmap_Latency           *63.05%*       2.57%      -0.96%
> Prot_Fault              10.41%       -3.21%    *-19.11%*
> Page_Fault            *-132.01%*      2.35%      -0.79%
> 100fd_selct             -1.20%        0.10%       0.31%
> - case1 is very good at Mmap_Latency and not good at Page_Fault
> - case2/3 are slower on Prot_Fault/10K_File_Delete vs basepage 4K;
>   the rest doesn't look much different.
>
> 7) Lmbench Local_bandwidths
> Local_bandwidths     case1      case2      case3
> Pipe                265.22%     15.44%     11.33%
> AF_UNIX              13.41%     -2.66%      2.63%
> TCP                  -1.30%     25.90%      2.48%
> File_reread          14.79%     31.52%    -14.16%
> Mmap_reread          27.47%     49.00%     -0.11%
> Bcopy(libc)           2.58%      2.45%      2.46%
> Bcopy(hand)          25.78%     22.56%     22.68%
> Mem_read             38.26%     36.80%     36.49%
> Mem_write            10.93%      3.44%      3.12%
>
> - case1 is very good at bandwidth; case2 is better than basepage 4K
>   but lower than case1; case3 is bad at File_reread
>
> 8) Lmbench Memory_latencies
> Memory_latencies     case1      case2      case3
> L1_$                  0.02%      0.00%     -0.03%
> L2_$                 -1.56%     -2.65%     -1.25%
> Main_mem             50.82%     32.51%     33.47%
> Rand_mem             15.29%     -8.79%     -8.80%
>
> - case1 is also good at Main/Rand mem access latencies
> - case2/case3 are better at Main_mem, but worse at Rand_mem.
>
> Tested-by: Kefeng Wang <wangkefeng.wang@xxxxxxxxxx>
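
(For reference, and not part of the series itself: below is a minimal sketch of
how the 64K size used in the case2/case3 configs above might be enabled and
exercised from a small userspace test, assuming the per-size sysfs interface
described in [8], i.e. /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
accepting always/inherit/madvise/never. The write_str() helper, the 16M buffer
and the touch loop are illustrative only, and writing the knob needs root.)

/*
 * Sketch only: enable the 64K mTHP size via the per-size sysfs knob, then
 * fault in an anonymous region. With the knob set to "always", faults on an
 * eligible VMA should be served with 64K folios rather than single 4K pages.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define MTHP_64K_ENABLED \
	"/sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled"

/* Write a policy string ("always"/"inherit"/"madvise"/"never") to a knob. */
static int write_str(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);
	ssize_t ret;

	if (fd < 0) {
		perror(path);
		return -1;
	}
	ret = write(fd, val, strlen(val));
	close(fd);
	return ret < 0 ? -1 : 0;
}

int main(void)
{
	size_t len = 16 * 1024 * 1024;	/* 16M of anonymous memory */
	char *buf;
	size_t i;

	if (write_str(MTHP_64K_ENABLED, "always"))
		return EXIT_FAILURE;

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return EXIT_FAILURE;
	}

	/* Touch one byte per 4K page; the fault path decides the folio size. */
	for (i = 0; i < len; i += 4096)
		buf[i] = 1;

	/* The resulting mappings can be inspected via /proc/self/smaps here. */
	munmap(buf, len);
	return EXIT_SUCCESS;
}

The same pattern applies to any other hugepages-<size>kB directory the
interface exposes; only the knob path changes.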