This series adds clear_page_nt(), a non-temporal MOV (MOVNTI) based
clear_page().

The immediate use case is to speed up creation of large (~2TB) guest
VMs. Memory for these guests is allocated via huge/gigantic pages which
are faulted in early.

The intent behind using non-temporal writes is to minimize allocation of
unnecessary cachelines. This helps in minimizing cache pollution, and
potentially also speeds up zeroing of large extents.

That said, uncached writes are not always a win, as can be seen in these
'perf bench mem memset' numbers comparing clear_page_erms() and
clear_page_nt():

Intel Broadwellx:
              x86-64-stosb (5 runs)     x86-64-movnt (5 runs)      speedup
              -----------------------   -----------------------    -------
     size            BW   (   pstdev)          BW   (   pstdev)

     16MB      17.35 GB/s ( +- 9.27%)    11.83 GB/s ( +- 0.19%)    -31.81%
    128MB       5.31 GB/s ( +- 0.13%)    11.72 GB/s ( +- 0.44%)   +121.84%

AMD Rome:
              x86-64-stosq (5 runs)     x86-64-movnt (5 runs)      speedup
              -----------------------   -----------------------    -------
     size            BW   (   pstdev)          BW   (   pstdev)

     16MB      15.39 GB/s ( +- 9.14%)    14.56 GB/s ( +-19.43%)     -5.39%
    128MB      11.04 GB/s ( +- 4.87%)    14.49 GB/s ( +-13.22%)    +31.25%

Intel Skylakex:
              x86-64-stosb (5 runs)     x86-64-movnt (5 runs)      speedup
              -----------------------   -----------------------    -------
     size            BW   (   pstdev)          BW   (   pstdev)

     16MB      20.38 GB/s ( +- 2.58%)     6.25 GB/s ( +- 0.41%)    -69.28%
    128MB       6.52 GB/s ( +- 0.14%)     6.31 GB/s ( +- 0.47%)     -3.22%

(All of the machines in these tests had a minimum of 25MB L3 cache per
socket.)

There are two performance issues:
 - uncached writes typically perform better only for region sizes
   around or larger than ~LLC-size.
 - MOVNTI does not perform well on all microarchitectures.

We handle the first issue by only using clear_page_nt() for GB pages.

That leaves out page zeroing for 2MB pages, a size that is large enough
that uncached writes might have meaningful cache benefits but at the
same time small enough that uncached writes would end up being slower.

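
To make the two issues above concrete, here is a minimal userspace
sketch, not the code in this series. It zeroes a buffer with the SSE2
_mm_stream_si64() intrinsic (which compiles to MOVNTI on x86-64) and
falls back to memset() otherwise. The names zero_nt(), use_nt(),
zero_region(), the nt_good flag and the 32MB NT_THRESHOLD_BYTES value
are all hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <emmintrin.h>	/* _mm_stream_si64() (MOVNTI), _mm_sfence() */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical cutoff, assumed to be roughly LLC-sized. */
#define NT_THRESHOLD_BYTES	(32UL << 20)

/*
 * Zero len bytes at dst using non-temporal 8-byte stores.
 * dst must be 8-byte aligned and len a multiple of 8.
 */
static void zero_nt(void *dst, size_t len)
{
	long long *p = dst;
	size_t i, n = len / sizeof(*p);

	assert((uintptr_t)dst % sizeof(*p) == 0);
	assert(len % sizeof(*p) == 0);

	for (i = 0; i < n; i++)
		_mm_stream_si64(&p[i], 0);

	/* Non-temporal stores are weakly ordered; fence before returning. */
	_mm_sfence();
}

/*
 * Use the non-temporal path only if the CPU is known to handle MOVNTI
 * well (nt_good) and the region is at least threshold-sized.
 */
static int use_nt(size_t len, int nt_good)
{
	return nt_good && len >= NT_THRESHOLD_BYTES;
}

static void zero_region(void *dst, size_t len, int nt_good)
{
	if (use_nt(len, nt_good))
		zero_nt(dst, len);
	else
		memset(dst, 0, len);
}
```

With these assumed values, zero_region(buf, 1UL << 30, 1) would take the
non-temporal path, while a 2MB region or a CPU with nt_good == 0 would
use the ordinary cached memset().
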
We can handle a subset of the 2MB case -- mmaps with MAP_POPULATE -- by
means of an uncached-or-cached hint decided based on a threshold size.
This would apply to maps backed by any page-size.
This case is not handled in this series -- I wanted to sanity check the
high level approach before attempting that.

We handle the second issue by adding a synthetic cpu-feature,
X86_FEATURE_NT_GOOD, which is only enabled for architectures where
MOVNTI performs well.
(Relatedly, I thought I had independently decided to use ALTERNATIVES
to deal with this, but more likely I had just internalized it from this
discussion:
https://lore.kernel.org/linux-mm/20200316101856.GH11482@xxxxxxxxxxxxxx/#t)

Accordingly this series enables X86_FEATURE_NT_GOOD for Intel Broadwellx
and AMD Rome. (In my testing, the performance was also good for some
pre-production models but this series leaves them out.)

Please review.

Thanks
Ankur

Ankur Arora (8):
  x86/cpuid: add X86_FEATURE_NT_GOOD
  x86/asm: add memset_movnti()
  perf bench: add memset_movnti()
  x86/asm: add clear_page_nt()
  x86/clear_page: add clear_page_uncached()
  mm, clear_huge_page: use clear_page_uncached() for gigantic pages
  x86/cpu/intel: enable X86_FEATURE_NT_GOOD on Intel Broadwellx
  x86/cpu/amd: enable X86_FEATURE_NT_GOOD on AMD Zen

 arch/x86/include/asm/cpufeatures.h           |  1 +
 arch/x86/include/asm/page.h                  |  6 +++
 arch/x86/include/asm/page_32.h               |  9 ++++
 arch/x86/include/asm/page_64.h               | 15 ++++++
 arch/x86/kernel/cpu/amd.c                    |  3 ++
 arch/x86/kernel/cpu/intel.c                  |  2 +
 arch/x86/lib/clear_page_64.S                 | 26 +++++++++++
 arch/x86/lib/memset_64.S                     | 68 ++++++++++++++++------------
 include/asm-generic/page.h                   |  3 ++
 include/linux/highmem.h                      | 10 ++++
 mm/memory.c                                  |  3 +-
 tools/arch/x86/lib/memset_64.S               | 68 ++++++++++++++++------------
 tools/perf/bench/mem-memset-x86-64-asm-def.h |  6 ++-
 13 files changed, 158 insertions(+), 62 deletions(-)

-- 
2.9.3