On Mon, Jun 6, 2022 at 11:39 PM Ankur Arora <ankur.a.arora@xxxxxxxxxx> wrote:
>
> Add clear_pages_movnt(), which uses MOVNTI as the underlying primitive.
> With this, page-clearing can skip the memory hierarchy, thus providing
> a non cache-polluting implementation of clear_pages().
>
> MOVNTI, from the Intel SDM, Volume 2B, 4-101:
> "The non-temporal hint is implemented by using a write combining (WC)
> memory type protocol when writing the data to memory. Using this
> protocol, the processor does not write the data into the cache
> hierarchy, nor does it fetch the corresponding cache line from memory
> into the cache hierarchy."
>
> The AMD Arch Manual has something similar to say as well.
>
> One use-case is to zero large extents without bringing in never-to-be-
> accessed cachelines. Also, often clear_pages_movnt() based clearing is
> faster once extent sizes are O(LLC-size).
>
> As the excerpt notes, MOVNTI is weakly ordered with respect to other
> instructions operating on the memory hierarchy. This needs to be
> handled by the caller by executing an SFENCE when done.
>
> The implementation is straight-forward: unroll the inner loop to keep
> the code similar to memset_movnti(), so that we can gauge
> clear_pages_movnt() performance via perf bench mem memset.
>
> # Intel Icelakex
> # Performance comparison of 'perf bench mem memset -l 1' for x86-64-stosb
> # (X86_FEATURE_ERMS) and x86-64-movnt:
>
> System:      Oracle X9-2 (2 nodes * 32 cores * 2 threads)
> Processor:   Intel Xeon(R) Platinum 8358 CPU @ 2.60GHz (Icelakex, 6:106:6)
> Memory:      512 GB evenly split between nodes
> LLC-size:    48MB for each node (32-cores * 2-threads)
> no_turbo: 1, Microcode: 0xd0001e0, scaling-governor: performance
>
>               x86-64-stosb (5 runs)     x86-64-movnt (5 runs)    Delta(%)
>              ----------------------    ---------------------    --------
>      size            BW   (   stdev)          BW    (   stdev)
>
>       2MB      14.37 GB/s ( +- 1.55)     12.59 GB/s ( +- 1.20)    -12.38%
>      16MB      16.93 GB/s ( +- 2.61)     15.91 GB/s ( +- 2.74)     -6.02%
>     128MB      12.12 GB/s ( +- 1.06)     22.33 GB/s ( +- 1.84)    +84.24%
>    1024MB      12.12 GB/s ( +- 0.02)     23.92 GB/s ( +- 0.14)    +97.35%
>    4096MB      12.08 GB/s ( +- 0.02)     23.98 GB/s ( +- 0.18)    +98.50%

For these sizes it may be worth it to save/restore an xmm register to do
the memset:

Just on my Tigerlake laptop:
model name : 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz

              movntdq xmm (5 runs)       movnti GPR (5 runs)      Delta(%)
             -----------------------    -----------------------
     size    BW GB/s ( +- stdev)        BW GB/s ( +- stdev)          %
     2 MB    35.71 GB/s ( +- 1.02)      34.62 GB/s ( +- 0.77)      -3.15%
    16 MB    36.43 GB/s ( +- 0.35)      31.3 GB/s ( +- 0.1)       -16.39%
   128 MB    35.6 GB/s ( +- 0.83)       30.82 GB/s ( +- 0.08)     -15.5%
  1024 MB    36.85 GB/s ( +- 0.26)      30.71 GB/s ( +- 0.2)      -20.0%

>
> Signed-off-by: Ankur Arora <ankur.a.arora@xxxxxxxxxx>
> ---
>  arch/x86/include/asm/page_64.h |  1 +
>  arch/x86/lib/clear_page_64.S   | 21 +++++++++++++++++++++
>  2 files changed, 22 insertions(+)
>
> diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
> index a88a3508888a..3affc4ecb8da 100644
> --- a/arch/x86/include/asm/page_64.h
> +++ b/arch/x86/include/asm/page_64.h
> @@ -55,6 +55,7 @@ extern unsigned long __phys_addr_symbol(unsigned long);
>  void clear_pages_orig(void *page, unsigned long npages);
>  void clear_pages_rep(void *page, unsigned long npages);
>  void clear_pages_erms(void *page, unsigned long npages);
> +void clear_pages_movnt(void *page, unsigned long npages);
>
>  #define __HAVE_ARCH_CLEAR_USER_PAGES
>  static inline void clear_pages(void *page, unsigned int npages)
> diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
> index 2cc3b681734a..83d14f1c9f57 100644
> --- a/arch/x86/lib/clear_page_64.S
> +++ b/arch/x86/lib/clear_page_64.S
> @@ -58,3 +58,24 @@ SYM_FUNC_START(clear_pages_erms)
>  	RET
>  SYM_FUNC_END(clear_pages_erms)
>  EXPORT_SYMBOL_GPL(clear_pages_erms)
> +
> +SYM_FUNC_START(clear_pages_movnt)
> +	xorl	%eax,%eax
> +	movq	%rsi,%rcx
> +	shlq	$PAGE_SHIFT, %rcx
> +
> +	.p2align 4
> +.Lstart:
> +	movnti	%rax, 0x00(%rdi)
> +	movnti	%rax, 0x08(%rdi)
> +	movnti	%rax, 0x10(%rdi)
> +	movnti	%rax, 0x18(%rdi)
> +	movnti	%rax, 0x20(%rdi)
> +	movnti	%rax, 0x28(%rdi)
> +	movnti	%rax, 0x30(%rdi)
> +	movnti	%rax, 0x38(%rdi)
> +	addq	$0x40, %rdi
> +	subl	$0x40, %ecx
> +	ja	.Lstart
> +	RET
> +SYM_FUNC_END(clear_pages_movnt)
> --
> 2.31.1
>
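
For anyone who wants to poke at the movntdq-vs-movnti comparison above outside the kernel, here is a rough userspace sketch of the two store loops using the SSE2 intrinsics. It is not the code behind either set of numbers: zero_movnti() and zero_movntdq() are hypothetical names, the buffer is assumed to be 16-byte aligned with a length that is a multiple of 64, and the non-x86-64 fallback to memset() is only there so the file builds everywhere. In userspace the compiler owns the xmm registers; in the kernel the register would have to be saved and restored around the loop, which this sketch does not model.

```c
#include <stddef.h>
#include <string.h>
#if defined(__x86_64__)
#include <emmintrin.h>	/* _mm_stream_si64, _mm_stream_si128, _mm_sfence */
#endif

/*
 * Hypothetical helper: zero len bytes with 8-byte non-temporal stores
 * (MOVNTI), 64 bytes per iteration like the unrolled loop in the patch.
 * Assumes dst is 8-byte aligned and len is a multiple of 64.
 */
void zero_movnti(void *dst, size_t len)
{
#if defined(__x86_64__)
	long long *p = dst;

	for (size_t i = 0; i < len / 8; i += 8) {
		_mm_stream_si64(p + i + 0, 0);
		_mm_stream_si64(p + i + 1, 0);
		_mm_stream_si64(p + i + 2, 0);
		_mm_stream_si64(p + i + 3, 0);
		_mm_stream_si64(p + i + 4, 0);
		_mm_stream_si64(p + i + 5, 0);
		_mm_stream_si64(p + i + 6, 0);
		_mm_stream_si64(p + i + 7, 0);
	}
	/* Non-temporal stores are weakly ordered; fence once when done. */
	_mm_sfence();
#else
	memset(dst, 0, len);	/* fallback so the sketch builds elsewhere */
#endif
}

/*
 * Hypothetical helper: same job with 16-byte non-temporal stores
 * (MOVNTDQ) from a zeroed xmm register.
 * Assumes dst is 16-byte aligned and len is a multiple of 64.
 */
void zero_movntdq(void *dst, size_t len)
{
#if defined(__x86_64__)
	char *p = dst;
	__m128i zero = _mm_setzero_si128();

	for (size_t off = 0; off < len; off += 64) {
		_mm_stream_si128((__m128i *)(p + off +  0), zero);
		_mm_stream_si128((__m128i *)(p + off + 16), zero);
		_mm_stream_si128((__m128i *)(p + off + 32), zero);
		_mm_stream_si128((__m128i *)(p + off + 48), zero);
	}
	/* Same ordering caveat as the MOVNTI variant. */
	_mm_sfence();
#else
	memset(dst, 0, len);	/* fallback so the sketch builds elsewhere */
#endif
}
```

Both helpers issue the SFENCE themselves, whereas the patch leaves that to the caller; that is a simplification for a standalone test, not a suggestion for the kernel interface.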