This is the second version of the patch - it adds a test for
boot_cpu_data.x86_clflush_size. There may be CPUs with a different cache
line size and we don't want to run the 64-byte aligned loop on them.

Mikulas


From: Mikulas Patocka <mpatocka@xxxxxxxxxx>

memcpy_flushcache: use cache flushing for larger lengths

I tested dm-writecache performance on a machine with Optane nvdimm and
it turned out that for larger writes, cached stores + cache flushing
perform better than non-temporal stores.

This is the throughput of dm-writecache measured with this command:
dd if=/dev/zero of=/dev/mapper/wc bs=64 oflag=direct

block size	512		1024		2048		4096
movnti		496 MB/s	642 MB/s	725 MB/s	744 MB/s
clflushopt	373 MB/s	688 MB/s	1.1 GB/s	1.2 GB/s

We can see that for smaller blocks, movnti performs better, but for
larger blocks, clflushopt has better performance.

This patch changes the function __memcpy_flushcache accordingly, so that
with size >= 768 it performs cached stores and cache flushing. Note that
we must not use the new branch if the CPU doesn't have clflushopt - in
that case, the kernel would use the inefficient "clflush" instruction
that has very bad performance.
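For illustration, here is a hypothetical userspace sketch of the copy structure the patch adds: align the destination to a 64-byte cache line with 8-byte copies, then move whole cache lines. Plain memcpy() stands in for the movnti and movq+clflushopt instruction sequences, which need CLFLUSHOPT-capable hardware, and the function name and size guard in the head loop are my additions, not part of the patch:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch only: memcpy() replaces the non-temporal / flushed stores. */
static void copy_flushcache_sketch(void *_dst, const void *_src, size_t size)
{
	uintptr_t dest = (uintptr_t)_dst;
	uintptr_t source = (uintptr_t)_src;

	if (size >= 768) {
		/* head: 8-byte copies until dest hits a 64-byte boundary
		 * (the size guard keeps the sketch safe for any pointer) */
		while (dest % 64 && size >= 8) {
			memcpy((void *)dest, (const void *)source, 8);
			dest += 8;
			source += 8;
			size -= 8;
		}
		/* body: whole 64-byte cache lines; the real patch issues
		 * clflushopt on each line right after storing it */
		while (size >= 64) {
			memcpy((void *)dest, (const void *)source, 64);
			dest += 64;
			source += 64;
			size -= 64;
		}
	}
	/* tail: whatever remains goes through the ordinary path */
	memcpy((void *)dest, (const void *)source, size);
}
```

The point of the structure is that clflushopt operates on whole cache lines, so the body loop must start line-aligned; the 768-byte threshold is the empirical crossover from the table above.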
Signed-off-by: Mikulas Patocka <mpatocka@xxxxxxxxxx>

---
 arch/x86/lib/usercopy_64.c |   36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

Index: linux-2.6/arch/x86/lib/usercopy_64.c
===================================================================
--- linux-2.6.orig/arch/x86/lib/usercopy_64.c	2020-03-24 15:15:36.644945091 -0400
+++ linux-2.6/arch/x86/lib/usercopy_64.c	2020-03-30 07:17:51.450290007 -0400
@@ -152,6 +152,42 @@ void __memcpy_flushcache(void *_dst, con
 		return;
 	}
 
+	if (static_cpu_has(X86_FEATURE_CLFLUSHOPT) && size >= 768 && likely(boot_cpu_data.x86_clflush_size == 64)) {
+		while (!IS_ALIGNED(dest, 64)) {
+			asm("movq (%0), %%r8\n"
+			    "movnti %%r8, (%1)\n"
+				:: "r" (source), "r" (dest)
+				: "memory", "r8");
+			dest += 8;
+			source += 8;
+			size -= 8;
+		}
+		do {
+			asm("movq (%0), %%r8\n"
+			    "movq 8(%0), %%r9\n"
+			    "movq 16(%0), %%r10\n"
+			    "movq 24(%0), %%r11\n"
+			    "movq %%r8, (%1)\n"
+			    "movq %%r9, 8(%1)\n"
+			    "movq %%r10, 16(%1)\n"
+			    "movq %%r11, 24(%1)\n"
+			    "movq 32(%0), %%r8\n"
+			    "movq 40(%0), %%r9\n"
+			    "movq 48(%0), %%r10\n"
+			    "movq 56(%0), %%r11\n"
+			    "movq %%r8, 32(%1)\n"
+			    "movq %%r9, 40(%1)\n"
+			    "movq %%r10, 48(%1)\n"
+			    "movq %%r11, 56(%1)\n"
+				:: "r" (source), "r" (dest)
+				: "memory", "r8", "r9", "r10", "r11");
+			clflushopt((void *)dest);
+			dest += 64;
+			source += 64;
+			size -= 64;
+		} while (size >= 64);
+	}
+
 	/* 4x8 movnti loop */
 	while (size >= 32) {
 		asm("movq (%0), %%r8\n"

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel