System:            Oracle E2-2C
CPU:               2 nodes * 64 cores/node * 2 threads/core,
                   AMD EPYC 7742 (Rome, family:model:stepping 23:49:0)
Memory:            2048 GB evenly split between nodes
Microcode:         0x8301038
scaling_governor:  performance
L3 size:           16 * 16MB
cpufreq/boost:     0

Performance comparison of 'perf bench mem memset -l 1' for x86-64-stosq
(X86_FEATURE_REP_GOOD) and x86-64-movnt (X86_FEATURE_NT_GOOD):

              x86-64-stosq (5 runs)     x86-64-movnt (5 runs)     speedup
              -----------------------   -----------------------   -------
     size          BW   (  pstdev)           BW   (  pstdev)
     16MB     15.39 GB/s ( +- 9.14%)    14.56 GB/s ( +-19.43%)    -5.39%
    128MB     11.04 GB/s ( +- 4.87%)    14.49 GB/s ( +-13.22%)   +31.25%
   1024MB     11.86 GB/s ( +- 0.83%)    16.54 GB/s ( +- 0.04%)   +39.46%
   4096MB     11.89 GB/s ( +- 0.61%)    16.49 GB/s ( +- 0.28%)   +38.68%

The next workload exercises the page-clearing path directly by faulting
over an anonymous mmap region backed by 1GB pages. This workload is
similar to the creation phase of pinned guests in QEMU.

  $ cat pf-test.c
  #include <stdlib.h>
  #include <sys/mman.h>
  #include <linux/mman.h>

  #define HPAGE_BITS 30

  int main(int argc, char **argv)
  {
          unsigned long i;
          unsigned long len = atoi(argv[1]); /* In GB */
          unsigned long offset = 0;
          unsigned long numpages;
          char *base;

          len *= 1UL << 30;
          numpages = len >> HPAGE_BITS;

          base = mmap(NULL, len, PROT_READ|PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
                      -1, 0);
          if (base == MAP_FAILED)
                  return 1;

          for (i = 0; i < numpages; i++) {
                  *((volatile char *)base + offset) = *(base + offset);
                  offset += 1UL << HPAGE_BITS;
          }

          return 0;
  }

The specific test is for a 128GB region, but this is a single-threaded
O(n) workload, so the exact region size is not material.

Page-clearing throughput for clear_page_rep(): 11.33 GBps

  $ perf stat -r 5 --all-kernel -e ...
        bin/pf-test 128

  Performance counter stats for 'bin/pf-test 128' (5 runs):

    25,130,082,910   cpu-cycles               #  2.226 GHz                         ( +-  0.44% )  (54.54%)
     1,368,762,311   instructions             #  0.05  insn per cycle              ( +-  0.02% )  (54.54%)
     4,265,726,534   cache-references         #  377.794 M/sec                     ( +-  0.02% )  (54.54%)
       119,021,793   cache-misses             #  2.790 % of all cache refs         ( +-  3.90% )  (54.55%)
       413,825,787   branch-instructions      #  36.650 M/sec                      ( +-  0.01% )  (54.55%)
           236,847   branch-misses            #  0.06% of all branches             ( +- 18.80% )  (54.56%)
     2,152,320,887   L1-dcache-load-misses    #  40.40% of all L1-dcache accesses  ( +-  0.01% )  (54.55%)
     5,326,873,560   L1-dcache-loads          #  471.775 M/sec                     ( +-  0.20% )  (54.55%)
       828,943,234   L1-dcache-prefetches     #  73.415 M/sec                      ( +-  0.55% )  (54.54%)
            18,914   dTLB-loads               #  0.002 M/sec                       ( +- 47.23% )  (54.54%)
             4,423   dTLB-load-misses         #  23.38% of all dTLB cache accesses ( +- 27.75% )  (54.54%)

           11.2917 +- 0.0499 seconds time elapsed  ( +- 0.44% )

Page-clearing throughput for clear_page_nt(): 16.29 GBps

  $ perf stat -r 5 --all-kernel -e ... bin/pf-test 128

  Performance counter stats for 'bin/pf-test 128' (5 runs):

    17,523,166,924   cpu-cycles               #  2.230 GHz                         ( +-  0.03% )  (45.43%)
    24,801,270,826   instructions             #  1.42  insn per cycle              ( +-  0.01% )  (45.45%)
     2,151,391,033   cache-references         #  273.845 M/sec                     ( +-  0.01% )  (45.46%)
           168,555   cache-misses             #  0.008 % of all cache refs         ( +-  4.87% )  (45.47%)
     2,490,226,446   branch-instructions      #  316.974 M/sec                     ( +-  0.01% )  (45.48%)
           117,604   branch-misses            #  0.00% of all branches             ( +-  1.56% )  (45.48%)
           273,492   L1-dcache-load-misses    #  0.06% of all L1-dcache accesses   ( +-  2.14% )  (45.47%)
       490,340,458   L1-dcache-loads          #  62.414 M/sec                      ( +-  0.02% )  (45.45%)
            20,517   L1-dcache-prefetches     #  0.003 M/sec                       ( +-  9.61% )  (45.44%)
             7,413   dTLB-loads               #  0.944 K/sec                       ( +-  8.37% )  (45.44%)
             2,031   dTLB-load-misses         #  27.40% of all dTLB cache accesses ( +-  8.30% )  (45.43%)

           7.85674 +- 0.00270 seconds time elapsed  ( +- 0.03% )

The L1-dcache-load-misses (L2$ access from DC Miss) count is
substantially lower, which suggests that we aren't
doing write-allocate or RFO. The L1-dcache-prefetches are also
substantially lower.

Note that the IPC and instruction counts etc. are quite different, but
that is just an artifact of switching from a single 'REP; STOSQ' per
PAGE_SIZE region to a MOVNTI loop.

The page-clearing bandwidth shows a ~40% improvement. Additionally, a
quick 'perf bench mem memset' comparison on AMD Naples (AMD EPYC 7551)
shows similar performance gains.

So, enable X86_FEATURE_NT_GOOD for AMD Zen.

Signed-off-by: Ankur Arora <ankur.a.arora@xxxxxxxxxx>
---
 arch/x86/kernel/cpu/amd.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index dcc3d943c68f..c57eb6c28aa1 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -918,6 +918,9 @@ static void init_amd_zn(struct cpuinfo_x86 *c)
 {
 	set_cpu_cap(c, X86_FEATURE_ZEN);
 
+	if (c->x86 == 0x17)
+		set_cpu_cap(c, X86_FEATURE_NT_GOOD);
+
 #ifdef CONFIG_NUMA
 	node_reclaim_distance = 32;
 #endif
-- 
2.9.3