* Ankur Arora <ankur.a.arora@xxxxxxxxxx> wrote:

>  System:           Oracle X6-2
>  CPU:              2 nodes * 10 cores/node * 2 threads/core
>                    Intel Xeon E5-2630 v4 (Broadwellx, 6:79:1)
>  Memory:           256 GB evenly split between nodes
>  Microcode:        0xb00002e
>  scaling_governor: performance
>  L3 size:          25MB
>  intel_pstate/no_turbo: 1
>
>  Performance comparison of 'perf bench mem memset -l 1' for x86-64-stosb
>  (X86_FEATURE_ERMS) and x86-64-movnt (X86_FEATURE_NT_GOOD):
>
>               x86-64-stosb (5 runs)     x86-64-movnt (5 runs)      speedup
>              -----------------------   -----------------------     -------
>      size          BW   (  pstdev)           BW   (  pstdev)
>
>      16MB    17.35 GB/s ( +- 9.27%)    11.83 GB/s ( +- 0.19%)     -31.81%
>     128MB     5.31 GB/s ( +- 0.13%)    11.72 GB/s ( +- 0.44%)    +121.84%
>    1024MB     5.42 GB/s ( +- 0.13%)    11.78 GB/s ( +- 0.03%)    +117.34%
>    4096MB     5.41 GB/s ( +- 0.41%)    11.76 GB/s ( +- 0.07%)    +117.37%

> +	if (c->x86 == 6 && c->x86_model == INTEL_FAM6_BROADWELL_X)
> +		set_cpu_cap(c, X86_FEATURE_NT_GOOD);

So while I agree with how you've done careful measurements to isolate bad
microarchitectures where non-temporal stores are slow, I do think this
opt-in approach doesn't scale and is hard to maintain.

Instead I'd suggest enabling this by default everywhere, and creating an
X86_FEATURE_NT_BAD quirk table for the bad microarchitectures.

This means that with new microarchitectures we'd get automatic enablement,
and hopefully chip testing would identify cases where performance isn't as
good.

I.e. the 'trust but verify' method.

Thanks,

	Ingo
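
For concreteness, a minimal sketch of the opt-out scheme suggested above,
assuming a hypothetical X86_FEATURE_NT_BAD bit were added to
arch/x86/include/asm/cpufeatures.h; the helper names and the model listed
below are placeholders for illustration, not measured results:

	#include <linux/types.h>
	#include <asm/cpufeature.h>
	#include <asm/intel-family.h>
	#include <asm/processor.h>

	/*
	 * Opt-out quirk (sketch): mark only microarchitectures known to have
	 * slow non-temporal stores; everything else gets them by default.
	 */
	static void init_intel_nt_quirks(struct cpuinfo_x86 *c)
	{
		/* Placeholder entry -- the real list would come from testing. */
		if (c->x86 == 6 && c->x86_model == INTEL_FAM6_XEON_PHI_KNL)
			set_cpu_cap(c, X86_FEATURE_NT_BAD);
	}

	/* Consumers would then key off the absence of the quirk, e.g.: */
	static inline bool memset_prefer_movnt(void)
	{
		return !boot_cpu_has(X86_FEATURE_NT_BAD);
	}

With a table like this, new microarchitectures get non-temporal stores
automatically, and a part that tests badly only needs a one-line quirk
entry rather than a per-model opt-in.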