* Borislav Petkov <bp@xxxxxxxxx> wrote:

> Ok,
>
> finally a somewhat final version, lightly tested.
>
> I still need to run it on production Icelake and that is kinda being
> delayed due to server room cooling issues (don't ask ;-\).

> So Mel gave me the idea to simply measure how fast the function becomes.
> I.e.:
>
>   start = rdtsc_ordered();
>   ret = __clear_user(to, n);
>   end = rdtsc_ordered();
>
> Computing the mean average of all the samples collected during the test
> suite run then shows some improvement:
>
>   clear_user_original:
>   Amean: 9219.71 (Sum: 6340154910, samples: 687674)
>
>   fsrm:
>   Amean: 8030.63 (Sum: 5522277720, samples: 687652)
>
> That's on Zen3.

As a side note, there's some rudimentary perf tooling that allows the
user-space testing of kernel-space x86 memcpy and memset implementations:

  $ perf bench mem memcpy
  # Running 'mem/memcpy' benchmark:
  # function 'default' (Default memcpy() provided by glibc)
  # Copying 1MB bytes ...

        42.459239 GB/sec
  # function 'x86-64-unrolled' (unrolled memcpy() in arch/x86/lib/memcpy_64.S)
  # Copying 1MB bytes ...

        23.818598 GB/sec
  # function 'x86-64-movsq' (movsq-based memcpy() in arch/x86/lib/memcpy_64.S)
  # Copying 1MB bytes ...

        10.172526 GB/sec
  # function 'x86-64-movsb' (movsb-based memcpy() in arch/x86/lib/memcpy_64.S)
  # Copying 1MB bytes ...

        10.614810 GB/sec

Note how the actual implementation in arch/x86/lib/memcpy_64.S was used to
build a user-space test into 'perf bench'.

For copy_user() & clear_user() some additional wrappery would be needed I
guess, to wrap away stac()/clac()/might_sleep(), etc. ...

[ Plus it could all be improved to measure cache hot & cache cold
  performance, to use different sizes, etc. ]

Even with the limitation that it's not 100% equivalent to the kernel-space
thing, especially for very short buffers, having the whole perf side
benchmarking, profiling & statistics machinery available is a plus I think.

Thanks,

	Ingo
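
For reference, the per-call measurement quoted above (timestamp, call, timestamp, then Sum / samples) could be mimicked in user space roughly as below. This is only a sketch under assumptions, not the code that produced the numbers: memset() stands in for __clear_user(), clock_gettime(CLOCK_MONOTONIC) stands in for rdtsc_ordered() so the result is in nanoseconds rather than TSC cycles, and the names sample_stats, now_ns(), timed_clear() and amean() are made up for illustration.

```c
#define _POSIX_C_SOURCE 199309L

#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Hypothetical accumulator, mirroring the "Sum" and "samples" fields. */
struct sample_stats {
	uint64_t sum;      /* total elapsed nanoseconds over all timed calls */
	uint64_t samples;  /* number of timed calls */
};

/* Monotonic timestamp in ns; an assumed stand-in for rdtsc_ordered(). */
static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Time one memset() call (assumed stand-in for __clear_user()). */
static void timed_clear(struct sample_stats *st, void *to, size_t n)
{
	uint64_t start = now_ns();

	memset(to, 0, n);
	st->sum += now_ns() - start;
	st->samples++;
}

/* Arithmetic mean: Sum / samples, or 0 if nothing was recorded. */
static double amean(const struct sample_stats *st)
{
	return st->samples ? (double)st->sum / (double)st->samples : 0.0;
}
```

A caller would zero-initialize a struct sample_stats, call timed_clear() once per buffer under test, and print amean() at the end. Very short buffers will be dominated by the timestamp overhead, which is the same caveat as in the mail above.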