Re: [PATCH] x86/clear_user: Make it faster

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Wed, 22 Jun 2022 10:06:42 -0500

On Wed, Jun 22, 2022 at 9:21 AM Borislav Petkov <bp@xxxxxxxxx> wrote:
>
> and frankly, this looks really weird:

I'm not sure how valid the TSC thing is, with the extra
synchronization maybe interacting with the whole microcode engine
startup/stop thing.

I'm also not sure the rdtsc is doing the same thing on your AMD tests
vs your Intel tests - I suspect you end up using 'rdtscp' on both (as
opposed to the 'lfence; rdtsc' variant we also have), but I don't think
the ordering really is all that well defined architecturally, so AMD
may have very different serialization rules than Intel does.

.. and that serialization may well be different wrt normal load/stores
and microcode.
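
For illustration only, here is a minimal user-space sketch of the two
ordered TSC reads mentioned above, wrapped around a clear. It is not the
kernel's code: the names tsc_lfence(), tsc_rdtscp() and time_clear() are
made up for this example, it assumes GCC or clang, and the rdtscp variant
assumes a CPU that has that instruction. On non-x86-64 builds both readers
fall back to a nanosecond clock so the sketch still compiles.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#if defined(__x86_64__)
#include <x86intrin.h>

/* "lfence; rdtsc": the fence keeps rdtsc from executing early. */
static uint64_t tsc_lfence(void)
{
	_mm_lfence();
	return __rdtsc();
}

/* "rdtscp": waits for earlier instructions, but is not fully serializing. */
static uint64_t tsc_rdtscp(void)
{
	unsigned int aux;

	return __rdtscp(&aux);
}
#else
/* Fallback (assumption for portability): nanoseconds, not cycles. */
static uint64_t tsc_lfence(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

static uint64_t tsc_rdtscp(void)
{
	return tsc_lfence();
}
#endif

typedef uint64_t (*tsc_fn)(void);

/* Time one clear of buf[0..len) with the chosen counter read. */
static uint64_t time_clear(void *buf, size_t len, tsc_fn now)
{
	uint64_t t0 = now();

	memset(buf, 0, len);
	/* Compiler barrier so the memset is not moved or dropped. */
	__asm__ volatile("" : : "r"(buf) : "memory");

	return now() - t0;
}
```

Calling time_clear() once with each reader on the same buffer shows how
much of a small difference could come from the measurement method alone,
which is the concern raised above.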

So those numbers look like they have a 3% difference, but I'm not 100%
convinced that it isn't due to measuring artifacts. The fact that it
worked well for you on your AMD platform doesn't necessarily mean that
it has to work on icelake-x.

But it could equally easily be that "rep stosb" really just isn't any
better on that platform, and the numbers are just giving the plain
reality.

Or it could mean that it makes some cache access decision ("this is
big enough that let's not pollute L1 caches, do stores directly to
L2") that might be better for actual performance afterwards, but that
makes that clearing itself that bit slower.
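
To make the comparison concrete, a hedged user-space sketch of the two
clearing styles being weighed: a single "rep stosb" versus a plain loop of
8-byte stores. The function names are hypothetical, this is not the patch
under discussion, and the inline asm assumes GCC or clang on x86-64 (other
targets fall back to memset).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Clear len bytes with one "rep stosb" (rdi = dst, rcx = len, al = 0). */
static void clear_rep_stosb(void *dst, size_t len)
{
#if defined(__x86_64__) && defined(__GNUC__)
	__asm__ volatile("rep stosb"
			 : "+D"(dst), "+c"(len)
			 : "a"(0)
			 : "memory");
#else
	memset(dst, 0, len);	/* fallback on other targets */
#endif
}

/*
 * Clear len bytes with 8-byte stores plus a byte tail. Note that an
 * optimizing compiler is free to turn this loop into a memset call.
 */
static void clear_word_loop(void *dst, size_t len)
{
	unsigned char *p = dst;
	const uint64_t zero = 0;

	while (len >= sizeof(zero)) {
		memcpy(p, &zero, sizeof(zero));
		p += sizeof(zero);
		len -= sizeof(zero);
	}
	while (len--)
		*p++ = 0;
}
```

Which of the two is faster, and what it does to the caches afterwards, is
exactly the part that depends on the microarchitecture.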

IOW, I do think that microbenchmarks are kind of suspect to begin
with, and the rdtsc thing in particular may work better on some
microarchitectures than it does on others.

Very hard to make a judgment call - I think the only thing that really
ends up mattering is the macro-benchmarks, but I think when you tried
that it was way too noisy to actually show any real signal.

That is, of course, a problem with memcpy and memset in general. It's
easy to do microbenchmarks for them, it's just not clear whether said
microbenchmarks give numbers that are actually meaningful, exactly
because of things like cache replacement policy etc.
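
A small sketch of why cache state matters for such numbers: the same clear
is timed once with the buffer freshly touched ("warm") and once after
walking a large scratch area ("cold"). Everything here is an assumption
made for illustration: the names, the 64 MiB eviction size (a guess at
"bigger than the caches"), and the use of clock_gettime() rather than the
TSC.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define EVICT_BYTES ((size_t)64 << 20)	/* assumed larger than the caches */

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Touch one byte per 64-byte line of a big scratch area. */
static void evict_caches(void)
{
	static volatile unsigned char *junk;

	if (!junk)
		junk = calloc(EVICT_BYTES, 1);
	if (!junk)
		return;		/* allocation failed: skip eviction */
	for (size_t i = 0; i < EVICT_BYTES; i += 64)
		junk[i]++;
}

/* Time one clear; cold != 0 evicts first, otherwise warms the buffer. */
static uint64_t timed_clear_ns(void *buf, size_t len, int cold)
{
	uint64_t t0;

	if (cold)
		evict_caches();
	else
		memset(buf, 1, len);

	t0 = now_ns();
	memset(buf, 0, len);
	/* Compiler barrier so the clear stays inside the timed region. */
	__asm__ volatile("" : : "r"(buf) : "memory");
	return now_ns() - t0;
}
```

Neither number says what the clear costs the code that runs next, which is
the part a microbenchmark like this cannot see.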

And finally, I will repeat that this particular code probably just
isn't that important. The memory clearing for page allocation and
regular memcpy is where most of the real time is spent, so I don't
think that you should necessarily worry too much about this special
case.

                 Linus