Re: [PATCH] KVM: x86: Reduce retpoline performance impact in slot_handle_level_range()

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Fri, 2 Feb 2018 10:50:50 -0800

On Fri, Feb 2, 2018 at 6:59 AM, David Woodhouse <dwmw@xxxxxxxxxxxx> wrote:
> With retpoline, tight loops of "call this function for every XXX" are
> very much pessimised by taking a prediction miss *every* time.
>
> This one showed up very high in our early testing, and it only has five
> things it'll ever call so make it take an 'op' enum instead of a
> function pointer and let's see how that works out...

Umm. May I suggest a different workaround?

Honestly, if this is so performance-critical, the *real* fix is to
actually just mark all those "slot_handle_*()" functions as
"always_inline".

Because that will *really* improve performance, by simply removing the
indirection entirely - since then the functions involved will become
static. You might get other code improvements too, because I suspect
it will end up removing an extra level of function call due to those
trivial wrapper functions. And there's a couple of "bool
lock_flush_tlb" arguments that will simply become constant and
generate much better code that way.

And maybe you don't want to inline all of the slot_handle_*()
functions, and it's only one or two of them that matter because they
loop over a lot of entries, but honestly, most of those
slot_handle_xyz() functions seem to have just a couple of call sites
anyway.

slot_handle_large_level() is probably already inlined by the compiler
because it only has a single call-site.

Will it make for bigger code? Yes. But probably not really all *that*
much bigger, because of how it also will allow the compiler to
simplify some things.  An dif this really is so critical that those
non-predicted calls were that noticeable, those other simplifications
probably also matter.

And then  you get rid of all run-time conditionals, and all the
indirect jumps entirely. Plus the patch will be smaller and simpler
too.

Hmm?

              Linus