On Wed, Mar 15, 2023 at 02:38:33PM -0700, Sean Christopherson wrote:
> On Wed, Feb 01, 2023, Mathias Krause wrote:
> > Complement commit 4289d2728664 ("KVM: retpolines: x86: eliminate
> > retpoline from vmx.c exit handlers") and avoid a retpoline call for
> > control register accesses as well.
> >
> > This speeds up guests that make heavy use of it, like grsecurity
> > kernels toggling CR0.WP to implement kernel W^X.
>
> I would rather drop this patch for VMX and instead unconditionally make CR0.WP
> guest owned when TDP (EPT) is enabled, i.e. drop the module param from patch 6.

That's fine with me. As EPT usually implies TDP (if neither of the two was
explicitly disabled), that should be no limitation. And as the non-EPT case
only saw a very small gain from this change anyway (less than 1%), we can
drop it.

> > Signed-off-by: Mathias Krause <minipli@xxxxxxxxxxxxxx>
> > ---
> >
> > Meanwhile I got my hands on an AMD system and while doing a similar change
> > for SVM gives a small measurable win (1.1% faster for grsecurity guests),
>
> Mostly out of curiosity...
>
> Is the 1.1% roughly aligned with the gains for VMX?  If VMX sees a significantly
> larger improvement, any idea why SVM doesn't benefit as much?  E.g. did you double
> check that the kernel was actually using RETPOLINE?
I measured the runtime of the ssdd test I used before and got 3.98s for a
kernel with the whole series applied and 3.94s with the below change on top:

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index d13cf53e7390..2a471eae11c6 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3369,6 +3369,10 @@ int svm_invoke_exit_handler(struct kvm_vcpu *vcpu, u64 exit_code)
 		return intr_interception(vcpu);
 	else if (exit_code == SVM_EXIT_HLT)
 		return kvm_emulate_halt(vcpu);
+	else if (exit_code == SVM_EXIT_READ_CR0 ||
+		 exit_code == SVM_EXIT_WRITE_CR0 ||
+		 exit_code == SVM_EXIT_CR0_SEL_WRITE)
+		return cr_interception(vcpu);
 	else if (exit_code == SVM_EXIT_NPF)
 		return npf_interception(vcpu);
 #endif

Inspecting svm_invoke_exit_handler() on the host with perf confirmed it could
use the direct call of cr_interception() most of the time and thereby could
avoid the retpoline for it.

(My version of perf is, apparently, unable to detect tail calls properly and
therefore lacks symbol information for the jump targets in the below assembly
dump. I therefore added these manually.)
Percent│
       │
       │    ffffffffc194c410 <load0>:          # svm_invoke_exit_handler
  5.00 │      nop
  7.44 │      push  %rbp
 10.43 │      cmp   $0x403,%rsi
  5.86 │      mov   %rdi,%rbp
  1.23 │      push  %rbx
  2.11 │      mov   %rsi,%rbx
  4.60 │      jbe   7a
       │  16: [svm_handle_invalid_exit() path removed]
  4.59 │  7a: mov   -0x3e6a5b00(,%rsi,8),%rax
  4.52 │      test  %rax,%rax
       │      je    16
  6.25 │      cmp   $0x7c,%rsi
       │      je    dd
  4.18 │      cmp   $0x64,%rsi
       │      je    f2
  3.26 │      cmp   $0x60,%rsi
       │      je    ca
  4.57 │      cmp   $0x78,%rsi
       │      je    f9
  1.27 │      test  $0xffffffffffffffef,%rsi
       │      je    c3
  1.67 │      cmp   $0x65,%rsi
       │      je    c3
       │      cmp   $0x400,%rsi
       │      je    13d
       │      pop   %rbx
       │      pop   %rbp
       │      jmp   0xffffffffa0487d80         # __x86_indirect_thunk_rax
       │      int3
 11.68 │  c3: pop   %rbx
 10.01 │      pop   %rbp
 10.47 │      jmp   0xffffffffc19482a0         # cr_interception
       │  ca: incq  0x1940(%rdi)
       │      mov   $0x1,%eax
       │      pop   %rbx
  0.42 │      pop   %rbp
       │      ret
       │      int3
       │      int3
       │      int3
       │      int3
       │  dd: mov   0x1a20(%rdi),%rax
       │      cmpq  $0x0,0x78(%rax)
       │      je    100
       │      pop   %rbx
       │      pop   %rbp
       │      jmp   0xffffffffc185af20         # kvm_emulate_wrmsr
       │  f2: pop   %rbx
       │      pop   %rbp
  0.42 │      jmp   0xffffffffc19472b0         # interrupt_window_interception
       │  f9: pop   %rbx
       │      pop   %rbp
       │      jmp   0xffffffffc185a6a0         # kvm_emulate_halt
       │ 100: pop   %rbx
       │      pop   %rbp
       │      jmp   0xffffffffc18602a0         # kvm_emulate_rdmsr
       │ 107: mov   %rbp,%rdi
       │      mov   $0x10,%esi
       │      call  kvm_register_read_raw
       │      mov   0x24(%rbp),%edx
       │      mov   %rax,%rcx
       │      mov   %rbx,%r8
       │      mov   %gs:0x2ac00,%rax
       │      mov   0x95c(%rax),%esi
       │      mov   $0xffffffffc195dc28,%rdi
       │      call  _printk
       │      jmp   31
       │ 13d: pop   %rbx
       │      pop   %rbp
       │      jmp   0xffffffffc1946b90         # npf_interception

What's clear from the above (or so I hope!) is that cr_interception() is *the*
source of VM exits in my test run, and by taking the shortcut via a direct
call it doesn't have to do the retpoline dance, which might be the explanation
for the ~1.1% performance gain (even in the face of three additional compare
instructions).

However! As I realized that these three additional instructions probably
"hurt" all other workloads (that don't toggle CR0.WP as often as a grsecurity
kernel would), I didn't include the above change as a patch of the series.
If you think it's worth it nonetheless, as VM exits shouldn't happen often
anyway, I can do a proper patch.

> > it would provide nothing for other guests, as the change I was testing was
> > specifically targeting CR0 caused exits.
> >
> > A more general approach would instead cover CR3 and, maybe, CR4 as well.
> > However, that would require a lot more exit code compares, likely
> > vanishing the gains in the general case. So this tweak is VMX only.
>
> I don't think targeting only CR0 exits is a reason not to do this for SVM.  With
> NPT enabled, CR3 isn't intercepted, and CR4 exits should be very rare.  If the
> performance benefits are marginal (I don't have a good frame of reference for the
> 1.1%), then _that's_ a good reason to leave SVM alone.  But not giving CR3 and CR4
> priority is a non-issue.

Ok. But yeah, the win isn't all that big either, less so in real workloads
that won't exercise this code path so often.

> >  arch/x86/kvm/vmx/vmx.c | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index c788aa382611..c8198c8a9b55 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -6538,6 +6538,8 @@ static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
> >  		return handle_external_interrupt(vcpu);
> >  	else if (exit_reason.basic == EXIT_REASON_HLT)
> >  		return kvm_emulate_halt(vcpu);
> > +	else if (exit_reason.basic == EXIT_REASON_CR_ACCESS)
> > +		return handle_cr(vcpu);
> >  	else if (exit_reason.basic == EXIT_REASON_EPT_MISCONFIG)
> >  		return handle_ept_misconfig(vcpu);
> >  #endif
> > --
> > 2.39.1