On Wed, Mar 15, 2023 at 02:38:33PM -0700, Sean Christopherson wrote:
> On Wed, Feb 01, 2023, Mathias Krause wrote:
> > Complement commit 4289d2728664 ("KVM: retpolines: x86: eliminate
> > retpoline from vmx.c exit handlers") and avoid a retpoline call for
> > control register accesses as well.
> >
> > This speeds up guests that make heavy use of it, like grsecurity
> > kernels toggling CR0.WP to implement kernel W^X.
>
> I would rather drop this patch for VMX and instead unconditionally make CR0.WP
> guest owned when TDP (EPT) is enabled, i.e. drop the module param from patch 6.

That's fine with me. As EPT usually implies TDP (if neither of the two was
explicitly disabled), that should be no limitation. And as the non-EPT case
only saw a very small gain from this change anyway (less than 1%), we can
drop it.

> > Signed-off-by: Mathias Krause <minipli@xxxxxxxxxxxxxx>
> > ---
> >
> > Meanwhile I got my hands on an AMD system and while doing a similar change
> > for SVM gives a small measurable win (1.1% faster for grsecurity guests),
>
> Mostly out of curiosity...
>
> Is the 1.1% roughly aligned with the gains for VMX?  If VMX sees a significantly
> larger improvement, any idea why SVM doesn't benefit as much?  E.g. did you double
> check that the kernel was actually using RETPOLINE?
I measured the runtime of the ssdd test I used before and got 3.98s for a
kernel with the whole series applied and 3.94s with the below change on top:

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index d13cf53e7390..2a471eae11c6 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3369,6 +3369,10 @@ int svm_invoke_exit_handler(struct kvm_vcpu *vcpu, u64 exit_code)
 		return intr_interception(vcpu);
 	else if (exit_code == SVM_EXIT_HLT)
 		return kvm_emulate_halt(vcpu);
+	else if (exit_code == SVM_EXIT_READ_CR0 ||
+		 exit_code == SVM_EXIT_WRITE_CR0 ||
+		 exit_code == SVM_EXIT_CR0_SEL_WRITE)
+		return cr_interception(vcpu);
 	else if (exit_code == SVM_EXIT_NPF)
 		return npf_interception(vcpu);
 #endif

Inspecting svm_invoke_exit_handler() on the host with perf confirmed it could
use the direct call of cr_interception() most of the time and thereby could
avoid the retpoline for it.

(My version of perf is, apparently, unable to detect tail calls properly and
therefore lacks symbol information for the jump targets in the below assembly
dump. I therefore added these manually.)
Percent│
       │
       │    ffffffffc194c410 <load0>:          # svm_invoke_exit_handler
  5.00 │      nop
  7.44 │      push  %rbp
 10.43 │      cmp   $0x403,%rsi
  5.86 │      mov   %rdi,%rbp
  1.23 │      push  %rbx
  2.11 │      mov   %rsi,%rbx
  4.60 │      jbe   7a
       │  16: [svm_handle_invalid_exit() path removed]
  4.59 │  7a: mov   -0x3e6a5b00(,%rsi,8),%rax
  4.52 │      test  %rax,%rax
       │      je    16
  6.25 │      cmp   $0x7c,%rsi
       │      je    dd
  4.18 │      cmp   $0x64,%rsi
       │      je    f2
  3.26 │      cmp   $0x60,%rsi
       │      je    ca
  4.57 │      cmp   $0x78,%rsi
       │      je    f9
  1.27 │      test  $0xffffffffffffffef,%rsi
       │      je    c3
  1.67 │      cmp   $0x65,%rsi
       │      je    c3
       │      cmp   $0x400,%rsi
       │      je    13d
       │      pop   %rbx
       │      pop   %rbp
       │      jmp   0xffffffffa0487d80         # __x86_indirect_thunk_rax
       │      int3
 11.68 │  c3: pop   %rbx
 10.01 │      pop   %rbp
 10.47 │      jmp   0xffffffffc19482a0         # cr_interception
       │  ca: incq  0x1940(%rdi)
       │      mov   $0x1,%eax
       │      pop   %rbx
  0.42 │      pop   %rbp
       │      ret
       │      int3
       │      int3
       │      int3
       │      int3
       │  dd: mov   0x1a20(%rdi),%rax
       │      cmpq  $0x0,0x78(%rax)
       │      je    100
       │      pop   %rbx
       │      pop   %rbp
       │      jmp   0xffffffffc185af20         # kvm_emulate_wrmsr
       │  f2: pop   %rbx
       │      pop   %rbp
  0.42 │      jmp   0xffffffffc19472b0         # interrupt_window_interception
       │  f9: pop   %rbx
       │      pop   %rbp
       │      jmp   0xffffffffc185a6a0         # kvm_emulate_halt
       │ 100: pop   %rbx
       │      pop   %rbp
       │      jmp   0xffffffffc18602a0         # kvm_emulate_rdmsr
       │ 107: mov   %rbp,%rdi
       │      mov   $0x10,%esi
       │      call  kvm_register_read_raw
       │      mov   0x24(%rbp),%edx
       │      mov   %rax,%rcx
       │      mov   %rbx,%r8
       │      mov   %gs:0x2ac00,%rax
       │      mov   0x95c(%rax),%esi
       │      mov   $0xffffffffc195dc28,%rdi
       │      call  _printk
       │      jmp   31
       │ 13d: pop   %rbx
       │      pop   %rbp
       │      jmp   0xffffffffc1946b90         # npf_interception

What's clear from the above (or so I hope!) is that cr_interception() is *the*
source of VM exits in my test run, and by taking the shortcut via a direct
call it doesn't have to do the retpoline dance, which might be the explanation
for the ~1.1% performance gain (even in the face of three additional compare
instructions).

However! As I realized that these three additional instructions probably
"hurt" all other workloads (that don't toggle CR0.WP as often as a grsecurity
kernel would), I didn't include the above change as a patch of the series.
If you think it's worth it nonetheless, as VM exits shouldn't happen often
anyway, I can do a proper patch.

> > it would provide nothing for other guests, as the change I was testing was
> > specifically targeting CR0 caused exits.
> >
> > A more general approach would instead cover CR3 and, maybe, CR4 as well.
> > However, that would require a lot more exit code compares, likely
> > vanishing the gains in the general case. So this tweak is VMX only.
>
> I don't think targeting only CR0 exits is a reason not to do this for SVM.  With
> NPT enabled, CR3 isn't intercepted, and CR4 exits should be very rare.  If the
> performance benefits are marginal (I don't have a good frame of reference for the
> 1.1%), then _that's_ a good reason to leave SVM alone.  But not giving CR3 and CR4
> priority is a non-issue.

Ok. But yeah, the win isn't all that big either, less so in real workloads
that won't exercise this code path so often.

> >  arch/x86/kvm/vmx/vmx.c | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index c788aa382611..c8198c8a9b55 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -6538,6 +6538,8 @@ static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
> >  		return handle_external_interrupt(vcpu);
> >  	else if (exit_reason.basic == EXIT_REASON_HLT)
> >  		return kvm_emulate_halt(vcpu);
> > +	else if (exit_reason.basic == EXIT_REASON_CR_ACCESS)
> > +		return handle_cr(vcpu);
> >  	else if (exit_reason.basic == EXIT_REASON_EPT_MISCONFIG)
> >  		return handle_ept_misconfig(vcpu);
> >  #endif
> > --
> > 2.39.1