> On May 17, 2022, at 9:42 PM, Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Thu, May 12, 2022, Jon Kohler wrote:
>> Avoid expensive rdmsr on every VM-Exit for MSR_IA32_SPEC_CTRL on
>> eIBRS enabled systems iff the guest only sets IA32_SPEC_CTRL[0] (IBRS)
>> and not [1] (STIBP) or [2] (SSBD), by not disabling interception in
>> the MSR bitmap.
>>
>> eIBRS enabled guests using just IBRS will only write the SPEC_CTRL MSR
>> once or twice per vCPU on boot, so it is far better to take those
>> VM-Exits on boot than to read and save this MSR on every single
>> VM-Exit forever. This outcome was suggested in Andrea's commit
>> 2f46993d83ff ("x86: change default to spec_store_bypass_disable=prctl spectre_v2_user=prctl");
>> however, since interception is still unilaterally disabled, the rdmsr
>> tax is still there even after that commit.
>>
>> This is a significant win for eIBRS enabled systems, as this rdmsr
>> accounts for roughly 50% of time in vmx_vcpu_run() as observed
>> by perf top disassembly, and is in the critical path for all
>> VM-Exits, including fastpath exits.
>>
>> Update relevant comments in vmx_vcpu_run() with appropriate SDM
>> references for future onlookers.
>>
>> Fixes: 2f46993d83ff ("x86: change default to spec_store_bypass_disable=prctl spectre_v2_user=prctl")
>> Signed-off-by: Jon Kohler <jon@xxxxxxxxxxx>
>> Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx>
>> Cc: Kees Cook <keescook@xxxxxxxxxxxx>
>> Cc: Josh Poimboeuf <jpoimboe@xxxxxxxxxx>
>> Cc: Waiman Long <longman@xxxxxxxxxx>
>> ---
>> arch/x86/kvm/vmx/vmx.c | 46 +++++++++++++++++++++++++++++++-----------
>> 1 file changed, 34 insertions(+), 12 deletions(-)
>>
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index d58b763df855..d9da6fcecd8c 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -2056,6 +2056,25 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>> if (kvm_spec_ctrl_test_value(data))
>> return 1;
>>
>> + /*
>> + * For Intel eIBRS, IBRS (SPEC_CTRL_IBRS aka 0x00000048 BIT(0))
>> + * only needs to be set once and can be left on forever without
>> + * needing to be constantly toggled. If the guest attempts to
>> + * write that value, let's not disable interception. Guests
>> + * with eIBRS awareness should only be writing SPEC_CTRL_IBRS
>> + * once per vCPU per boot.
>> + *
>> + * The guest can still use other SPEC_CTRL features on top of
>> + * eIBRS such as SSBD, and we should disable interception for
>
> Please don't add comments that say "should" or "need", just state what is being
> done. Just because a comment says XYZ should do something doesn't necessarily
> mean that XYZ actually does that thing.
>
>> + * those situations to avoid a multitude of VM-Exits;
>> + * however, we will need to check SPEC_CTRL on each exit to
>> + * make sure we restore the host value properly.
>> + */
>
> This whole comment can be more succinct. Better yet, merge it with the comment
> below (out of sight in this diff) and opportunistically update that comment to
> reflect what actually happens if L2 is the first to write a non-zero value
> (arguably a bug that should be fixed, but meh).
> The IBPB MSR has the same flaw. :-/
>
>> + if (boot_cpu_has(X86_FEATURE_IBRS_ENHANCED) && data == BIT(0)) {
>
> Use SPEC_CTRL_IBRS instead of open coding "BIT(0)", then a chunk of the comment
> goes away.
>
>> + vmx->spec_ctrl = data;
>> + break;
>> + }
>
> There's no need for a separate if statement. And the boot_cpu_has() check can
> be dropped, kvm_spec_ctrl_test_value() has already verified the bit is writable
> (unless you're worried about bit 0 being used for something else?)
>
>> +
>> vmx->spec_ctrl = data;
>> if (!data)
>> break;
>> @@ -6887,19 +6906,22 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
>> vmx_vcpu_enter_exit(vcpu, vmx);
>>
>> /*
>> - * We do not use IBRS in the kernel. If this vCPU has used the
>> - * SPEC_CTRL MSR it may have left it on; save the value and
>> - * turn it off. This is much more efficient than blindly adding
>> - * it to the atomic save/restore list. Especially as the former
>> - * (Saving guest MSRs on vmexit) doesn't even exist in KVM.
>> - *
>> - * For non-nested case:
>> - * If the L01 MSR bitmap does not intercept the MSR, then we need to
>> - * save it.
>> + * SDM 25.1.3 - handle conditional exit for MSR_IA32_SPEC_CTRL.
>> + * To prevent constant VM exits for SPEC_CTRL, kernel may
>
> Please wrap at 80 chars (ignore the bad examples in KVM).
>
>> + * disable interception in the MSR bitmap for SPEC_CTRL MSR,
>> + * such that the guest can read and write to that MSR without
>> + * trapping to KVM; however, the guest may set a different
>> + * value than the host. For exit handling, do rdmsr below if
>> + * interception is disabled, such that we can save the guest
>> + * value for restore on VM entry, as it does not get saved
>> + * automatically per SDM 27.3.1.
>> *
>> - * For nested case:
>> - * If the L02 MSR bitmap does not intercept the MSR, then we need to
>> - * save it.
>> + * This behavior is optimized on eIBRS enabled systems, such
>> + * that the kernel only disables interception for MSR_IA32_SPEC_CTRL
>> + * when guests choose to use additional SPEC_CTRL features
>> + * above and beyond IBRS, such as STIBP or SSBD. This
>> + * optimization allows the kernel to avoid doing the expensive
>> + * rdmsr below.
>
> I don't see any reason to restate why the MSR may or may not be intercepted, just
> explain when the value needs to be read.
>
> E.g. something like this for the whole patch?

Sean - Thanks for the feedback as always, good stuff. I'll pull this together,
double-check the testing side, and send out a v2 with all this incorporated.

> ---
> arch/x86/kvm/vmx/vmx.c | 60 +++++++++++++++++++-----------------------
> 1 file changed, 27 insertions(+), 33 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 610355b9ccce..70d863a7882d 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -2057,20 +2057,30 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> return 1;
>
> vmx->spec_ctrl = data;
> - if (!data)
> +
> + /*
> + * Disable interception on the first non-zero write, unless the
> + * guest is setting only SPEC_CTRL_IBRS, which is typically set
> + * once at boot and never touched again. All other bits are
> + * often set on a per-task basis, i.e. may change frequently,
> + * so the benefit of avoiding VM-exits during guest context
> + * switches outweighs the cost of RDMSR on every VM-Exit to
> + * save the guest's value.
> + */
> + if (!data || data == SPEC_CTRL_IBRS)
> break;
>
> /*
> - * For non-nested:
> - * When it's written (to non-zero) for the first time, pass
> - * it through.
> - *
> - * For nested:
> - * The handling of the MSR bitmap for L2 guests is done in
> - * nested_vmx_prepare_msr_bitmap. We should not touch the
> - * vmcs02.msr_bitmap here since it gets completely overwritten
> - * in the merging.
> - * We update the vmcs01 here for L1 as well
> - * since it will end up touching the MSR anyway now.
> + * Update vmcs01.msr_bitmap even if L2 is active, i.e. disable
> + * interception for the vCPU on the first write regardless of
> + * whether the WRMSR came from L1 or L2. vmcs02's bitmap is a
> + * combination of vmcs01 and vmcs12 bitmaps, and will be
> + * recomputed by nested_vmx_prepare_msr_bitmap() on the next
> + * nested VM-Enter. Note, this does mean that future WRMSRs
> + * from L2 will be intercepted until the next nested VM-Exit if
> + * L2 was the first to write, but L1 exposing the MSR to L2
> + * without first writing it is unlikely and not worth the
> + * extra bit of complexity.
> */
> vmx_disable_intercept_for_msr(vcpu,
> MSR_IA32_SPEC_CTRL,
> @@ -2098,15 +2108,9 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB);
>
> /*
> - * For non-nested:
> - * When it's written (to non-zero) for the first time, pass
> - * it through.
> - *
> - * For nested:
> - * The handling of the MSR bitmap for L2 guests is done in
> - * nested_vmx_prepare_msr_bitmap. We should not touch the
> - * vmcs02.msr_bitmap here since it gets completely overwritten
> - * in the merging.
> + * Disable interception on the first IBPB, odds are good IBPB
> + * will be a frequent guest action. See the comment for
> + * MSR_IA32_SPEC_CTRL for details on the nested interaction.
> */
> vmx_disable_intercept_for_msr(vcpu, MSR_IA32_PRED_CMD, MSR_TYPE_W);
> break;
> @@ -6887,19 +6891,9 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
> vmx_vcpu_enter_exit(vcpu, vmx);
>
> /*
> - * We do not use IBRS in the kernel. If this vCPU has used the
> - * SPEC_CTRL MSR it may have left it on; save the value and
> - * turn it off. This is much more efficient than blindly adding
> - * it to the atomic save/restore list. Especially as the former
> - * (Saving guest MSRs on vmexit) doesn't even exist in KVM.
> - *
> - * For non-nested case:
> - * If the L01 MSR bitmap does not intercept the MSR, then we need to
> - * save it.
> - *
> - * For nested case:
> - * If the L02 MSR bitmap does not intercept the MSR, then we need to
> - * save it.
> + * Save SPEC_CTRL if it may have been written by the guest, the current
> + * value in hardware is used by x86_spec_ctrl_restore_host() to avoid
> + * WRMSR if the current value matches the host's desired value.
> */
> if (unlikely(!msr_write_intercepted(vmx, MSR_IA32_SPEC_CTRL)))
> vmx->spec_ctrl = native_read_msr(MSR_IA32_SPEC_CTRL);
>
> base-commit: 69b59889b0147aa80098936e383b06fec30cdf5c
> --