Re: [PATCH] KVM: VMX: do not disable interception for MSR_IA32_SPEC_CTRL on eIBRS

> On May 17, 2022, at 9:42 PM, Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> 
> On Thu, May 12, 2022, Jon Kohler wrote:
>> Avoid expensive rdmsr on every VM Exit for MSR_IA32_SPEC_CTRL on
>> eIBRS enabled systems iff the guest only sets IA32_SPEC_CTRL[0] (IBRS)
>> and not [1] (STIBP) or [2] (SSBD) by not disabling interception in
>> the MSR bitmap.
>> 
>> eIBRS enabled guests using just IBRS will only write SPEC_CTRL MSR
>> once or twice per vCPU on boot, so it is far better to take those
>> VM exits on boot than having to read and save this msr on every
>> single VM exit forever. This outcome was suggested in Andrea's commit
>> 2f46993d83ff ("x86: change default to spec_store_bypass_disable=prctl spectre_v2_user=prctl")
>> however, since interception is still unilaterally disabled, the rdmsr
>> tax is still there even after that commit.
>> 
>> This is a significant win for eIBRS enabled systems as this rdmsr
>> accounts for roughly ~50% of time for vmx_vcpu_run() as observed
>> by perf top disassembly, and is in the critical path for all
>> VM-Exits, including fastpath exits.
>> 
>> Update relevant comments in vmx_vcpu_run() with appropriate SDM
>> references for future onlookers.
>> 
>> Fixes: 2f46993d83ff ("x86: change default to spec_store_bypass_disable=prctl spectre_v2_user=prctl")
>> Signed-off-by: Jon Kohler <jon@xxxxxxxxxxx>
>> Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx>
>> Cc: Kees Cook <keescook@xxxxxxxxxxxx>
>> Cc: Josh Poimboeuf <jpoimboe@xxxxxxxxxx>
>> Cc: Waiman Long <longman@xxxxxxxxxx>
>> ---
>> arch/x86/kvm/vmx/vmx.c | 46 +++++++++++++++++++++++++++++++-----------
>> 1 file changed, 34 insertions(+), 12 deletions(-)
>> 
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index d58b763df855..d9da6fcecd8c 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -2056,6 +2056,25 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>> 		if (kvm_spec_ctrl_test_value(data))
>> 			return 1;
>> 
>> +		/*
>> +		 * For Intel eIBRS, IBRS (SPEC_CTRL_IBRS aka 0x00000048 BIT(0))
>> +		 * only needs to be set once and can be left on forever without
>> +		 * needing to be constantly toggled. If the guest attempts to
>> +		 * write that value, let's not disable interception. Guests
>> +		 * with eIBRS awareness should only be writing SPEC_CTRL_IBRS
>> +		 * once per vCPU per boot.
>> +		 *
>> +		 * The guest can still use other SPEC_CTRL features on top of
>> +		 * eIBRS such as SSBD, and we should disable interception for
> 
> Please don't add comments that say "should" or "need", just state what is being
> done.  Just because a comment says XYZ should do something doesn't necessarily
> mean that XYZ actually does that thing.
> 
>> +		 * those situations to avoid a multitude of VM-Exits;
>> +		 * however, we will need to check SPEC_CTRL on each exit to
>> +		 * make sure we restore the host value properly.
>> +		 */
> 
> This whole comment can be more succinct.  Better yet, merge it with the comment
> below (out of sight in this diff) and opportunistically update that comment to
> reflect what actually happens if L2 is the first to write a non-zero value (arguably
> a bug that should be fixed, but meh).  The IBPB MSR has the same flaw.  :-/
> 
>> +		if (boot_cpu_has(X86_FEATURE_IBRS_ENHANCED) && data == BIT(0)) {
> 
> Use SPEC_CTRL_IBRS instead of open coding "BIT(0)", then a chunk of the comment
> goes away.
> 
>> +			vmx->spec_ctrl = data;
>> +			break;
>> +		}
> 
> There's no need for a separate if statement.  And the boot_cpu_has() check can
> be dropped, kvm_spec_ctrl_test_value() has already verified the bit is writable
> (unless you're worried about bit 0 being used for something else?)
> 
>> +
>> 		vmx->spec_ctrl = data;
>> 		if (!data)
>> 			break;
>> @@ -6887,19 +6906,22 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
>> 	vmx_vcpu_enter_exit(vcpu, vmx);
>> 
>> 	/*
>> -	 * We do not use IBRS in the kernel. If this vCPU has used the
>> -	 * SPEC_CTRL MSR it may have left it on; save the value and
>> -	 * turn it off. This is much more efficient than blindly adding
>> -	 * it to the atomic save/restore list. Especially as the former
>> -	 * (Saving guest MSRs on vmexit) doesn't even exist in KVM.
>> -	 *
>> -	 * For non-nested case:
>> -	 * If the L01 MSR bitmap does not intercept the MSR, then we need to
>> -	 * save it.
>> +	 * SDM 25.1.3 - handle conditional exit for MSR_IA32_SPEC_CTRL.
>> +	 * To prevent constant VM exits for SPEC_CTRL, kernel may
> 
> Please wrap at 80 chars (ignore the bad examples in KVM).
> 
>> +	 * disable interception in the MSR bitmap for SPEC_CTRL MSR,
>> +	 * such that the guest can read and write to that MSR without
>> +	 * trapping to KVM; however, the guest may set a different
>> +	 * value than the host. For exit handling, do rdmsr below if
>> +	 * interception is disabled, such that we can save the guest
>> +	 * value for restore on VM entry, as it does not get saved
>> +	 * automatically per SDM 27.3.1.
>> 	 *
>> -	 * For nested case:
>> -	 * If the L02 MSR bitmap does not intercept the MSR, then we need to
>> -	 * save it.
>> +	 * This behavior is optimized on eIBRS enabled systems, such
>> +	 * that the kernel only disables interception for MSR_IA32_SPEC_CTRL
>> +	 * when guests choose to use additional SPEC_CTRL features
>> +	 * above and beyond IBRS, such as STIBP or SSBD. This
>> +	 * optimization allows the kernel to avoid doing the expensive
>> +	 * rdmsr below.
> 
> I don't see any reason to restate why the MSR may or may not be intercepted, just
> explain when the value needs to be read.
> 
> E.g. something like this for the whole patch?

Sean - Thanks for the feedback as always, good stuff. I’ll pull this together,
double-check the testing side, and send out a v2 with all of this incorporated.

> 
> ---
> arch/x86/kvm/vmx/vmx.c | 60 +++++++++++++++++++-----------------------
> 1 file changed, 27 insertions(+), 33 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 610355b9ccce..70d863a7882d 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -2057,20 +2057,30 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> 			return 1;
> 
> 		vmx->spec_ctrl = data;
> -		if (!data)
> +
> +		/*
> +		 * Disable interception on the first non-zero write, unless the
> +		 * guest is setting only SPEC_CTRL_IBRS, which is typically set
> +		 * once at boot and never touched again.  All other bits are
> +		 * often set on a per-task basis, i.e. may change frequently,
> +		 * so the benefit of avoiding VM-exits during guest context
> +		 * switches outweighs the cost of RDMSR on every VM-Exit to
> +		 * save the guest's value.
> +		 */
> +		if (!data || data == SPEC_CTRL_IBRS)
> 			break;
> 
> 		/*
> -		 * For non-nested:
> -		 * When it's written (to non-zero) for the first time, pass
> -		 * it through.
> -		 *
> -		 * For nested:
> -		 * The handling of the MSR bitmap for L2 guests is done in
> -		 * nested_vmx_prepare_msr_bitmap. We should not touch the
> -		 * vmcs02.msr_bitmap here since it gets completely overwritten
> -		 * in the merging. We update the vmcs01 here for L1 as well
> -		 * since it will end up touching the MSR anyway now.
> +		 * Update vmcs01.msr_bitmap even if L2 is active, i.e. disable
> +		 * interception for the vCPU on the first write regardless of
> +		 * whether the WRMSR came from L1 or L2.  vmcs02's bitmap is a
> +		 * combination of vmcs01 and vmcs12 bitmaps, and will be
> +		 * recomputed by nested_vmx_prepare_msr_bitmap() on the next
> +		 * nested VM-Enter.  Note, this does mean that future WRMSRs
> +		 * from L2 will be intercepted until the next nested VM-Exit if
> +		 * L2 was the first to write, but L1 exposing the MSR to L2
> +		 * without first writing it is unlikely and not worth the
> +		 * extra bit of complexity.
> 		 */
> 		vmx_disable_intercept_for_msr(vcpu,
> 					      MSR_IA32_SPEC_CTRL,
> @@ -2098,15 +2108,9 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> 		wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB);
> 
> 		/*
> -		 * For non-nested:
> -		 * When it's written (to non-zero) for the first time, pass
> -		 * it through.
> -		 *
> -		 * For nested:
> -		 * The handling of the MSR bitmap for L2 guests is done in
> -		 * nested_vmx_prepare_msr_bitmap. We should not touch the
> -		 * vmcs02.msr_bitmap here since it gets completely overwritten
> -		 * in the merging.
> +		 * Disable interception on the first IBPB, odds are good IBPB
> +		 * will be a frequent guest action.  See the comment for
> +		 * MSR_IA32_SPEC_CTRL for details on the nested interaction.
> 		 */
> 		vmx_disable_intercept_for_msr(vcpu, MSR_IA32_PRED_CMD, MSR_TYPE_W);
> 		break;
> @@ -6887,19 +6891,9 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
> 	vmx_vcpu_enter_exit(vcpu, vmx);
> 
> 	/*
> -	 * We do not use IBRS in the kernel. If this vCPU has used the
> -	 * SPEC_CTRL MSR it may have left it on; save the value and
> -	 * turn it off. This is much more efficient than blindly adding
> -	 * it to the atomic save/restore list. Especially as the former
> -	 * (Saving guest MSRs on vmexit) doesn't even exist in KVM.
> -	 *
> -	 * For non-nested case:
> -	 * If the L01 MSR bitmap does not intercept the MSR, then we need to
> -	 * save it.
> -	 *
> -	 * For nested case:
> -	 * If the L02 MSR bitmap does not intercept the MSR, then we need to
> -	 * save it.
> +	 * Save SPEC_CTRL if it may have been written by the guest, the current
> +	 * value in hardware is used by x86_spec_ctrl_restore_host() to avoid
> +	 * WRMSR if the current value matches the host's desired value.
> 	 */
> 	if (unlikely(!msr_write_intercepted(vmx, MSR_IA32_SPEC_CTRL)))
> 		vmx->spec_ctrl = native_read_msr(MSR_IA32_SPEC_CTRL);
> 
> base-commit: 69b59889b0147aa80098936e383b06fec30cdf5c
> --




