> On Jun 1, 2023, at 1:43 PM, Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Wed, May 31, 2023, Jon Kohler wrote:
>>
>>> On May 31, 2023, at 2:30 PM, Jim Mattson <jmattson@xxxxxxxxxx> wrote:
>>>
>>> On Wed, May 31, 2023 at 11:17 AM Jon Kohler <jon@xxxxxxxxxxx> wrote:
>>>> Yea, I thought about it. On one hand, simplicity is king, and on the other
>>>> hand, not having to think about this again is nice too.
>>>>
>>>> The challenge in my mind is that on setups where this truly is static, we’d
>>>> be taking some incremental amount of memory to keep the counter around,
>
> Not really. The vCPU structures are already order-2 allocations, so increasing the
> size by 8-16 bytes doesn't affect the actual memory usage in practice. Death by
> a thousand cuts is a potential problem, but we're a ways away from crossing back
> over into order-3 allocations.
>
>>>> just to have the same outcome each time. Doesn’t feel right (to me) unless that is
>>>> also used for “other” stuff as some sort of general purpose/common counter.
>
> ...
>
>> Yes, there are places this could be stuffed, I’m sure. Still feels a bit heavy
>> handed for the same-outcome-every-time situations though.
>
> There's no guarantee the outcome will be the same. You're assuming that (a) the
> guest is eIBRS aware, (b) SPEC_CTRL doesn't get extended for future mitigations,
> and (c) that if L1 is running VMs of its own, that L1 is advertising eIBRS to L2
> and that the L2 kernel is also aware of eIBRS.
>
>>>> RE Cost: I can’t put my finger on it, but I swear that RDMSR for *this*
>>>> specific MSR is more expensive than any other RDMSR I’ve come across
>>>> for run-of-the-mill random MSRs. I flipped thru the SDM and the mitigations
>>>> documentation, and it only ever mentions that there is a notable cost to
>>>> do WRMSR IA32_SPEC_CTRL, but nothing about the RDMSR side.
>>>>
>>>> If anyone happens to know from an Intel-internals perspective, I’d be quite
>>>> interested to know why it just “feels” so darn costly. i.e., is the proc also doing
>>>> special things under the covers, similar to what the processor does on
>>>> writes to this one?
>>>
>>> What do you mean by "feels"? Have you measured it?
>>
>> There are plenty of RDMSRs scattered around the entry and exit paths that get
>> hit every time, but this is far and away always the most expensive one when
>> profiling with perf top. I haven’t measured it separately from the existing code,
>> but rather noted during profiling that it appears to be nastier than others.
>>
>> I’m more curious than anything else, but it doesn’t matter all that much going
>> forward since this commit will nuke it from orbit for the run-of-the-mill
>> eIBRS-only use cases.
>
> As above, you're making multiple assumptions that may or may not hold true. I
> agree with Jim: reacting to what the guest is actually doing is more robust than
> assuming the guest will do XYZ based on the vCPU model or some other heuristic.
>
> The code isn't that complex, and KVM can even reuse the number of exits snapshot
> to periodically re-enable the intercept, e.g. to avoid unnecessary RDMSRs if the
> vCPU stops writing MSR_IA32_SPEC_CTRL for whatever reason.
>
> Needs actual testing and informed magic numbers, but I think this captures the
> gist of what Jim is suggesting.

Thanks, Sean, Jim. I agree that having something robust and lightweight would be
real nice here. Thanks, Sean, for the suggested code. I’ll take that, do some
testing, and report back.
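As an aside on the RDMSR cost question: I still haven't timed that read in
isolation, but when I get to testing I'll probably throw together a throwaway
kernel module along these lines (rough, untested sketch only; the module and
function names are made up, it assumes the host actually exposes
IA32_SPEC_CTRL, and comparing against some other innocuous MSR would be needed
for a baseline):

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/errno.h>
#include <linux/irqflags.h>
#include <asm/msr.h>
#include <asm/msr-index.h>

static int __init spec_ctrl_rdmsr_bench_init(void)
{
	const int iters = 100000;
	unsigned long flags;
	u64 start, end, val = 0;
	int i;

	/* Keep IRQs off so the loop isn't perturbed mid-measurement. */
	local_irq_save(flags);
	start = rdtsc_ordered();
	for (i = 0; i < iters; i++)
		rdmsrl(MSR_IA32_SPEC_CTRL, val);
	end = rdtsc_ordered();
	local_irq_restore(flags);

	pr_info("RDMSR IA32_SPEC_CTRL: ~%llu cycles/read (last value 0x%llx)\n",
		(end - start) / iters, val);

	/* Fail the load on purpose so there's nothing to unload afterwards. */
	return -ENODEV;
}

module_init(spec_ctrl_rdmsr_bench_init);
MODULE_LICENSE("GPL");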
>
> ---
>  arch/x86/include/asm/kvm_host.h |  3 +++
>  arch/x86/kvm/svm/svm.c          | 22 ++++++++--------------
>  arch/x86/kvm/vmx/vmx.c          | 28 ++++++++++------------------
>  arch/x86/kvm/x86.h              | 24 ++++++++++++++++++++++++
>  4 files changed, 45 insertions(+), 32 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index fb9d1f2d6136..3fdb6048cd58 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -966,6 +966,9 @@ struct kvm_vcpu_arch {
>  	/* Host CPU on which VM-entry was most recently attempted */
>  	int last_vmentry_cpu;
>
> +	u32 nr_quick_spec_ctrl_writes;
> +	u64 spec_ctrl_nr_exits_snapshot;
> +
>  	/* AMD MSRC001_0015 Hardware Configuration */
>  	u64 msr_hwcr;
>
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index ca32389f3c36..f749613204d3 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -2959,21 +2959,10 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
>  			svm->vmcb->save.spec_ctrl = data;
>  		else
>  			svm->spec_ctrl = data;
> -		if (!data)
> -			break;
>
> -		/*
> -		 * For non-nested:
> -		 * When it's written (to non-zero) for the first time, pass
> -		 * it through.
> -		 *
> -		 * For nested:
> -		 * The handling of the MSR bitmap for L2 guests is done in
> -		 * nested_svm_vmrun_msrpm.
> -		 * We update the L1 MSR bit as well since it will end up
> -		 * touching the MSR anyway now.
> -		 */
> -		set_msr_interception(vcpu, svm->msrpm, MSR_IA32_SPEC_CTRL, 1, 1);
> +		if (!msr->host_initiated &&
> +		    kvm_account_msr_spec_ctrl_write(vcpu))
> +			set_msr_interception(vcpu, svm->msrpm, MSR_IA32_SPEC_CTRL, 1, 1);
>  		break;
>  	case MSR_AMD64_VIRT_SPEC_CTRL:
>  		if (!msr->host_initiated &&
> @@ -4158,6 +4147,11 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu)
>
>  	svm_complete_interrupts(vcpu);
>
> +	if (!static_cpu_has(X86_FEATURE_V_SPEC_CTRL) &&
> +	    !spec_ctrl_intercepted &&
> +	    kvm_account_msr_spec_ctrl_passthrough(vcpu))
> +		set_msr_interception(vcpu, svm->msrpm, MSR_IA32_SPEC_CTRL, 0, 0);
> +
>  	if (is_guest_mode(vcpu))
>  		return EXIT_FASTPATH_NONE;
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 44fb619803b8..4f4a2c3507bc 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -2260,24 +2260,11 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  			return 1;
>
>  		vmx->spec_ctrl = data;
> -		if (!data)
> -			break;
>
> -		/*
> -		 * For non-nested:
> -		 * When it's written (to non-zero) for the first time, pass
> -		 * it through.
> -		 *
> -		 * For nested:
> -		 * The handling of the MSR bitmap for L2 guests is done in
> -		 * nested_vmx_prepare_msr_bitmap. We should not touch the
> -		 * vmcs02.msr_bitmap here since it gets completely overwritten
> -		 * in the merging. We update the vmcs01 here for L1 as well
> -		 * since it will end up touching the MSR anyway now.
> -		 */
> -		vmx_disable_intercept_for_msr(vcpu,
> -					      MSR_IA32_SPEC_CTRL,
> -					      MSR_TYPE_RW);
> +		if (!msr_info->host_initiated &&
> +		    kvm_account_msr_spec_ctrl_write(vcpu))
> +			vmx_disable_intercept_for_msr(vcpu, MSR_IA32_SPEC_CTRL,
> +						      MSR_TYPE_RW);
>  		break;
>  	case MSR_IA32_TSX_CTRL:
>  		if (!msr_info->host_initiated &&
> @@ -7192,6 +7179,7 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
>  static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
>  {
>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +	unsigned int run_flags = __vmx_vcpu_run_flags(vmx);
>  	unsigned long cr3, cr4;
>
>  	/* Record the guest's net vcpu time for enforced NMI injections. */
> @@ -7280,7 +7268,7 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
>  	kvm_wait_lapic_expire(vcpu);
>
>  	/* The actual VMENTER/EXIT is in the .noinstr.text section. */
> -	vmx_vcpu_enter_exit(vcpu, __vmx_vcpu_run_flags(vmx));
> +	vmx_vcpu_enter_exit(vcpu, run_flags);
>
>  	/* All fields are clean at this point */
>  	if (kvm_is_using_evmcs()) {
> @@ -7346,6 +7334,10 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
>  	vmx_recover_nmi_blocking(vmx);
>  	vmx_complete_interrupts(vmx);
>
> +	if ((run_flags & VMX_RUN_SAVE_SPEC_CTRL) &&
> +	    kvm_account_msr_spec_ctrl_passthrough(vcpu))
> +		vmx_enable_intercept_for_msr(vcpu, MSR_IA32_SPEC_CTRL, MSR_TYPE_RW);
> +
>  	if (is_guest_mode(vcpu))
>  		return EXIT_FASTPATH_NONE;
>
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index c544602d07a3..454bcbf5b543 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -492,7 +492,31 @@ static inline void kvm_machine_check(void)
>
>  void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu);
>  void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu);
> +
>  int kvm_spec_ctrl_test_value(u64 value);
> +
> +static inline bool kvm_account_msr_spec_ctrl_write(struct kvm_vcpu *vcpu)
> +{
> +	if ((vcpu->stat.exits - vcpu->arch.spec_ctrl_nr_exits_snapshot) < 20)
> +		vcpu->arch.nr_quick_spec_ctrl_writes++;
> +	else
> +		vcpu->arch.nr_quick_spec_ctrl_writes = 0;
> +
> +	vcpu->arch.spec_ctrl_nr_exits_snapshot = vcpu->stat.exits;
> +
> +	return vcpu->arch.nr_quick_spec_ctrl_writes >= 10;
> +}
> +
> +static inline bool kvm_account_msr_spec_ctrl_passthrough(struct kvm_vcpu *vcpu)
> +{
> +	if ((vcpu->stat.exits - vcpu->arch.spec_ctrl_nr_exits_snapshot) < 100000)
> +		return false;
> +
> +	vcpu->arch.spec_ctrl_nr_exits_snapshot = vcpu->stat.exits;
> +	vcpu->arch.nr_quick_spec_ctrl_writes = 0;
> +	return true;
> +}
> +
>  bool __kvm_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
>  int kvm_handle_memory_failure(struct kvm_vcpu *vcpu, int r,
>  			      struct x86_exception *e);
>
> base-commit: 39428f6ea9eace95011681628717062ff7f5eb5f
> --
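For the testing side, before wiring this into KVM I'll likely sanity check the
counter behavior and the magic numbers with a small user-space mock of the two
helpers, something like the below (purely illustrative; the struct and function
names are invented, and it ignores host_initiated and the nested cases):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for the vCPU fields the real helpers use. */
struct sim_vcpu {
	uint64_t exits;
	uint64_t snapshot;	/* spec_ctrl_nr_exits_snapshot */
	uint32_t quick_writes;	/* nr_quick_spec_ctrl_writes */
};

/* Mirrors kvm_account_msr_spec_ctrl_write(): called on each intercepted write. */
static bool account_write(struct sim_vcpu *v)
{
	if (v->exits - v->snapshot < 20)
		v->quick_writes++;
	else
		v->quick_writes = 0;
	v->snapshot = v->exits;
	return v->quick_writes >= 10;
}

/* Mirrors kvm_account_msr_spec_ctrl_passthrough(): called while passed through. */
static bool account_passthrough(struct sim_vcpu *v)
{
	if (v->exits - v->snapshot < 100000)
		return false;
	v->snapshot = v->exits;
	v->quick_writes = 0;
	return true;
}

int main(void)
{
	struct sim_vcpu v = { 0 };
	bool intercepted = true;
	int i;

	/* Guest writes SPEC_CTRL every couple of exits: passthrough after ~10 writes. */
	for (i = 0; i < 12; i++) {
		v.exits += 2;
		if (intercepted && account_write(&v))
			intercepted = false;
	}
	printf("after write burst: intercepted=%d\n", intercepted);	/* 0 */

	/* Long stretch of exits with no writes: interception comes back. */
	v.exits += 200000;
	if (!intercepted && account_passthrough(&v))
		intercepted = true;
	printf("after quiet period: intercepted=%d\n", intercepted);	/* 1 */
	return 0;
}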