On Thu, Jun 1, 2023 at 12:28 PM Jon Kohler <jon@xxxxxxxxxxx> wrote: > > > > > On Jun 1, 2023, at 1:43 PM, Sean Christopherson <seanjc@xxxxxxxxxx> wrote: > > > > On Wed, May 31, 2023, Jon Kohler wrote: > >> > >>> On May 31, 2023, at 2:30 PM, Jim Mattson <jmattson@xxxxxxxxxx> wrote: > >>> > >>> On Wed, May 31, 2023 at 11:17 AM Jon Kohler <jon@xxxxxxxxxxx> wrote: > >>>> Yea, I thought about it. One one hand, simplicity is kingand on the other > >>>> hand, not having to think about this again is nice too. > >>>> > >>>> The challenge in my mind is that on setups where this truly is static, we’d > >>>> be taking some incremental amount of memory to keep the counter around, > > > > Not really. The vCPU structures are already order-2 allocations, increasing the > > size by 8-16 bytes doesn't affect the actual memory usage in practice. Death by > > a thousand cuts is a potential problem, but we're a ways away from crossing back > > over into order-3 allocations. > > > >>>> just to have the same outcome each time. Doesn’t feel right (to me) unless that is > >>>> also used for “other” stuff as some sort of general purpose/common counter. > > > > ... > > > >> Yes, there is places this could be stuffed I’m sure. Still feels a bit heavy > >> handed for the same-outcome-every-time situations though. > > > > There's no guarantee the outcome will be the same. You're assuming that (a) the > > guest is eIBRS aware, (b) SPEC_CTRL doesn't get extended for future mitigations, > > and (c) that if L1 is running VMs of its own, that L1 is advertising eIBRS to L2 > > and that the L2 kernel is also aware of eIBRS. > > > >>>> RE Cost: I can’t put my finger on it, but I swear that RDMSR for *this* > >>>> specific MSR is more expensive than any other RDMSR I’ve come across > >>>> for run-of-the-mill random MSRs. I flipped thru the SDM and the mitigations > >>>> documentation, and it only ever mentions that there is a notable cost to > >>>> do WRMSR IA32_SPEC_CTRL, but nothing about the RDMSR side. > >>>> > >>>> If anyone happens to know from an Intel-internals perspective, I’d be quite > >>>> interested to know why it just “feels” so darn costly. i.e. is the proc also doing > >>>> special things under the covers, similar to what the processor does on > >>>> writes to this one? > >>> > >>> What do you mean by "feels"? Have you measured it? > >> > >> There are plenty of rdmsr’s scattered around the entry and exit paths that get > >> hit every time, but this is far and away always the most expensive one when > >> profiling with perf top. I haven’t measured it separately from the existing code, > >> But rather noted during profiling that it appears to be nastier than others. > >> > >> I’m more curious than anything else, but it doesn’t matter all that much going > >> forward since this commit will nuke it from orbit for the run of the mill > >> eIBRS-only use cases. > > > > As above, you're making multiple assumptions that may or may not hold true. I > > agree with Jim, reacting to what the guest is actually doing is more robust than > > assuming the guest will do XYZ based on the vCPU model or some other heuristic. > > > > The code isn't that complex, and KVM can even reuse the number of exits snapshot > > to periodically re-enable the intercept, e.g. to avoid unnecessary RDMSRs if the > > vCPU stops writing MSR_IA32_SPEC_CTRL for whatever reason. Bonus! It's nice to be able to respond to a state change. For instance, L1 might write IA32_SPEC_CTRL like crazy while running a certain L2, and then it might go back to not writing it at all. > > Needs actual testing and informed magic numbers, but I think this captures the > > gist of what Jim is suggesting. > Thanks, Sean, Jim. I agree that having something robust and lightweight would be > real nice here. Thanks, Sean for the suggested code. I’ll take that, do some > testing, and report back. > > > > --- > > arch/x86/include/asm/kvm_host.h | 3 +++ > > arch/x86/kvm/svm/svm.c | 22 ++++++++-------------- > > arch/x86/kvm/vmx/vmx.c | 28 ++++++++++------------------ > > arch/x86/kvm/x86.h | 24 ++++++++++++++++++++++++ > > 4 files changed, 45 insertions(+), 32 deletions(-) > > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h > > index fb9d1f2d6136..3fdb6048cd58 100644 > > --- a/arch/x86/include/asm/kvm_host.h > > +++ b/arch/x86/include/asm/kvm_host.h > > @@ -966,6 +966,9 @@ struct kvm_vcpu_arch { > > /* Host CPU on which VM-entry was most recently attempted */ > > int last_vmentry_cpu; > > > > + u32 nr_quick_spec_ctrl_writes; > > + u64 spec_ctrl_nr_exits_snapshot; > > + > > /* AMD MSRC001_0015 Hardware Configuration */ > > u64 msr_hwcr; > > > > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c > > index ca32389f3c36..f749613204d3 100644 > > --- a/arch/x86/kvm/svm/svm.c > > +++ b/arch/x86/kvm/svm/svm.c > > @@ -2959,21 +2959,10 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) > > svm->vmcb->save.spec_ctrl = data; > > else > > svm->spec_ctrl = data; > > - if (!data) > > - break; > > > > - /* > > - * For non-nested: > > - * When it's written (to non-zero) for the first time, pass > > - * it through. > > - * > > - * For nested: > > - * The handling of the MSR bitmap for L2 guests is done in > > - * nested_svm_vmrun_msrpm. > > - * We update the L1 MSR bit as well since it will end up > > - * touching the MSR anyway now. > > - */ > > - set_msr_interception(vcpu, svm->msrpm, MSR_IA32_SPEC_CTRL, 1, 1); > > + if (!msr->host_initiated && > > + kvm_account_msr_spec_ctrl_write(vcpu)) > > + set_msr_interception(vcpu, svm->msrpm, MSR_IA32_SPEC_CTRL, 1, 1); > > break; > > case MSR_AMD64_VIRT_SPEC_CTRL: > > if (!msr->host_initiated && > > @@ -4158,6 +4147,11 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu) > > > > svm_complete_interrupts(vcpu); > > > > + if (!static_cpu_has(X86_FEATURE_V_SPEC_CTRL) && > > + !spec_ctrl_intercepted && > > + kvm_account_msr_spec_ctrl_passthrough(vcpu)) > > + set_msr_interception(vcpu, svm->msrpm, MSR_IA32_SPEC_CTRL, 0, 0); > > + > > if (is_guest_mode(vcpu)) > > return EXIT_FASTPATH_NONE; > > > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c > > index 44fb619803b8..4f4a2c3507bc 100644 > > --- a/arch/x86/kvm/vmx/vmx.c > > +++ b/arch/x86/kvm/vmx/vmx.c > > @@ -2260,24 +2260,11 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) > > return 1; > > > > vmx->spec_ctrl = data; > > - if (!data) > > - break; > > > > - /* > > - * For non-nested: > > - * When it's written (to non-zero) for the first time, pass > > - * it through. > > - * > > - * For nested: > > - * The handling of the MSR bitmap for L2 guests is done in > > - * nested_vmx_prepare_msr_bitmap. We should not touch the > > - * vmcs02.msr_bitmap here since it gets completely overwritten > > - * in the merging. We update the vmcs01 here for L1 as well > > - * since it will end up touching the MSR anyway now. > > - */ > > - vmx_disable_intercept_for_msr(vcpu, > > - MSR_IA32_SPEC_CTRL, > > - MSR_TYPE_RW); > > + if (msr_info->host_initiated && > > + kvm_account_msr_spec_ctrl_write(vcpu)) > > + vmx_disable_intercept_for_msr(vcpu, MSR_IA32_SPEC_CTRL, > > + MSR_TYPE_RW); > > break; > > case MSR_IA32_TSX_CTRL: > > if (!msr_info->host_initiated && > > @@ -7192,6 +7179,7 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu, > > static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu) > > { > > struct vcpu_vmx *vmx = to_vmx(vcpu); > > + unsigned int run_flags = __vmx_vcpu_run_flags(vmx); > > unsigned long cr3, cr4; > > > > /* Record the guest's net vcpu time for enforced NMI injections. */ > > @@ -7280,7 +7268,7 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu) > > kvm_wait_lapic_expire(vcpu); > > > > /* The actual VMENTER/EXIT is in the .noinstr.text section. */ > > - vmx_vcpu_enter_exit(vcpu, __vmx_vcpu_run_flags(vmx)); > > + vmx_vcpu_enter_exit(vcpu, run_flags); > > > > /* All fields are clean at this point */ > > if (kvm_is_using_evmcs()) { > > @@ -7346,6 +7334,10 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu) > > vmx_recover_nmi_blocking(vmx); > > vmx_complete_interrupts(vmx); > > > > + if ((run_flags & VMX_RUN_SAVE_SPEC_CTRL) && > > + kvm_account_msr_spec_ctrl_passthrough(vcpu)) > > + vmx_enable_intercept_for_msr(vcpu, MSR_IA32_SPEC_CTRL, MSR_TYPE_RW); > > + > > if (is_guest_mode(vcpu)) > > return EXIT_FASTPATH_NONE; > > > > diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h > > index c544602d07a3..454bcbf5b543 100644 > > --- a/arch/x86/kvm/x86.h > > +++ b/arch/x86/kvm/x86.h > > @@ -492,7 +492,31 @@ static inline void kvm_machine_check(void) > > > > void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu); > > void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu); > > + > > int kvm_spec_ctrl_test_value(u64 value); > > + > > +static inline bool kvm_account_msr_spec_ctrl_write(struct kvm_vcpu *vcpu) > > +{ > > + if ((vcpu->stat.exits - vcpu->arch.spec_ctrl_nr_exits_snapshot) < 20) I think you mean 200 here. If it's bad to have more than 1 WRMSR(IA32_SPEC_CTRL) VM-exit in 20 VM-exits, then more than 10 such VM-exits in 200 VM-exits represents sustained badness. (Although, as Sean noted, these numbers are just placeholders.) > > + vcpu->arch.nr_quick_spec_ctrl_writes++; > > + else > > + vcpu->arch.nr_quick_spec_ctrl_writes = 0; > > + > > + vcpu->arch.spec_ctrl_nr_exits_snapshot = vcpu->stat.exits; > > + > > + return vcpu->arch.nr_quick_spec_ctrl_writes >= 10; > > +} > > + > > +static inline bool kvm_account_msr_spec_ctrl_passthrough(struct kvm_vcpu *vcpu) > > +{ > > + if ((vcpu->stat.exits - vcpu->arch.spec_ctrl_nr_exits_snapshot) < 100000) > > + return false; > > + > > + vcpu->arch.spec_ctrl_nr_exits_snapshot = vcpu->stat.exits; > > + vcpu->arch.nr_quick_spec_ctrl_writes = 0; > > + return true; > > +} > > + > > bool __kvm_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4); > > int kvm_handle_memory_failure(struct kvm_vcpu *vcpu, int r, > > struct x86_exception *e); > > > > base-commit: 39428f6ea9eace95011681628717062ff7f5eb5f > > -- >