Hi Sean,

Thanks for reviewing my patches.

On 10/15/2024 11:19 PM, Sean Christopherson wrote:
> On Fri, Oct 04, 2024, Manali Shukla wrote:
>> When a VMRUN instruction is executed, the bus lock threshold count is
>> loaded into an internal count register. Before the processor executes
>> a bus lock in the guest, it checks the value of this register:
>>
>> - If the value is greater than '0', the processor successfully
>>   executes the bus lock and decrements the count.
>>
>> - If the value is '0', the bus lock is not executed, and a #VMEXIT to
>>   the VMM is taken.
>>
>> The bus lock threshold #VMEXIT is reported to the VMM with the VMEXIT
>> code A5h, SVM_EXIT_BUS_LOCK.
>
> I vote to split this into two patches: one to add the architectural collateral,
> with the above as the changelog, and a second to actually implement support in
> KVM. Having the above background is useful, but it makes it quite hard to find
> information on the KVM design and implementation.
>

Sure. I will split it into 2 patches.

>> This implementation is set up to intercept every guest bus lock.
>
> "This implementation" is a variation on "This patch". Drop it, and simply state
> what the patch is doing.
>
>> initial value of the Bus Lock Threshold counter is '0' in the VMCB, so
>> the very first bus lock will exit, set the Bus Lock Threshold counter
>> to '1' (so that the offending instruction can make forward progress)
>> but then the value is at '0' again so the next bus lock will exit.
>>
>> The generated SVM_EXIT_BUS_LOCK in kvm will exit to user space by
>
> s/kvm/KVM
>
> And again, the tone is wrong.
>
> Something like this is what I would aim for:
>
>   Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM CPUs with Bus Lock
>   Threshold, which is close enough to VMX's Bus Lock Detection VM-Exit to
>   allow reusing KVM_CAP_X86_BUS_LOCK_EXIT.
>
>   The biggest difference between the two features is that Threshold is
>   fault-like, whereas Detection is trap-like. To allow the guest to make
>   forward progress, Threshold provides a per-VMCB counter which is
>   decremented every time a bus lock occurs, and a VM-Exit is triggered if
>   and only if the counter is '0'.
>
>   To provide Detection-like semantics, initialize the counter to '0', i.e.
>   exit on every bus lock, and when re-executing the guilty instruction, set
>   the counter to '1' to effectively step past the instruction.
>
>   Note, in the unlikely scenario that re-executing the instruction doesn't
>   trigger a bus lock, e.g. because the guest has changed memory types or
>   patched the guilty instruction, the bus lock counter will be left at '1',
>   i.e. the guest will be able to do a bus lock on a different instruction.
>   In a perfect world, KVM would ensure the counter is '0' if the guest has
>   made forward progress, e.g. if RIP has changed. But trying to close that
>   hole would incur non-trivial complexity, for marginal benefit; the intent
>   of KVM_CAP_X86_BUS_LOCK_EXIT is to allow userspace to rate-limit bus locks,
>   not to allow for precise detection of problematic guest code. And, it's
>   simply not feasible to fully close the hole, e.g. if an interrupt arrives
>   before the original instruction can re-execute, the guest could step past
>   a different bus lock.
>
>> setting the KVM_RUN_BUS_LOCK flag in vcpu->run->flags, indicating to
>> the user space that a bus lock has been detected in the guest.
>>
>> Use the already available KVM capability KVM_CAP_X86_BUS_LOCK_EXIT to
>> enable the feature. This feature is disabled by default, it can be
>> enabled explicitly from user space.
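A small aside for context, since the changelog only says the capability is
enabled explicitly from user space: the intended userspace flow is the same as
for the existing VMX Bus Lock Detection support. A minimal sketch is below;
vm_fd/vcpu_fd/run come from the usual KVM setup and the throttling policy is
just a placeholder, none of this is part of the patch itself.

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Opt in per VM (before running vCPUs), reusing the existing capability. */
static void enable_bus_lock_exit(int vm_fd)
{
	struct kvm_enable_cap cap = {
		.cap = KVM_CAP_X86_BUS_LOCK_EXIT,
		.args[0] = KVM_BUS_LOCK_DETECTION_EXIT,
	};

	ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}

/*
 * In the run loop, a detected bus lock shows up as a dedicated exit reason
 * and/or as a flag piggybacked on another exit; the rate-limiting policy
 * itself is left entirely to userspace.
 */
static void run_vcpu_once(int vcpu_fd, struct kvm_run *run)
{
	ioctl(vcpu_fd, KVM_RUN, NULL);

	if (run->exit_reason == KVM_EXIT_X86_BUS_LOCK ||
	    (run->flags & KVM_RUN_X86_BUS_LOCK))
		; /* e.g. sleep briefly to throttle the offending vCPU */
}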
>>
>> More details about the Bus Lock Threshold feature can be found in AMD
>> APM [1].
>
> ...
>
>> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
>> index d5314cb7dff4..ca1c42201894 100644
>> --- a/arch/x86/kvm/svm/nested.c
>> +++ b/arch/x86/kvm/svm/nested.c
>> @@ -669,6 +669,11 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm,
>>  	vmcb02->control.iopm_base_pa = vmcb01->control.iopm_base_pa;
>>  	vmcb02->control.msrpm_base_pa = vmcb01->control.msrpm_base_pa;
>>
>> +	/*
>> +	 * The bus lock threshold count should keep running across nested
>> +	 * transitions. Copy the buslock threshold count from vmcb01 to vmcb02.
>
> No, it should be reset to '0'. The bus lock can't have happened on VMRUN, because KVM
> is emulating the VMRUN. That leaves two possibilities: the bus lock happened in
> L1 on an instruction before VMRUN, or the bus lock happened in _an_ L2, before a
> nested VM-Exit to L1 occurred.
>
> In the first case, the bus lock firmly happened on an instruction in the past.
> Even if vmcb01's counter is still '1', e.g. because the guilty instruction got
> patched, the vCPU has clearly made forward progress and so KVM should reset vmcb02's
> counter to '0'.
>
> In the second case, KVM has no idea if L2 has made forward progress. The only
> way to _try to_ detect if L2 has made forward progress would be to track the
> counter on a per-vmcb12 basis, but even that is flawed because KVM doesn't have
> visibility into L1's management of L2.
>
> I do think we may need to stash vmcb02's counter though. E.g. if userspace rate-
> limits the vCPU, then it's entirely possible that L1's tick interrupt is pending
> by the time userspace re-runs the vCPU. If KVM unconditionally clears the counter
> on VMRUN, then when L1 re-enters L2, the same instruction will trigger a VM-Exit
> and the entire cycle starts over.
>
> Anything we do is going to be imperfect, but the best idea I can come up with is
> also relatively simple, especially in conjunction with my suggestion below. If
> KVM tracks the RIP that triggered the bus lock, then on nested VM-Exit KVM can
> propagate that RIP into svm_nested_state as appropriate. E.g.
>
> 	if (vmcb02->control.bus_lock_counter &&
> 	    svm->bus_lock_rip == vmcb02->save.rip)
> 		svm->nested.bus_lock_rip = svm->bus_lock_rip;
> 	else
> 		svm->nested.bus_lock_rip = INVALID_GVA; /* or '0', as much as that bugs me */
>
> and then on nested VMRUN
>
> 	if (svm->nested.bus_lock_rip == vmcb02->save.rip) {
> 		vmcb02->control.bus_lock_counter = 1;
> 		svm->bus_lock_rip = svm->nested.bus_lock_rip;
> 	} else {
> 		vmcb02->control.bus_lock_counter = 0;
> 	}
>
> 	svm->nested.bus_lock_rip = INVALID_GVA;
>
> Again, it's imperfect, e.g. if L1 happens to run a different vCPU at the same
> RIP, then KVM will allow a bus lock for the wrong vCPU. But the odds of that
> happening are absurdly low unless L1 is deliberately doing weird things, and so
> I don't think it's worth worrying about.
>
>> +	 */
>> +	vmcb02->control.bus_lock_counter = vmcb01->control.bus_lock_counter;
>>  	/* Done at vmrun: asid. */
>>
>>  	/* Also overwritten later if necessary. */
>> @@ -1035,6 +1040,11 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
>>
>>  	}
>>
>> +	/*
>> +	 * The bus lock threshold count should keep running across nested
>> +	 * transitions. Copy the buslock threshold count from vmcb02 to vmcb01.
>> +	 */
>> +	vmcb01->control.bus_lock_counter = vmcb02->control.bus_lock_counter;
>
> KVM should definitely reset the counter to '0' on a nested VM-Exit, because L1
> can't be interrupted by L2, i.e. there is zero chance that KVM needs to allow a
> bus lock in L1 to ensure L1 makes forward progress.
>
>>  	nested_svm_copy_common_state(svm->nested.vmcb02.ptr, svm->vmcb01.ptr);
>>
>>  	svm_switch_vmcb(svm, &svm->vmcb01);
>> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
>> index 9df3e1e5ae81..0d0407f1ee6a 100644
>> --- a/arch/x86/kvm/svm/svm.c
>> +++ b/arch/x86/kvm/svm/svm.c
>> @@ -1372,6 +1372,9 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
>>  		svm->vmcb->control.int_ctl |= V_GIF_ENABLE_MASK;
>>  	}
>>
>> +	if (vcpu->kvm->arch.bus_lock_detection_enabled)
>> +		svm_set_intercept(svm, INTERCEPT_BUSLOCK);
>> +
>>  	if (sev_guest(vcpu->kvm))
>>  		sev_init_vmcb(svm);
>>
>> @@ -3286,6 +3289,24 @@ static int invpcid_interception(struct kvm_vcpu *vcpu)
>>  	return kvm_handle_invpcid(vcpu, type, gva);
>>  }
>>
>> +static int bus_lock_exit(struct kvm_vcpu *vcpu)
>> +{
>> +	struct vcpu_svm *svm = to_svm(vcpu);
>> +
>> +	vcpu->run->exit_reason = KVM_EXIT_X86_BUS_LOCK;
>> +	vcpu->run->flags |= KVM_RUN_X86_BUS_LOCK;
>> +
>> +	/*
>> +	 * Reload the counter with value greater than '0'.
>
> The value quite obviously must be exactly '1', not simply greater than '0'. I also
> think this is the wrong place to set the counter. Rather than set the counter at
> the time of exit, KVM should implement a vcpu->arch.complete_userspace_io callback
> and set the counter to '1' if and only if RIP (or LIP, but I have no objection to
> keeping things simple) is unchanged. It's a bit of extra complexity, but it will
> make it super obvious why KVM is setting the counter to '1'. And, if userspace
> wants to stuff state and move past the instruction, e.g. by emulating the guilty
> instruction, then KVM won't unnecessarily allow a bus lock in the guest.
>
> And then the comment can be:
>
> 	/*
> 	 * If userspace has NOT changed RIP, then KVM's ABI is to let the guest
> 	 * execute the bus-locking instruction. Set the bus lock counter to '1'
> 	 * to effectively step past the bus lock.
> 	 */
>

Thank you for highlighting these scenarios (for both nested and normal guests).
I had not thought about them.

I’m currently going through the comments and trying to fully understand them.
I’ll try out the suggested changes and get back to you.

- Manali

>> +	 * The bus lock exit on SVM happens with RIP pointing to the guilty
>> +	 * instruction. So, reloading the value of bus_lock_counter to '0'
>> +	 * results into generating continuous bus lock exits.
>> +	 */
>> +	svm->vmcb->control.bus_lock_counter = 1;
>> +
>> +	return 0;
>> +}
>> +
>>  static int (*const svm_exit_handlers[])(struct kvm_vcpu *vcpu) = {
>>  	[SVM_EXIT_READ_CR0]			= cr_interception,
>>  	[SVM_EXIT_READ_CR3]			= cr_interception,
>> @@ -3353,6 +3374,7 @@ static int (*const svm_exit_handlers[])(struct kvm_vcpu *vcpu) = {
>>  	[SVM_EXIT_CR4_WRITE_TRAP]		= cr_trap,
>>  	[SVM_EXIT_CR8_WRITE_TRAP]		= cr_trap,
>>  	[SVM_EXIT_INVPCID]			= invpcid_interception,
>> +	[SVM_EXIT_BUS_LOCK]			= bus_lock_exit,
>>  	[SVM_EXIT_NPF]				= npf_interception,
>>  	[SVM_EXIT_RSM]				= rsm_interception,
>>  	[SVM_EXIT_AVIC_INCOMPLETE_IPI]		= avic_incomplete_ipi_interception,
>> @@ -5227,6 +5249,11 @@ static __init void svm_set_cpu_caps(void)
>>  		kvm_cpu_cap_set(X86_FEATURE_SVME_ADDR_CHK);
>>  	}
>>
>> +	if (cpu_feature_enabled(X86_FEATURE_BUS_LOCK_THRESHOLD)) {
>> +		pr_info("Bus Lock Threashold supported\n");
>> +		kvm_caps.has_bus_lock_exit = true;
>> +	}
>> +
>>  	/* CPUID 0x80000008 */
>>  	if (boot_cpu_has(X86_FEATURE_LS_CFG_SSBD) ||
>>  	    boot_cpu_has(X86_FEATURE_AMD_SSBD))
>> --
>> 2.34.1
>>
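To make sure I have understood the complete_userspace_io suggestion correctly,
the non-nested piece I have in mind looks roughly like the below. It is an
untested sketch: bus_lock_rip would be a new field in struct vcpu_svm (the same
one your nested VMRUN/VM-Exit snippets use), and complete_userspace_buslock is
just a placeholder name.

static int complete_userspace_buslock(struct kvm_vcpu *vcpu)
{
	struct vcpu_svm *svm = to_svm(vcpu);

	/*
	 * If userspace has NOT changed RIP, then KVM's ABI is to let the
	 * guest execute the bus-locking instruction. Set the bus lock
	 * counter to '1' to effectively step past the bus lock.
	 */
	if (kvm_rip_read(vcpu) == svm->bus_lock_rip)
		svm->vmcb->control.bus_lock_counter = 1;

	return 1;
}

static int bus_lock_exit(struct kvm_vcpu *vcpu)
{
	struct vcpu_svm *svm = to_svm(vcpu);

	vcpu->run->exit_reason = KVM_EXIT_X86_BUS_LOCK;
	vcpu->run->flags |= KVM_RUN_X86_BUS_LOCK;

	/* Remember the guilty RIP; also reused for the nested tracking. */
	svm->bus_lock_rip = kvm_rip_read(vcpu);
	vcpu->arch.complete_userspace_io = complete_userspace_buslock;

	return 0;
}

If the above matches what you had in mind, I will build the nested VMRUN /
VM-Exit handling on top of it in the next version.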