Hi Sean,

Thanks for reviewing my patches.

On 10/15/2024 11:19 PM, Sean Christopherson wrote:
> On Fri, Oct 04, 2024, Manali Shukla wrote:
>> When a VMRUN instruction is executed, the bus lock threshold count is
>> loaded into an internal count register. Before the processor executes
>> a bus lock in the guest, it checks the value of this register:
>>
>> - If the value is greater than '0', the processor successfully
>>   executes the bus lock and decrements the count.
>>
>> - If the value is '0', the bus lock is not executed, and a #VMEXIT to
>>   the VMM is taken.
>>
>> The bus lock threshold #VMEXIT is reported to the VMM with the VMEXIT
>> code A5h, SVM_EXIT_BUS_LOCK.
>
> I vote to split this into two patches: one to add the architectural collateral,
> with the above as the changelog, and a second to actually implement support in
> KVM. Having the above background is useful, but it makes it quite hard to find
> information on the KVM design and implementation.
>

Sure. I will split it into 2 patches.

>> This implementation is set up to intercept every guest bus lock.
>
> "This implementation" is a variation on "This patch". Drop it, and simply state
> what the patch is doing.
>
>> initial value of the Bus Lock Threshold counter is '0' in the VMCB, so
>> the very first bus lock will exit, set the Bus Lock Threshold counter
>> to '1' (so that the offending instruction can make forward progress)
>> but then the value is at '0' again so the next bus lock will exit.
>>
>> The generated SVM_EXIT_BUS_LOCK in kvm will exit to user space by
>
> s/kvm/KVM
>
> And again, the tone is wrong.
>
> Something like this is what I would aim for:
>
>   Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM CPUs with Bus Lock
>   Threshold, which is close enough to VMX's Bus Lock Detection VM-Exit to
>   allow reusing KVM_CAP_X86_BUS_LOCK_EXIT.
>
>   The biggest difference between the two features is that Threshold is
>   fault-like, whereas Detection is trap-like. To allow the guest to make
>   forward progress, Threshold provides a per-VMCB counter which is
>   decremented every time a bus lock occurs, and a VM-Exit is triggered if
>   and only if the counter is '0'.
>
>   To provide Detection-like semantics, initialize the counter to '0', i.e.
>   exit on every bus lock, and when re-executing the guilty instruction, set
>   the counter to '1' to effectively step past the instruction.
>
>   Note, in the unlikely scenario that re-executing the instruction doesn't
>   trigger a bus lock, e.g. because the guest has changed memory types or
>   patched the guilty instruction, the bus lock counter will be left at '1',
>   i.e. the guest will be able to do a bus lock on a different instruction.
>   In a perfect world, KVM would ensure the counter is '0' if the guest has
>   made forward progress, e.g. if RIP has changed. But trying to close that
>   hole would incur non-trivial complexity, for marginal benefit; the intent
>   of KVM_CAP_X86_BUS_LOCK_EXIT is to allow userspace to rate-limit bus locks,
>   not to allow for precise detection of problematic guest code. And, it's
>   simply not feasible to fully close the hole, e.g. if an interrupt arrives
>   before the original instruction can re-execute, the guest could step past
>   a different bus lock.
>
>> setting the KVM_RUN_BUS_LOCK flag in vcpu->run->flags, indicating to
>> the user space that a bus lock has been detected in the guest.
>>
>> Use the already available KVM capability KVM_CAP_X86_BUS_LOCK_EXIT to
>> enable the feature. This feature is disabled by default, it can be
>> enabled explicitly from user space.
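A small aside for context, since the changelog only says the capability is
enabled explicitly from user space: the intended userspace flow is the same as
for the existing VMX Bus Lock Detection support. A minimal sketch is below;
vm_fd/vcpu_fd/run come from the usual KVM setup and the throttling policy is
just a placeholder, none of this is part of the patch itself.

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Opt in per VM (before running vCPUs), reusing the existing capability. */
static void enable_bus_lock_exit(int vm_fd)
{
	struct kvm_enable_cap cap = {
		.cap = KVM_CAP_X86_BUS_LOCK_EXIT,
		.args[0] = KVM_BUS_LOCK_DETECTION_EXIT,
	};

	ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}

/*
 * In the run loop, a detected bus lock shows up as a dedicated exit reason
 * and/or as a flag piggybacked on another exit; the rate-limiting policy
 * itself is left entirely to userspace.
 */
static void run_vcpu_once(int vcpu_fd, struct kvm_run *run)
{
	ioctl(vcpu_fd, KVM_RUN, NULL);

	if (run->exit_reason == KVM_EXIT_X86_BUS_LOCK ||
	    (run->flags & KVM_RUN_X86_BUS_LOCK))
		; /* e.g. sleep briefly to throttle the offending vCPU */
}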
>>
>> More details about the Bus Lock Threshold feature can be found in AMD
>> APM [1].
>
> ...
>
>> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
>> index d5314cb7dff4..ca1c42201894 100644
>> --- a/arch/x86/kvm/svm/nested.c
>> +++ b/arch/x86/kvm/svm/nested.c
>> @@ -669,6 +669,11 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm,
>>  	vmcb02->control.iopm_base_pa = vmcb01->control.iopm_base_pa;
>>  	vmcb02->control.msrpm_base_pa = vmcb01->control.msrpm_base_pa;
>>
>> +	/*
>> +	 * The bus lock threshold count should keep running across nested
>> +	 * transitions. Copy the buslock threshold count from vmcb01 to vmcb02.
>
> No, it should be reset to '0'. The bus lock can't have happened on VMRUN, because KVM
> is emulating the VMRUN. That leaves two possibilities: the bus lock happened in
> L1 on an instruction before VMRUN, or the bus lock happened in _an_ L2, before a
> nested VM-Exit to L1 occurred.
>
> In the first case, the bus lock firmly happened on an instruction in the past.
> Even if vmcb01's counter is still '1', e.g. because the guilty instruction got
> patched, the vCPU has clearly made forward progress and so KVM should reset vmcb02's
> counter to '0'.
>
> In the second case, KVM has no idea if L2 has made forward progress. The only
> way to _try to_ detect if L2 has made forward progress would be to track the
> counter on a per-vmcb12 basis, but even that is flawed because KVM doesn't have
> visibility into L1's management of L2.
>
> I do think we may need to stash vmcb02's counter though. E.g. if userspace rate-
> limits the vCPU, then it's entirely possible that L1's tick interrupt is pending
> by the time userspace re-runs the vCPU. If KVM unconditionally clears the counter
> on VMRUN, then when L1 re-enters L2, the same instruction will trigger a VM-Exit
> and the entire cycle starts over.
>
> Anything we do is going to be imperfect, but the best idea I can come up with is
> also relatively simple, especially in conjunction with my suggestion below. If
> KVM tracks the RIP that triggered the bus lock, then on nested VM-Exit KVM can
> propagate that RIP into svm_nested_state as appropriate. E.g.
>
> 	if (vmcb02->control.bus_lock_counter &&
> 	    svm->bus_lock_rip == vmcb02->save.rip)
> 		svm->nested.bus_lock_rip = svm->bus_lock_rip;
> 	else
> 		svm->nested.bus_lock_rip = INVALID_GVA; /* or '0', as much as that bugs me */
>
> and then on nested VMRUN
>
> 	if (svm->nested.bus_lock_rip == vmcb02->save.rip) {
> 		vmcb02->control.bus_lock_counter = 1;
> 		svm->bus_lock_rip = svm->nested.bus_lock_rip;
> 	} else {
> 		vmcb02->control.bus_lock_counter = 0;
> 	}
>
> 	svm->nested.bus_lock_rip = INVALID_GVA;
>
> Again, it's imperfect, e.g. if L1 happens to run a different vCPU at the same
> RIP, then KVM will allow a bus lock for the wrong vCPU. But the odds of that
> happening are absurdly low unless L1 is deliberately doing weird things, and so
> I don't think it's worth worrying about.
>
>> +	 */
>> +	vmcb02->control.bus_lock_counter = vmcb01->control.bus_lock_counter;
>>  	/* Done at vmrun: asid. */
>>
>>  	/* Also overwritten later if necessary. */
>> @@ -1035,6 +1040,11 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
>>
>>  	}
>>
>> +	/*
>> +	 * The bus lock threshold count should keep running across nested
>> +	 * transitions. Copy the buslock threshold count from vmcb02 to vmcb01.
>> +	 */
>> +	vmcb01->control.bus_lock_counter = vmcb02->control.bus_lock_counter;
>
> KVM should definitely reset the counter to '0' on a nested VM-Exit, because L1
> can't be interrupted by L2, i.e. there is zero chance that KVM needs to allow a
> bus lock in L1 to ensure L1 makes forward progress.
>
>>  	nested_svm_copy_common_state(svm->nested.vmcb02.ptr, svm->vmcb01.ptr);
>>
>>  	svm_switch_vmcb(svm, &svm->vmcb01);
>> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
>> index 9df3e1e5ae81..0d0407f1ee6a 100644
>> --- a/arch/x86/kvm/svm/svm.c
>> +++ b/arch/x86/kvm/svm/svm.c
>> @@ -1372,6 +1372,9 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
>>  		svm->vmcb->control.int_ctl |= V_GIF_ENABLE_MASK;
>>  	}
>>
>> +	if (vcpu->kvm->arch.bus_lock_detection_enabled)
>> +		svm_set_intercept(svm, INTERCEPT_BUSLOCK);
>> +
>>  	if (sev_guest(vcpu->kvm))
>>  		sev_init_vmcb(svm);
>>
>> @@ -3286,6 +3289,24 @@ static int invpcid_interception(struct kvm_vcpu *vcpu)
>>  	return kvm_handle_invpcid(vcpu, type, gva);
>>  }
>>
>> +static int bus_lock_exit(struct kvm_vcpu *vcpu)
>> +{
>> +	struct vcpu_svm *svm = to_svm(vcpu);
>> +
>> +	vcpu->run->exit_reason = KVM_EXIT_X86_BUS_LOCK;
>> +	vcpu->run->flags |= KVM_RUN_X86_BUS_LOCK;
>> +
>> +	/*
>> +	 * Reload the counter with value greater than '0'.
>
> The value quite obviously must be exactly '1', not simply greater than '0'. I also
> think this is the wrong place to set the counter. Rather than set the counter at
> the time of exit, KVM should implement a vcpu->arch.complete_userspace_io callback
> and set the counter to '1' if and only if RIP (or LIP, but I have no objection to
> keeping things simple) is unchanged. It's a bit of extra complexity, but it will
> make it super obvious why KVM is setting the counter to '1'. And, if userspace
> wants to stuff state and move past the instruction, e.g. by emulating the guilty
> instruction, then KVM won't unnecessarily allow a bus lock in the guest.
>
> And then the comment can be:
>
> 	/*
> 	 * If userspace has NOT changed RIP, then KVM's ABI is to let the guest
> 	 * execute the bus-locking instruction. Set the bus lock counter to '1'
> 	 * to effectively step past the bus lock.
> 	 */
>

Thank you for highlighting these scenarios (for both nested and normal guests).
I had not thought about them.

I’m currently going through the comments and trying to fully understand them.
I’ll try out the suggested changes and get back to you.

- Manali

>> +	 * The bus lock exit on SVM happens with RIP pointing to the guilty
>> +	 * instruction. So, reloading the value of bus_lock_counter to '0'
>> +	 * results into generating continuous bus lock exits.
>> +	 */
>> +	svm->vmcb->control.bus_lock_counter = 1;
>> +
>> +	return 0;
>> +}
>> +
>>  static int (*const svm_exit_handlers[])(struct kvm_vcpu *vcpu) = {
>>  	[SVM_EXIT_READ_CR0]			= cr_interception,
>>  	[SVM_EXIT_READ_CR3]			= cr_interception,
>> @@ -3353,6 +3374,7 @@ static int (*const svm_exit_handlers[])(struct kvm_vcpu *vcpu) = {
>>  	[SVM_EXIT_CR4_WRITE_TRAP]		= cr_trap,
>>  	[SVM_EXIT_CR8_WRITE_TRAP]		= cr_trap,
>>  	[SVM_EXIT_INVPCID]			= invpcid_interception,
>> +	[SVM_EXIT_BUS_LOCK]			= bus_lock_exit,
>>  	[SVM_EXIT_NPF]				= npf_interception,
>>  	[SVM_EXIT_RSM]				= rsm_interception,
>>  	[SVM_EXIT_AVIC_INCOMPLETE_IPI]		= avic_incomplete_ipi_interception,
>> @@ -5227,6 +5249,11 @@ static __init void svm_set_cpu_caps(void)
>>  		kvm_cpu_cap_set(X86_FEATURE_SVME_ADDR_CHK);
>>  	}
>>
>> +	if (cpu_feature_enabled(X86_FEATURE_BUS_LOCK_THRESHOLD)) {
>> +		pr_info("Bus Lock Threashold supported\n");
>> +		kvm_caps.has_bus_lock_exit = true;
>> +	}
>> +
>>  	/* CPUID 0x80000008 */
>>  	if (boot_cpu_has(X86_FEATURE_LS_CFG_SSBD) ||
>>  	    boot_cpu_has(X86_FEATURE_AMD_SSBD))
>> --
>> 2.34.1
>>
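To make sure I have understood the complete_userspace_io suggestion correctly,
the non-nested piece I have in mind looks roughly like the below. It is an
untested sketch: bus_lock_rip would be a new field in struct vcpu_svm (the same
one your nested VMRUN/VM-Exit snippets use), and complete_userspace_buslock is
just a placeholder name.

static int complete_userspace_buslock(struct kvm_vcpu *vcpu)
{
	struct vcpu_svm *svm = to_svm(vcpu);

	/*
	 * If userspace has NOT changed RIP, then KVM's ABI is to let the
	 * guest execute the bus-locking instruction. Set the bus lock
	 * counter to '1' to effectively step past the bus lock.
	 */
	if (kvm_rip_read(vcpu) == svm->bus_lock_rip)
		svm->vmcb->control.bus_lock_counter = 1;

	return 1;
}

static int bus_lock_exit(struct kvm_vcpu *vcpu)
{
	struct vcpu_svm *svm = to_svm(vcpu);

	vcpu->run->exit_reason = KVM_EXIT_X86_BUS_LOCK;
	vcpu->run->flags |= KVM_RUN_X86_BUS_LOCK;

	/* Remember the guilty RIP; also reused for the nested tracking. */
	svm->bus_lock_rip = kvm_rip_read(vcpu);
	vcpu->arch.complete_userspace_io = complete_userspace_buslock;

	return 0;
}

If the above matches what you had in mind, I will build the nested VMRUN /
VM-Exit handling on top of it in the next version.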