Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT

Paolo Bonzini <pbonzini@xxxxxxxxxx> · Tue, 10 Sep 2024 15:15:28 +0200

On 9/9/24 23:11, Sean Christopherson wrote:
In general, I am_very_  opposed to blindly retrying an SEPT SEAMCALL, ever.  For
its operations, I'm pretty sure the only sane approach is for KVM to ensure there
will be no contention.  And if the TDX module's single-step protection spuriously
kicks in, KVM exits to userspace.  If the TDX module can't/doesn't/won't communicate
that it's mitigating single-step, e.g. so that KVM can forward the information
to userspace, then that's a TDX module problem to solve.

In principle I agree but we also need to be pragmatic.  Exiting to 
userspace may not be practical in all flows, for example.

First of all, we can add a spinlock around affected seamcalls.  This way 
we know that "busy" errors must come from the guest and have set 
HOST_PRIORITY.  It is still kinda bad that guests can force the VMM to 
loop, but the VMM can always say enough is enough.  In other words, 
let's assume that a limit of 16 is probably appropriate but we can also 
increase the limit and crash the VM if things become ridiculous.

Something like this:

	static u32 max = 16;
	int retry = 0;
	spin_lock(&kvm->arch.seamcall_lock);
	for (;;) {
		args_in = *in;
		ret = seamcall_ret(op, in);
		if (++retry == 1) {
			/* protected by the same seamcall_lock */
			kvm->stat.retried_seamcalls++;
		} else if (retry == READ_ONCE(max)) {
			pr_warn("Exceeded %d retries for S-EPT operation\n", max);
			if (KVM_BUG_ON(kvm, retry == 1024)) {
				pr_err("Crashing due to lock contention in the TDX module\n");
				break;
			}
			cmpxchg(&max, retry, retry * 2);
		}
	}
	spin_unlock(&kvm->arch.seamcall_lock);

This way we can do some testing and figure out a useful limit.

For zero step detection, my reading is that it's TDH.VP.ENTER that 
fails; not any of the MEM seamcalls.  For that one to be resolved, it 
should be enough to do take and release the mmu_lock back to back, which 
ensures that all pending critical sections have completed (that is, 
"write_lock(&kvm->mmu_lock); write_unlock(&kvm->mmu_lock);").  And then 
loop.  Adding a vCPU stat for that one is a good idea, too.

Paolo