On Tue, Sep 10, 2024 at 3:58 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> On Tue, Sep 10, 2024, Paolo Bonzini wrote:
> No, because that defeats the purpose of having mmu_lock be a rwlock.

But if this part of the TDX module is wrapped in a single big try_lock,
there is no difference between spinning around busy seamcalls and doing
spin_lock(&kvm->arch.seamcall_lock): both hit contention in the same
way.

With respect to FROZEN_SPTE...

> > This way we know that "busy" errors must come from the guest and have set
> > HOST_PRIORITY.
>
> We should be able to achieve that without a VM-wide spinlock. My thought (from
> v11?) was to effectively use the FROZEN_SPTE bit as a per-SPTE spinlock, i.e. keep
> it set until the SEAMCALL completes.

That works only if the TDX module returns BUSY per-SPTE (as suggested by
section 18.1.3, which documents that the TDX module returns
TDX_OPERAND_BUSY on a CMPXCHG failure). If it returns BUSY per-VM,
FROZEN_SPTE is not enough to prevent contention in the TDX module.

If we want to be a bit more optimistic, we can do something more
sophisticated, like only taking the lock after the first busy reply.
But the spinlock is the easiest way to completely remove host-induced
TDX_OPERAND_BUSY, leaving only the guest-induced ones to deal with.

> > It is still kinda bad that guests can force the VMM to loop, but the VMM can
> > always say enough is enough. In other words, let's assume that a limit of
> > 16 is probably appropriate but we can also increase the limit and crash the
> > VM if things become ridiculous.
>
> 2 :-)
>
> One try that guarantees no other host task is accessing the S-EPT entry, and a
> second try after blasting IPI to kick vCPUs to ensure no guest-side task has
> locked the S-EPT entry.

Fair enough. Though in principle it is possible to race and have the
vCPU re-run and re-issue a TDG call before KVM re-issues the TDH call.
So I would make it 5 or so just to be safe.

> My concern with an arbitrary retry loop is that we'll essentially propagate the
> TDX module issues to the broader kernel. Each of those SEAMCALLs is slooow, so
> retrying even ~20 times could exceed the system's tolerances for scheduling, RCU,
> etc...

How slow are the failed ones? The break-even number of retries is
essentially the cost of a successful seamcall divided by the cost of a
busy one. If HOST_PRIORITY works, even a not-small-but-not-huge number
of retries would be better than the IPIs; IPIs are not cheap either.

> > For zero step detection, my reading is that it's TDH.VP.ENTER that fails;
> > not any of the MEM seamcalls. For that one to be resolved, it should be
> > enough to take and release the mmu_lock back to back, which ensures that
> > all pending critical sections have completed (that is,
> > "write_lock(&kvm->mmu_lock); write_unlock(&kvm->mmu_lock);"). And then
> > loop. Adding a vCPU stat for that one is a good idea, too.
>
> As above and in my discussion with Rick, I would prefer to kick vCPUs to force
> forward progress, especially for the zero-step case. If KVM gets to the point
> where it has retried TDH.VP.ENTER on the same fault so many times that zero-step
> kicks in, then it's time to kick and wait, not keep retrying blindly.

Wait, zero-step detection should _not_ affect TDH.MEM latency. Only
TDH.VP.ENTER is delayed. If it is delayed to the point of failing, we
can do write_lock/write_unlock() in the vCPU entry path.
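Concretely, something like the following (rough and untested) is the
kind of thing I have in mind; tdx_sept_seamcall_retry(),
tdx_seamcall_sept(), tdx_drain_sept_operations() and the retry limit
are made-up names, and the TDX_OPERAND_BUSY check is simplified since
the real status codes also encode which operand was busy:

/* Made-up limit; "5 or so", per the above. */
#define TDX_SEPT_RETRIES        5

static int tdx_sept_seamcall_retry(struct kvm *kvm, u64 op, gpa_t gpa, int level)
{
        u64 err = 0;
        int i;

        for (i = 0; i < TDX_SEPT_RETRIES; i++) {
                /* Stand-in for the real TDH.MEM.* wrapper. */
                err = tdx_seamcall_sept(kvm, op, gpa, level);
                if (err != TDX_OPERAND_BUSY)
                        break;

                /*
                 * The S-EPT entry may be held by a guest-side TDG call;
                 * kick all vCPUs out of the guest so it is released
                 * before the next attempt.
                 */
                kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE);
        }

        /* Enough is enough: bug the VM rather than looping forever. */
        if (KVM_BUG_ON(err, kvm))
                return -EIO;

        return 0;
}

/*
 * Zero-step case: taking and releasing mmu_lock back to back before
 * re-entering the vCPU waits for all pending critical sections, so that
 * the fault which tripped zero-step detection can be resolved.
 */
static void tdx_drain_sept_operations(struct kvm *kvm)
{
        write_lock(&kvm->mmu_lock);
        write_unlock(&kvm->mmu_lock);
}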
My issue is that, even if we could make it a bit better by looking at
the TDX module source code, we don't have enough information to make a
good choice. For now we should start with something _easy_, even if it
may not be the greatest.

Paolo