On Tue, Sep 10, 2024 at 3:58 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> On Tue, Sep 10, 2024, Paolo Bonzini wrote:
> No, because that defeats the purpose of having mmu_lock be a rwlock.

But if this part of the TDX module is wrapped in a single big try_lock,
there is no difference between spinning around busy seamcalls and doing
spin_lock(&kvm->arch.seamcall_lock): both hit contention in the same
way.

With respect to FROZEN_SPTE...

> > This way we know that "busy" errors must come from the guest and have set
> > HOST_PRIORITY.
>
> We should be able to achieve that without a VM-wide spinlock. My thought (from
> v11?) was to effectively use the FROZEN_SPTE bit as a per-SPTE spinlock, i.e. keep
> it set until the SEAMCALL completes.

That works only if the TDX module returns BUSY per-SPTE (as suggested by
section 18.1.3, which documents that the TDX module returns
TDX_OPERAND_BUSY on a CMPXCHG failure). If it returns BUSY per-VM,
FROZEN_SPTE is not enough to prevent contention in the TDX module.

If we want to be a bit more optimistic, we can do something more
sophisticated, like only taking the lock after the first busy reply.
But the spinlock is the easiest way to completely remove host-induced
TDX_OPERAND_BUSY, leaving only the guest-induced ones to deal with.

> > It is still kinda bad that guests can force the VMM to loop, but the VMM can
> > always say enough is enough. In other words, let's assume that a limit of
> > 16 is probably appropriate but we can also increase the limit and crash the
> > VM if things become ridiculous.
>
> 2 :-)
>
> One try that guarantees no other host task is accessing the S-EPT entry, and a
> second try after blasting IPI to kick vCPUs to ensure no guest-side task has
> locked the S-EPT entry.

Fair enough. Though in principle it is possible to race and have the
vCPU re-run and re-issue a TDG call before KVM re-issues the TDH call.
So I would make it 5 or so just to be safe.

> My concern with an arbitrary retry loop is that we'll essentially propagate the
> TDX module issues to the broader kernel. Each of those SEAMCALLs is slooow, so
> retrying even ~20 times could exceed the system's tolerances for scheduling, RCU,
> etc...

How slow are the failed ones? The break-even number of retries is
essentially the cost of a successful seamcall divided by the cost of a
busy one. If HOST_PRIORITY works, even a not-small-but-not-huge number
of retries would be better than the IPIs; IPIs are not cheap either.

> > For zero step detection, my reading is that it's TDH.VP.ENTER that fails;
> > not any of the MEM seamcalls. For that one to be resolved, it should be
> > enough to take and release the mmu_lock back to back, which ensures that
> > all pending critical sections have completed (that is,
> > "write_lock(&kvm->mmu_lock); write_unlock(&kvm->mmu_lock);"). And then
> > loop. Adding a vCPU stat for that one is a good idea, too.
>
> As above and in my discussion with Rick, I would prefer to kick vCPUs to force
> forward progress, especially for the zero-step case. If KVM gets to the point
> where it has retried TDH.VP.ENTER on the same fault so many times that zero-step
> kicks in, then it's time to kick and wait, not keep retrying blindly.

Wait, zero-step detection should _not_ affect TDH.MEM latency. Only
TDH.VP.ENTER is delayed. If it is delayed to the point of failing, we
can do write_lock/write_unlock() in the vCPU entry path.
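Concretely, something like the following (rough and untested) is the
kind of thing I have in mind; tdx_sept_seamcall_retry(),
tdx_seamcall_sept(), tdx_drain_sept_operations() and the retry limit
are made-up names, and the TDX_OPERAND_BUSY check is simplified since
the real status codes also encode which operand was busy:

/* Made-up limit; "5 or so", per the above. */
#define TDX_SEPT_RETRIES        5

static int tdx_sept_seamcall_retry(struct kvm *kvm, u64 op, gpa_t gpa, int level)
{
        u64 err = 0;
        int i;

        for (i = 0; i < TDX_SEPT_RETRIES; i++) {
                /* Stand-in for the real TDH.MEM.* wrapper. */
                err = tdx_seamcall_sept(kvm, op, gpa, level);
                if (err != TDX_OPERAND_BUSY)
                        break;

                /*
                 * The S-EPT entry may be held by a guest-side TDG call;
                 * kick all vCPUs out of the guest so it is released
                 * before the next attempt.
                 */
                kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE);
        }

        /* Enough is enough: bug the VM rather than looping forever. */
        if (KVM_BUG_ON(err, kvm))
                return -EIO;

        return 0;
}

/*
 * Zero-step case: taking and releasing mmu_lock back to back before
 * re-entering the vCPU waits for all pending critical sections, so that
 * the fault which tripped zero-step detection can be resolved.
 */
static void tdx_drain_sept_operations(struct kvm *kvm)
{
        write_lock(&kvm->mmu_lock);
        write_unlock(&kvm->mmu_lock);
}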
My issue is that, even if we could make it a bit better by looking at
the TDX module source code, we don't have enough information to make a
good choice. For now we should start with something _easy_, even if it
may not be the greatest.

Paolo