On Fri, Oct 11, 2024 at 05:53:29AM +0800, Edgecombe, Rick P wrote: > On Thu, 2024-10-10 at 10:33 -0700, Sean Christopherson wrote: > > > > > > 1st: "fault->is_private != kvm_mem_is_private(kvm, fault->gfn)" is found. > > > 2nd-6th: try_cmpxchg64() fails on each level SPTEs (5 levels in total) > > Isn't there a more general scenario: > > vcpu0 vcpu1 > 1. Freezes PTE > 2. External op to do the SEAMCALL > 3. Faults same PTE, hits frozen PTE > 4. Retries N times, triggers zero-step > 5. Finally finishes external op > > Am I missing something? Yes, it's a follow-up discussion of Sean's proposal [1] of having TDX code to retry on RET_PF_RETRY_FROZEN to avoid zero-step. My worry is that merely avoiding entering guest for vCPUs seeing FROZEN_SPTE is not enough to prevent zero-step. The two examples shows zero-step is possible without re-entering guest for FROZEN_SPTE: - The selftest [2]: a single vCPU can fire zero-step when userspace does something wrong (though KVM is correct). - The above case: Nothing wrong in KVM/QEMU, except an extremely unlucky vCPU. [1] https://lore.kernel.org/all/ZuR09EqzU1WbQYGd@xxxxxxxxxx/ [2] https://lore.kernel.org/all/ZvPrqMj1BWrkkwqN@xxxxxxxxxxxxxxxxxxxxxxxxx/ > > > > > Very technically, this shouldn't be possible. The only way for there to be > > contention on the leaf SPTE is if some other KVM task installed a SPTE, i.e. > > the > > 6th attempt should succeed, even if the faulting vCPU wasn't the one to create > > the SPTE. > > > > That said, a few thoughts: > > > > 1. Where did we end up on the idea of requiring userspace to pre-fault memory? > > For others reference, I think you are referring to the idea to pre-fault the > entire S-EPT even for GFNs that usually get AUGed, not the mirrored EPT pre- > faulting/PAGE.ADD dance we are already doing. > > The last discussion with Paolo was to resume the retry solution discussion on > the v2 posting because it would be easier "with everything else already > addressed". Also, there was also some discussion that it was not immediately > obvious how prefaulting everything would work for memory hot plug (i.e. memslots > added during runtime). > > > > > 2. The zero-step logic really should have a slightly more conservative > > threshold. > > I have a hard time believing that e.g. 10 attempts would create a side > > channel, > > but 6 attempts is "fine". > > No idea where the threshold came from. I'm not sure if it affects the KVM > design? We can look into it for curiosity sake in either case. > > > > > 3. This would be a good reason to implement a local retry in > > kvm_tdp_mmu_map(). > > Yes, I'm being somewhat hypocritical since I'm so against retrying for the > > S-EPT case, but my objection to retrying for S-EPT is that it _should_ be > > easy > > for KVM to guarantee success. > > > > E.g. for #3, the below (compile tested only) patch should make it impossible > > for > > the S-EPT case to fail, as dirty logging isn't (yet) supported and mirror > > SPTEs > > should never trigger A/D assists, i.e. retry should always succeed. > > I don't see how it addresses the scenario above. More retires could just make it > rarer, but never fix it. Very possible I'm missing something though. I'm also not 100% sure if zero-step must not happen after this change even when KVM/QEMU do nothing wrong.