On Fri, Sep 13, 2024, Rick P Edgecombe wrote:
> On Fri, 2024-09-13 at 10:23 -0700, Sean Christopherson wrote:
> > > TL;DR:
> > > - tdh_mem_track() can contend with tdh_vp_enter().
> > > - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected.
> >
> > The zero-step logic seems to be the most problematic. E.g. if KVM is trying
> > to

I am getting a feeling of deja vu. Please fix your mail client to not generate
newlines in the middle of quoted text.

> > install a page on behalf of two vCPUs, and KVM resumes the guest if it
> > encounters a FROZEN_SPTE when building the non-leaf SPTEs, then one of the
> > vCPUs could trigger the zero-step mitigation if the vCPU that "wins" and
> > gets delayed for whatever reason.
>
> Can you explain more about what the concern is here? That the zero-step
> mitigation activation will be a drag on the TD because of extra contention with
> the TDH.MEM calls?
>
> >
> > Since FROZEN_SPTE is essentially bit-spinlock with a reaaaaaly slow
> > slow-path, what if instead of resuming the guest if a page fault hits
> > FROZEN_SPTE, KVM retries the fault "locally", i.e. _without_ redoing
> > tdh_vp_enter() to see if the vCPU still hits the fault?
>
> It seems like an optimization. To me, I would normally want to know how much it
> helped before adding it. But if you think it's an obvious win I'll defer.

I'm not worried about any performance hit with zero-step, I'm worried about KVM
not being able to differentiate between a KVM bug and guest interference. The
goal with a local retry is to make it so that KVM _never_ triggers zero-step,
unless there is a bug somewhere. At that point, if zero-step fires, KVM can
report the error to userspace instead of trying to suppress guest activity, and
potentially from other KVM tasks too.

It might even be simpler overall too. E.g.
report status up the call chain and let the top-level TDX S-EPT handler do its
thing, versus adding various flags and control knobs to ensure a vCPU can make
forward progress.
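
To make the "local retry" idea a bit more concrete, here is a rough userspace
sketch of the control flow, not kernel code. Every name in it (toy_spte,
toy_handle_fault(), toy_fault_local_retry(), the fault_status values) is made
up for illustration and is not KVM's or the TDX module's API. The bounded retry
count is also purely an assumption of the sketch; nothing above says the real
loop would be bounded.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical status codes, not KVM's real return values. */
enum fault_status { FAULT_FIXED, FAULT_RETRY_FROZEN };

/* Toy stand-in for an SPTE that another vCPU currently has frozen. */
struct toy_spte {
	int frozen_for;	/* how many more attempts will still see it frozen */
	bool present;	/* set once the mapping has been installed */
};

/* One attempt to install the mapping; contention is reported up the chain. */
static enum fault_status toy_handle_fault(struct toy_spte *spte)
{
	if (spte->frozen_for > 0) {
		spte->frozen_for--;	/* models the "winning" vCPU making progress */
		return FAULT_RETRY_FROZEN;
	}
	spte->present = true;
	return FAULT_FIXED;
}

/*
 * Top-level handler: while the fault hits a frozen entry, retry here instead
 * of going back into the guest. The max_retries bound is an assumption of
 * this sketch. *attempts reports how many tries were made.
 */
static enum fault_status toy_fault_local_retry(struct toy_spte *spte,
					       int max_retries, int *attempts)
{
	enum fault_status st;

	assert(max_retries > 0);
	*attempts = 0;
	do {
		st = toy_handle_fault(spte);
		(*attempts)++;
	} while (st == FAULT_RETRY_FROZEN && *attempts < max_retries);

	return st;
}
```

The point of the shape is that the guest is never re-entered while the entry is
frozen, so in this model repeated faults on the same address can only come from
something other than KVM's own contention. What the caller does when the status
is still FAULT_RETRY_FROZEN after the loop is deliberately left open here.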