Re: [PATCH 02/16] KVM: x86/mmu: Introduce a slot flag to zap only slot leafs on slot deletion

Sean Christopherson <seanjc@xxxxxxxxxx> · Wed, 15 May 2024 13:05:52 -0700

On Wed, May 15, 2024, Rick P Edgecombe wrote:
> On Wed, 2024-05-15 at 12:09 -0700, Sean Christopherson wrote:
> > > It's weird that userspace needs to control how KVM zaps page tables for
> > > memslot delete/move.
> > 
> > Yeah, this isn't quite what I had in mind.  Granted, what I had in mind may
> > not be much better, but I definitely don't want to let userspace
> > dictate exactly how KVM manages SPTEs.
> 
> To me it doesn't seem completely unprecedented at least. Linux has a ton of
> madvise() flags and other knobs to control this kind of PTE management for
> userspace memory.

Yes, but they all express their requests in terms of what behavior userspace wants
or to communicate userspace's access patterns.  They don't dictate exact low-level
behavior to the kernel.

> > My thinking for a memslot flag was more of a "deleting this memslot doesn't
> > have side effects", i.e. a way for userspace to give KVM the green light to
> > deviate from KVM's historical behavior of rebuilding the entire page
> > tables.  Under the hood, KVM would be allowed to do whatever it wants, e.g.
> > for the initial implementation, KVM would zap only leafs.  But critically,
> > KVM wouldn't be _required_ to zap only leafs.
> > 
> > > So to me it looks like overkill to expose this "zap-leaf-only" to userspace.
> > > We can just set this flag for a TDX guest when memslot is created in KVM.
> > 
> > 100% agreed from a functionality perspective.  My thoughts/concerns are
> > more about KVM's ABI.
> > 
> > Hmm, actually, we already have new uAPI/ABI in the form of VM types.  What
> > if we squeeze a documentation update into 6.10 (which adds the SEV VM
> > flavors) to state that KVM's historical behavior of blasting all SPTEs is
> > only _guaranteed_ for KVM_X86_DEFAULT_VM?
> > 
> > Anyone know if QEMU deletes shared-only, i.e. non-guest_memfd, memslots
> > during SEV-* boot?  If so, and assuming any such memslots are smallish, we
> > could even start enforcing the new ABI by doing a precise zap for small
> > (arbitrary limit TBD) shared-only memslots for !KVM_X86_DEFAULT_VM VMs.
> 
> Again thinking of the userspace memory analogy... Aren't there some VMs where
> the fast zap is faster? Like if you have a guest with a small memslot that gets
> deleted all the time, you could want it to be zapped specifically. But for the
> giant memslot next to it, you might want to do the fast zap all thing.

Yes.  But...

> So rather than try to optimize zapping more someday and hit similar issues, let
> userspace decide how it wants it to be done. I'm not sure of the actual
> performance tradeoffs here, to be clear.

...unless someone is able to root cause the VFIO regression, we don't have the
luxury of letting userspace give KVM a hint as to whether it might be better to
do a precise zap versus a nuke-and-pave.

And more importantly, it would be a _hint_, not the hard requirement that TDX
needs.
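
To make the hint-versus-requirement distinction concrete, here is a rough,
standalone sketch of the kind of policy being discussed.  None of these names
exist in KVM; the enums, the helper, and the 512-page threshold are all
hypothetical stand-ins (the thread only says "arbitrary limit TBD").  The idea
is that the full zap stays guaranteed only for the default VM type, TDX always
gets a leaf-only zap, and everything else leaves the choice to KVM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; not KVM's actual definitions. */
enum vm_type { VM_DEFAULT, VM_SEV, VM_TDX };
enum zap_policy { ZAP_ALL_FAST, ZAP_SLOT_LEAFS };

/* Assumed cutoff for a "smallish" memslot; the real limit is TBD. */
#define SMALL_SLOT_NPAGES 512

static enum zap_policy pick_zap_policy(enum vm_type type, uint64_t npages,
				       bool has_guest_memfd)
{
	/* Historical "blast all SPTEs" behavior is guaranteed only here. */
	if (type == VM_DEFAULT)
		return ZAP_ALL_FAST;

	/* Hard requirement, not a hint: TDX must zap only the slot's leafs. */
	if (type == VM_TDX)
		return ZAP_SLOT_LEAFS;

	/*
	 * Other VM types: KVM is allowed, but not required, to zap precisely.
	 * This sketch does so only for small, shared-only memslots.
	 */
	if (!has_guest_memfd && npages <= SMALL_SLOT_NPAGES)
		return ZAP_SLOT_LEAFS;

	return ZAP_ALL_FAST;
}
```

The point of structuring it this way is that userspace never selects a zap
mechanism; it only picks a VM type, and KVM remains free to change what happens
for anything other than the default and TDX cases.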

> That said, a per-VM knob is easier for TDX purposes.