Re: [PATCH v19 038/130] KVM: TDX: create/destroy VM structure

Binbin Wu <binbin.wu@xxxxxxxxxxxxxxx> · Fri, 29 Mar 2024 15:25:47 +0800

On 3/29/2024 4:39 AM, Isaku Yamahata wrote:

[...]
How about this?

/*
   * We need three SEAMCALLs, TDH.MNG.VPFLUSHDONE(), TDH.PHYMEM.CACHE.WB(), and
   * TDH.MNG.KEY.FREEID() to free the HKID.
   * Other threads can remove pages from TD.  When the HKID is assigned, we need
   * to use TDH.MEM.SEPT.REMOVE() or TDH.MEM.PAGE.REMOVE().
   * TDH.PHYMEM.PAGE.RECLAIM() is needed when the HKID is free.  Get lock to not
   * present transient state of HKID.
   */
Could you elaborate why it is still possible to have other thread removing
pages from TD?

I am probably missing something, but the thing I don't understand is why
this function is triggered by MMU release?  All the things done in this
function don't seem to be related to MMU at all.
The KVM releases EPT pages on MMU notifier release.  kvm_mmu_zap_all() does. If
we follow that way, kvm_mmu_zap_all() zaps all the Secure-EPTs by
TDH.MEM.SEPT.REMOVE() or TDH.MEM.PAGE.REMOVE().  Because
TDH.MEM.{SEPT, PAGE}.REMOVE() is slow, we can free HKID before kvm_mmu_zap_all()
to use TDH.PHYMEM.PAGE.RECLAIM().
Can you elaborate why TDH.MEM.{SEPT,PAGE}.REMOVE is slower than
TDH.PHYMEM.PAGE.RECLAIM()?

And does the difference matter in practice, i.e. did you see using the former
having noticeable performance downgrade?
Yes. With HKID alive, we have to assume that vcpu can run still. It means TLB
shootdown. The difference is 2 extra SEAMCALL + IPI synchronization for each
guest private page.  If the guest has hundreds of GB, the difference can be
tens of minutes.

With HKID alive, we need to assume vcpu is alive.
- TDH.MEM.PAGE.REMOVE()
- TDH.PHYMEM.PAGE_WBINVD()
- TLB shoot down
   - TDH.MEM.TRACK()
   - IPI to other vcpus
   - wait for other vcpu to exit

Do we have a way to batch the TLB shoot down.
IIUC, in current implementation, TLB shoot down needs to be done for 
each page remove, right?

After freeing HKID
- TDH.PHYMEM.PAGE.RECLAIM()
   We already flushed TLBs and memory cache.

Freeing vcpus is done in
kvm_arch_destroy_vm(), which is _after_ mmu_notifier->release(), in which
this tdx_mmu_release_keyid() is called?
guest memfd complicates things.  The race is between guest memfd release and mmu
notifier release.  kvm_arch_destroy_vm() is called after closing all kvm fds
including guest memfd.

Here is the example.  Let's say, we have fds for vhost, guest_memfd, kvm vcpu,
and kvm vm.  The process is exiting.  Please notice vhost increments the
reference of the mmu to access guest (shared) memory.

exit_mmap():
   Usually mmu notifier release is fired. But not yet because of vhost.

exit_files()
   close vhost fd. vhost starts timer to issue mmput().
Why does it need to start a timer to issue mmput(), but not call mmput()
directly?
That's how vhost implements it.  It's out of KVM control.  Other component or
user space as other thread can get reference to mmu or FDs.  They can keep/free
them as they like.

   close guest_memfd.  kvm_gmem_release() calls kvm_mmu_unmap_gfn_range().
     kvm_mmu_unmap_gfn_range() eventually this calls TDH.MEM.SEPT.REMOVE()
     and TDH.MEM.PAGE.REMOVE().  This takes time because it processes whole
     guest memory. Call kvm_put_kvm() at last.

   During unmapping on behalf of guest memfd, the timer of vhost fires to call
   mmput().  It triggers mmu notifier release.

   Close kvm vcpus/vm. they call kvm_put_kvm().  The last one calls
   kvm_destroy_vm().

It's ideal to free HKID first for efficiency. But KVM doesn't have control on
the order of fds.
Firstly, what kinda performance efficiency gain are we talking about?
2 extra SEAMCALL + IPI sync for each guest private page.  If the guest memory
is hundreds of GB, the difference can be tens of minutes.

We cannot really tell whether it can be justified to use two different methods
to tear down SEPT page because of this.

Even if it's worth to do, it is an optimization, which can/should be done later
after you have put all building blocks together.

That being said, you are putting too many logic in this patch, i.e., it just
doesn't make sense to release TDX keyID in the MMU code path in _this_ patch.
I agree that this patch is too huge, and that we should break it into smaller
patches.

But here we are depending vcpus to be freed before tdx_mmu_release_hkid()?
Not necessarily.
I am wondering when is TDH.VP.FLUSH done?  Supposedly it should be called when
we free vcpus?  But again this means you need to call TDH.MNG.VPFLUSHDONE
_after_ freeing vcpus.  And this  looks conflicting if you make
tdx_mmu_release_keyid() being called from MMU notifier.
tdx_mmu_release_keyid() call it explicitly for all vcpus.