On Thu, Mar 28, 2024 at 02:49:56PM +1300, "Huang, Kai" <kai.huang@xxxxxxxxx> wrote: > > > On 28/03/2024 11:53 am, Isaku Yamahata wrote: > > On Tue, Mar 26, 2024 at 02:43:54PM +1300, > > "Huang, Kai" <kai.huang@xxxxxxxxx> wrote: > > > > > ... continue the previous review ... > > > > > > > + > > > > +static void tdx_reclaim_control_page(unsigned long td_page_pa) > > > > +{ > > > > + WARN_ON_ONCE(!td_page_pa); > > > > > > From the name 'td_page_pa' we cannot tell whether it is a control page, but > > > this function is only intended for control page AFAICT, so perhaps a more > > > specific name. > > > > > > > + > > > > + /* > > > > + * TDCX are being reclaimed. TDX module maps TDCX with HKID > > > > > > "are" -> "is". > > > > > > Are you sure it is TDCX, but not TDCS? > > > > > > AFAICT TDCX is the control structure for 'vcpu', but here you are handling > > > the control structure for the VM. > > > > TDCS, TDVPR, and TDCX. Will update the comment. > > But TDCX, TDVPR are vcpu-scoped. Do you want to mention them _here_? So I'll make the patch that frees TDVPR, TDCX will change this comment. > Otherwise you will have to explain them. > > [...] > > > > > + > > > > +void tdx_mmu_release_hkid(struct kvm *kvm) > > > > +{ > > > > + bool packages_allocated, targets_allocated; > > > > + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm); > > > > + cpumask_var_t packages, targets; > > > > + u64 err; > > > > + int i; > > > > + > > > > + if (!is_hkid_assigned(kvm_tdx)) > > > > + return; > > > > + > > > > + if (!is_td_created(kvm_tdx)) { > > > > + tdx_hkid_free(kvm_tdx); > > > > + return; > > > > + } > > > > > > I lost tracking what does "td_created()" mean. > > > > > > I guess it means: KeyID has been allocated to the TDX guest, but not yet > > > programmed/configured. > > > > > > Perhaps add a comment to remind the reviewer? > > > > As Chao suggested, will introduce state machine for vm and vcpu. > > > > https://lore.kernel.org/kvm/ZfvI8t7SlfIsxbmT@chao-email/ > > Could you elaborate what will the state machine look like? > > I need to understand it. Not yet. Chao only propose to introduce state machine. Right now it's just an idea. > > How about this? > > > > /* > > * We need three SEAMCALLs, TDH.MNG.VPFLUSHDONE(), TDH.PHYMEM.CACHE.WB(), and > > * TDH.MNG.KEY.FREEID() to free the HKID. > > * Other threads can remove pages from TD. When the HKID is assigned, we need > > * to use TDH.MEM.SEPT.REMOVE() or TDH.MEM.PAGE.REMOVE(). > > * TDH.PHYMEM.PAGE.RECLAIM() is needed when the HKID is free. Get lock to not > > * present transient state of HKID. > > */ > > Could you elaborate why it is still possible to have other thread removing > pages from TD? > > I am probably missing something, but the thing I don't understand is why > this function is triggered by MMU release? All the things done in this > function don't seem to be related to MMU at all. The KVM releases EPT pages on MMU notifier release. kvm_mmu_zap_all() does. If we follow that way, kvm_mmu_zap_all() zaps all the Secure-EPTs by TDH.MEM.SEPT.REMOVE() or TDH.MEM.PAGE.REMOVE(). Because TDH.MEM.{SEPT, PAGE}.REMOVE() is slow, we can free HKID before kvm_mmu_zap_all() to use TDH.PHYMEM.PAGE.RECLAIM(). > IIUC, by reaching here, you must already have done VPFLUSHDONE, which should > be called when you free vcpu? Not necessarily. > Freeing vcpus is done in > kvm_arch_destroy_vm(), which is _after_ mmu_notifier->release(), in which > this tdx_mmu_release_keyid() is called? guest memfd complicates things. The race is between guest memfd release and mmu notifier release. kvm_arch_destroy_vm() is called after closing all kvm fds including guest memfd. Here is the example. Let's say, we have fds for vhost, guest_memfd, kvm vcpu, and kvm vm. The process is exiting. Please notice vhost increments the reference of the mmu to access guest (shared) memory. exit_mmap(): Usually mmu notifier release is fired. But not yet because of vhost. exit_files() close vhost fd. vhost starts timer to issue mmput(). close guest_memfd. kvm_gmem_release() calls kvm_mmu_unmap_gfn_range(). kvm_mmu_unmap_gfn_range() eventually this calls TDH.MEM.SEPT.REMOVE() and TDH.MEM.PAGE.REMOVE(). This takes time because it processes whole guest memory. Call kvm_put_kvm() at last. During unmapping on behalf of guest memfd, the timer of vhost fires to call mmput(). It triggers mmu notifier release. Close kvm vcpus/vm. they call kvm_put_kvm(). The last one calls kvm_destroy_vm(). It's ideal to free HKID first for efficiency. But KVM doesn't have control on the order of fds. > But here we are depending vcpus to be freed before tdx_mmu_release_hkid()? Not necessarily. -- Isaku Yamahata <isaku.yamahata@xxxxxxxxx>