Re: [PATCH 20/30] KVM: TDX: create/destroy VM structure

Sean Christopherson <seanjc@xxxxxxxxxx> · Fri, 21 Feb 2025 11:43:45 -0800

On Fri, Feb 21, 2025, Yan Zhao wrote:
> On Thu, Feb 20, 2025 at 04:55:18PM -0800, Sean Christopherson wrote:
> > TL;DR: Please don't merge this patch to kvm/next or kvm/queue.

...

> > Commit 6fcee03df6a1 ("KVM: x86: avoid loading a vCPU after .vm_destroy was called")
> > papered over an AVIC case, but there are issues, e.g. with the MSR filters[2],
> > and the NULL pointer deref that's blocking the aforementioned fix is an nVMX access
> > to the PIC.
> > 
> > I haven't fully tested destroying vCPUs before calling vm_destroy(), but I can't
> > see anything in vmx_vm_destroy() or svm_vm_destroy() that expects to run while
> > vCPUs are still alive.  If anything, it's the opposite, e.g. freeing VMX's IPIv PID
> > table before vCPUs are destroyed is blatantly unsafe.
> Is it possible to simply move code such as freeing the PID table from
> vmx_vm_destroy() to static_call_cond(kvm_x86_vm_free)(kvm)?

That would fix the potential PID table problem, but not the other issues with
freeing VM state before destroying vCPUs.

> > The good news is, I think it'll lead to a better approach (and naming).  KVM already
> > frees MMU state before vCPU state, because while MMUs are largely VM-scoped, all
> > of the common MMU state needs to be freed before any one vCPU is freed.
> > 
> > And so my plan is to carve out a kvm_destroy_mmus() helper, which can then call
> > the TDX hook to release/reclaim the HKID, which I assume needs to be done after
> > KVM's general MMU destruction, but before vCPUs are freed.
> > 
> > I'll make sure to Cc y'all on the series (typing and testing furiously to try and
> > get it out asap).  But to try and avoid posting code that's not usable for TDX,
> > will this work?
> > 
> > static void kvm_destroy_mmus(struct kvm *kvm)
> > {
> > 	struct kvm_vcpu *vcpu;
> > 	unsigned long i;
> > 
> > 	if (current->mm == kvm->mm) {
> > 		/*
> > 		 * Free memory regions allocated on behalf of userspace,
> > 		 * unless the memory map has changed due to process exit
> > 		 * or fd copying.
> > 		 */
> > 		mutex_lock(&kvm->slots_lock);
> > 		__x86_set_memory_region(kvm, APIC_ACCESS_PAGE_PRIVATE_MEMSLOT,
> > 					0, 0);
> > 		__x86_set_memory_region(kvm, IDENTITY_PAGETABLE_PRIVATE_MEMSLOT,
> > 					0, 0);
> > 		__x86_set_memory_region(kvm, TSS_PRIVATE_MEMSLOT, 0, 0);
> > 		mutex_unlock(&kvm->slots_lock);
> > 	}
> > 
> > 	kvm_for_each_vcpu(i, vcpu, kvm) {
> > 		kvm_clear_async_pf_completion_queue(vcpu);
> > 		kvm_unload_vcpu_mmu(vcpu);
> > 	}
> > 
> > 	kvm_x86_call(mmu_destroy)(kvm);
> I suppose this will hook tdx_mmu_release_hkid() ?

Please see my follow-up idea[1] to hook into kvm_arch_pre_destroy_vm().  Turns
out there isn't a hard requirement to unload MMUs prior to destroying vCPUs (at
least, not AFAICT).

[1] https://lore.kernel.org/all/Z7fSIMABm4jp5ADA@xxxxxxxxxx
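
To make the ordering concern concrete, here is a toy userspace model of the
teardown sequence discussed above.  It is only a sketch: every name in it
(toy_vm, toy_pre_destroy_vm(), etc.) is hypothetical and none of it is KVM
code.  It assumes a three-step teardown: a pre-destroy hook that releases a
VM-wide resource (standing in for the HKID), then vCPU destruction, and only
then freeing of VM-scoped state that vCPUs may still reference (standing in
for something like the PID table).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define TOY_MAX_VCPUS 4

/* Hypothetical stand-in for a VM; not a real KVM structure. */
struct toy_vm {
	int *shared_table;	/* VM-scoped state that vCPUs reference */
	bool key_released;	/* resource released in the pre-destroy step */
	int nr_vcpus;
	bool vcpu_alive[TOY_MAX_VCPUS];
};

static void toy_vm_init(struct toy_vm *vm, int nr_vcpus)
{
	int i;

	assert(nr_vcpus >= 0 && nr_vcpus <= TOY_MAX_VCPUS);
	vm->shared_table = calloc(TOY_MAX_VCPUS, sizeof(*vm->shared_table));
	assert(vm->shared_table);
	vm->key_released = false;
	vm->nr_vcpus = nr_vcpus;
	for (i = 0; i < TOY_MAX_VCPUS; i++)
		vm->vcpu_alive[i] = i < nr_vcpus;
}

static int toy_live_vcpus(const struct toy_vm *vm)
{
	int i, live = 0;

	for (i = 0; i < vm->nr_vcpus; i++)
		live += vm->vcpu_alive[i];
	return live;
}

/* Step 1: release the VM-wide resource while vCPUs still exist. */
static void toy_pre_destroy_vm(struct toy_vm *vm)
{
	vm->key_released = true;
}

/* Step 2: vCPU teardown may still touch VM-scoped state. */
static void toy_destroy_vcpus(struct toy_vm *vm)
{
	int i;

	for (i = 0; i < vm->nr_vcpus; i++) {
		vm->shared_table[i] = 0;	/* would be a use-after-free if freed earlier */
		vm->vcpu_alive[i] = false;
	}
}

/* Step 3: free VM-scoped state only once no vCPU can reference it. */
static void toy_vm_destroy(struct toy_vm *vm)
{
	assert(toy_live_vcpus(vm) == 0);
	free(vm->shared_table);
	vm->shared_table = NULL;
}

static void toy_teardown(struct toy_vm *vm)
{
	toy_pre_destroy_vm(vm);
	toy_destroy_vcpus(vm);
	toy_vm_destroy(vm);
}
```

Swapping steps 2 and 3 in toy_teardown() trips the assertion in
toy_vm_destroy(), which is the toy equivalent of freeing VM state while vCPUs
are still alive.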