On Thu, Feb 20, 2025 at 04:55:18PM -0800, Sean Christopherson wrote: > TL;DR: Please don't merge this patch to kvm/next or kvm/queue. > > On Thu, Feb 20, 2025, Paolo Bonzini wrote: > > @@ -72,8 +94,10 @@ struct kvm_x86_ops vt_x86_ops __initdata = { > > .has_emulated_msr = vmx_has_emulated_msr, > > > > .vm_size = sizeof(struct kvm_vmx), > > - .vm_init = vmx_vm_init, > > - .vm_destroy = vmx_vm_destroy, > > + > > + .vm_init = vt_vm_init, > > + .vm_destroy = vt_vm_destroy, > > + .vm_free = vt_vm_free, > > > > .vcpu_precreate = vmx_vcpu_precreate, > > .vcpu_create = vmx_vcpu_create, > > ... > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > > index 374d89e6663f..e0b9b845df58 100644 > > --- a/arch/x86/kvm/x86.c > > +++ b/arch/x86/kvm/x86.c > > @@ -12884,6 +12884,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm) > > kvm_page_track_cleanup(kvm); > > kvm_xen_destroy_vm(kvm); > > kvm_hv_destroy_vm(kvm); > > + static_call_cond(kvm_x86_vm_free)(kvm); > > } > > Sorry to throw a wrench in things, but I have a fix that I want to send for 6.14[1], > i.e. before this code, and to land that fix I need/want to destroy vCPUs before > calling kvm_x86_ops.vm_destroy(). *sigh* > > The underlying issue is that both nVMX and nSVM suck and access all manner of VM-wide > state when destroying a vCPU that is currently in nested guest mode, and I want > to fix the most pressing issue of destroying vCPUs at a random time once and for > all. nVMX and nSVM also need to be cleaned up to not access so much darn state, > but I'm worried that "fixing" the nested cases will only whack the biggest mole. > > Commit 6fcee03df6a1 ("KVM: x86: avoid loading a vCPU after .vm_destroy was called") > papered over an AVIC case, but there are issues, e.g. with the MSR filters[2], > and the NULL pointer deref that's blocking the aforementioned fix is a nVMX access > to the PIC. > > I haven't fully tested destroying vCPUs before calling vm_destroy(), but I can't > see anything in vmx_vm_destroy() or svm_vm_destroy() that expects to run while > vCPUs are still alive. If anything, it's opposite, e.g. freeing VMX's IPIv PID > table before vCPUs are destroyed is blatantly unsafe. Is it possible to simply move the code like freeing PID table from vmx_vm_destroy() to static_call_cond(kvm_x86_vm_free)(kvm) ? > > The good news is, I think it'll lead to a better approach (and naming). KVM already > frees MMU state before vCPU state, because while MMUs are largely VM-scoped, all > of the common MMU state needs to be freed before any one vCPU is freed. > > And so my plan is to carved out a kvm_destroy_mmus() helper, which can then call > the TDX hook to release/reclaim the HKID, which I assume needs to be done after > KVM's general MMU destruction, but before vCPUs are freed. > > I'll make sure to Cc y'all on the series (typing and testing furiously to try and > get it out asap). But to try and avoid posting code that's not usable for TDX, > will this work? > > static void kvm_destroy_mmus(struct kvm *kvm) > { > struct kvm_vcpu *vcpu; > unsigned long i; > > if (current->mm == kvm->mm) { > /* > * Free memory regions allocated on behalf of userspace, > * unless the memory map has changed due to process exit > * or fd copying. > */ > mutex_lock(&kvm->slots_lock); > __x86_set_memory_region(kvm, APIC_ACCESS_PAGE_PRIVATE_MEMSLOT, > 0, 0); > __x86_set_memory_region(kvm, IDENTITY_PAGETABLE_PRIVATE_MEMSLOT, > 0, 0); > __x86_set_memory_region(kvm, TSS_PRIVATE_MEMSLOT, 0, 0); > mutex_unlock(&kvm->slots_lock); > } > > kvm_for_each_vcpu(i, vcpu, kvm) { > kvm_clear_async_pf_completion_queue(vcpu); > kvm_unload_vcpu_mmu(vcpu); > } > > kvm_x86_call(mmu_destroy)(kvm); I suppose this will hook tdx_mmu_release_hkid() ? > } > > void kvm_arch_pre_destroy_vm(struct kvm *kvm) > { > kvm_mmu_pre_destroy_vm(kvm); > } > > void kvm_arch_destroy_vm(struct kvm *kvm) > { > /* > * WARNING! MMUs must be destroyed before vCPUs, and vCPUs must be > * destroyed before any VM state. Most MMU state is VM-wide, but is > * tracked per-vCPU, and so must be unloaded/freed in its entirety > * before any one vCPU is destroyed. For all other VM state, vCPUs > * expect to be able to access VM state until the vCPU is freed. > */ > kvm_destroy_mmus(kvm); > kvm_destroy_vcpus(kvm); > kvm_x86_call(vm_destroy)(kvm); > > ... > } > > [1] https://lore.kernel.org/all/Z66RC673dzlq2YuA@xxxxxxxxxx > [2] https://lore.kernel.org/all/20240703175618.2304869-2-aaronlewis@xxxxxxxxxx