On Mon, Dec 13, 2021, Paolo Bonzini wrote:
> kvm_tdp_mmu_zap_all is intended to visit all roots and zap their page
> tables, which flushes the accessed and dirty bits out to the Linux
> "struct page"s.  Missing some of the roots has catastrophic effects,
> because kvm_tdp_mmu_zap_all is called when the MMU notifier is being
> removed and any PTEs left behind might become dangling by the time
> kvm_arch_destroy_vm tears down the roots for good.
>
> Unfortunately that is exactly what kvm_tdp_mmu_zap_all is doing: it
> visits all roots via for_each_tdp_mmu_root_yield_safe, which in turn
> uses kvm_tdp_mmu_get_root to skip invalid roots.  If the current root
> is invalid at the time of kvm_tdp_mmu_zap_all, its page tables will
> remain in place but will later be zapped during kvm_arch_destroy_vm.

As stated in the bug report thread[*], it should be impossible for the
MMU notifier to be unregistered while kvm_mmu_zap_all_fast() is running.

I do believe there's a race between set_nx_huge_pages() and
kvm_mmu_notifier_release(), but that would result in the use-after-free
kvm_set_pfn_dirty() tracing back to set_nx_huge_pages(), not
kvm_destroy_vm().  And for that, I would much prefer we elevate
mm->users while changing the NX hugepage setting.

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8f0035517450..985df4db8192 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6092,10 +6092,15 @@ static int set_nx_huge_pages(const char *val, const struct kernel_param *kp)
 		mutex_lock(&kvm_lock);
 
 		list_for_each_entry(kvm, &vm_list, vm_list) {
+			if (!mmget_not_zero(kvm->mm))
+				continue;
+
 			mutex_lock(&kvm->slots_lock);
 			kvm_mmu_zap_all_fast(kvm);
 			mutex_unlock(&kvm->slots_lock);
 
+			mmput_async(kvm->mm);
+
 			wake_up_process(kvm->arch.nx_lpage_recovery_thread);
 		}
 		mutex_unlock(&kvm_lock);

[*] https://lore.kernel.org/all/Ybdxd7QcJI71UpHm@xxxxxxxxxx/
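
A small aside on why the diff above uses mmget_not_zero() rather than an
unconditional mmget(): once mm_users has already dropped to zero the mm
is being torn down, and the only safe thing to do is skip it instead of
resurrecting the reference count; mmput_async() then drops the pinned
reference without doing the final teardown in this context.  Below is a
minimal, self-contained user-space sketch of that "take a reference only
if the object is still live" idiom, using C11 atomics.  The names
(obj_get_not_zero(), obj_put()) are illustrative stand-ins, not kernel
APIs.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct obj {
	atomic_int users;		/* analogous to mm->mm_users */
};

/* Succeed only if the object still has users; never revive a dead one. */
static bool obj_get_not_zero(struct obj *o)
{
	int old = atomic_load(&o->users);

	while (old != 0) {
		if (atomic_compare_exchange_weak(&o->users, &old, old + 1))
			return true;
		/* CAS failure reloaded 'old' with the current value; retry. */
	}
	return false;			/* teardown already started, skip */
}

static void obj_put(struct obj *o)
{
	/* atomic_fetch_sub() returns the previous value. */
	if (atomic_fetch_sub(&o->users, 1) == 1)
		printf("last reference dropped, tearing down\n");
}

int main(void)
{
	struct obj o = { .users = 1 };

	/* A worker pins the object before touching it, mirroring
	 * mmget_not_zero() before kvm_mmu_zap_all_fast(). */
	if (obj_get_not_zero(&o)) {
		printf("pinned, doing work\n");
		obj_put(&o);
	}

	obj_put(&o);			/* owner drops the initial reference */

	/* The count is now zero: a late worker must observe the failure
	 * and skip, rather than resurrect an object mid-teardown. */
	if (!obj_get_not_zero(&o))
		printf("object already dying, skipped\n");

	return 0;
}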