On Mon, Dec 13, 2021, Sean Christopherson wrote:
> On Mon, Dec 13, 2021, Paolo Bonzini wrote:
> > kvm_tdp_mmu_zap_all is intended to visit all roots and zap their page
> > tables, which flushes the accessed and dirty bits out to the Linux
> > "struct page"s.  Missing some of the roots has catastrophic effects,
> > because kvm_tdp_mmu_zap_all is called when the MMU notifier is being
> > removed and any PTEs left behind might become dangling by the time
> > kvm_arch_destroy_vm tears down the roots for good.
> >
> > Unfortunately that is exactly what kvm_tdp_mmu_zap_all is doing: it
> > visits all roots via for_each_tdp_mmu_root_yield_safe, which in turn
> > uses kvm_tdp_mmu_get_root to skip invalid roots.  If the current root is
> > invalid at the time of kvm_tdp_mmu_zap_all, its page tables will remain
> > in place but will later be zapped during kvm_arch_destroy_vm.
>
> As stated in the bug report thread[*], it should be impossible for the MMU
> notifier to be unregistered while kvm_mmu_zap_all_fast() is running.
>
> I do believe there's a race between set_nx_huge_pages() and kvm_mmu_notifier_release(),
> but that would result in the use-after-free kvm_set_pfn_dirty() tracing back to
> set_nx_huge_pages(), not kvm_destroy_vm().  And for that, I would much prefer we
> elevate mm->users while changing the NX hugepage setting.

Mwhahaha, race confirmed with a bit of hacking to force the issue.  I'll get a
patch out.
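
For illustration only, here is a rough user-space model of what "elevate mm->users while changing the NX hugepage setting" could look like. This is a sketch, not the patch mentioned above and not kernel code. Every name in it (struct mm_model, mm_model_get_not_zero(), mm_model_put(), model_set_nx_huge_pages()) is hypothetical. The two helpers only mimic the semantics of the kernel's mmget_not_zero() and mmput(). The notifier_released flag stands in for kvm_mmu_notifier_release() having run, and the zap counter stands in for kvm_mmu_zap_all_fast().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for mm_struct; only the user count is modeled. */
struct mm_model {
	atomic_int users;        /* models mm->users */
	bool notifier_released;  /* models kvm_mmu_notifier_release() having run */
};

/* Mimics mmget_not_zero(): take a reference unless the count already hit zero. */
bool mm_model_get_not_zero(struct mm_model *mm)
{
	int old = atomic_load(&mm->users);

	while (old != 0) {
		/* On failure, 'old' is reloaded with the current value and we retry. */
		if (atomic_compare_exchange_weak(&mm->users, &old, old + 1))
			return true;
	}
	return false;
}

/* Mimics mmput(): drop a reference; the final put runs the release path. */
void mm_model_put(struct mm_model *mm)
{
	if (atomic_fetch_sub(&mm->users, 1) == 1)
		mm->notifier_released = true;
}

/*
 * Models the NX hugepage update: only zap while holding a pin on the mm,
 * so the release path cannot run in the middle of the zap.
 * Returns true if the zap ran, false if the mm was already going away.
 */
bool model_set_nx_huge_pages(struct mm_model *mm, int *zaps)
{
	if (!mm_model_get_not_zero(mm))
		return false;

	(*zaps)++;          /* stands in for kvm_mmu_zap_all_fast() */
	mm_model_put(mm);   /* drop the pin; release may run only after this */
	return true;
}

/* Small self-check of the model. */
void model_demo(void)
{
	struct mm_model mm = { .users = 1, .notifier_released = false };
	int zaps = 0;

	/* The mm is alive: the pin succeeds and the zap runs. */
	assert(model_set_nx_huge_pages(&mm, &zaps) && zaps == 1);
	assert(!mm.notifier_released);

	/* The last user exits: the release path runs. */
	mm_model_put(&mm);
	assert(mm.notifier_released);

	/* The mm is gone: no pin can be taken, so no zap races with release. */
	assert(!model_set_nx_huge_pages(&mm, &zaps) && zaps == 1);
}
```

The point of the model is ordering: with the pin held, the release path cannot start until the zap has finished, and once release has run, the zap is skipped entirely. Whether the real fix takes this shape is an assumption on my part.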