On Mon, Sep 23, 2024 at 11:37:14AM -0700, Sean Christopherson wrote:
> On Wed, Jul 03, 2024, Yan Zhao wrote:
> > Introduce the quirk KVM_X86_QUIRK_SLOT_ZAP_ALL to allow users to select
> > KVM's behavior when a memslot is moved or deleted for KVM_X86_DEFAULT_VM
> > VMs. Make sure KVM behave as if the quirk is always disabled for
> > non-KVM_X86_DEFAULT_VM VMs.
>
> ...
>
> > Suggested-by: Kai Huang <kai.huang@xxxxxxxxx>
> > Suggested-by: Sean Christopherson <seanjc@xxxxxxxxxx>
>
> Bad Sean, bad.
>
> > +/*
> > + * Zapping leaf SPTEs with memslot range when a memslot is moved/deleted.
> > + *
> > + * Zapping non-leaf SPTEs, a.k.a. not-last SPTEs, isn't required, worst
> > + * case scenario we'll have unused shadow pages lying around until they
> > + * are recycled due to age or when the VM is destroyed.
> > + */
> > +static void kvm_mmu_zap_memslot_leafs(struct kvm *kvm, struct kvm_memory_slot *slot)
> > +{
> > +	struct kvm_gfn_range range = {
> > +		.slot = slot,
> > +		.start = slot->base_gfn,
> > +		.end = slot->base_gfn + slot->npages,
> > +		.may_block = true,
> > +	};
> > +	bool flush = false;
> > +
> > +	write_lock(&kvm->mmu_lock);
> > +
> > +	if (kvm_memslots_have_rmaps(kvm))
> > +		flush = kvm_handle_gfn_range(kvm, &range, kvm_zap_rmap);
>
> This, and Paolo's merged variant, break shadow paging. As was tried in commit
> 4e103134b862 ("KVM: x86/mmu: Zap only the relevant pages when removing a memslot"),
> all shadow pages, i.e. non-leaf SPTEs, need to be zapped. All of the accounting
> for a shadow page is tied to the memslot, i.e. the shadow page holds a reference
> to the memslot, for all intents and purposes. Deleting the memslot without removing
> all relevant shadow pages results in NULL pointer derefs when tearing down the VM.
>
> Note, that commit is/was buggy, and I suspect my follow-up attempt[*] was as well.
>
> [*] https://lore.kernel.org/all/20190820200318.GA15808@xxxxxxxxxxxxxxx
>
> Rather than trying to get this functional for shadow paging (which includes nested
> TDP), I think we should scrap the quirk idea and simply make this the behavior for
> S-EPT and nothing else.

OK, thanks for identifying this error. I'll change the code accordingly.

BTW, an update on the earlier bug with Nvidia GPU assignment:
I found that on v5.19-rc1 and later, even with nx_huge_pages=N, the bug is not
reproducible when only the leaf entries of the memslot are zapped.
(I have no further details because I had limited time to debug.)

>
> BUG: kernel NULL pointer dereference, address: 00000000000000b0
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x0000) - not-present page
> PGD 6085f43067 P4D 608c080067 PUD 608c081067 PMD 0
> Oops: Oops: 0000 [#1] SMP NOPTI
> CPU: 79 UID: 0 PID: 187063 Comm: set_memory_regi Tainted: G W 6.11.0-smp--24867312d167-cpl #395
> Tainted: [W]=WARN
> Hardware name: Google Astoria/astoria, BIOS 0.20240617.0-0 06/17/2024
> RIP: 0010:__kvm_mmu_prepare_zap_page+0x3a9/0x7b0 [kvm]
> Code: <48> 8b 8e b0 00 00 00 48 8b 96 e0 00 00 00 48 c1 e9 09 48 29 c8 8b
> RSP: 0018:ff314a25b19f7c28 EFLAGS: 00010212
> Call Trace:
> <TASK>
> kvm_arch_flush_shadow_all+0x7a/0xf0 [kvm]
> kvm_mmu_notifier_release+0x6c/0xb0 [kvm]
> mmu_notifier_unregister+0x85/0x140
> kvm_put_kvm+0x263/0x410 [kvm]
> kvm_vm_release+0x21/0x30 [kvm]
> __fput+0x8d/0x2c0
> __se_sys_close+0x71/0xc0
> do_syscall_64+0x83/0x160
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
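
To make the point about memslot-tied accounting concrete, here is a minimal
userspace sketch in C. It is a toy model, not KVM code: every name in it
(toy_memslot, toy_entry, toy_track, toy_delete_slot, toy_count_orphans,
toy_demo) is hypothetical, and the single-slot layout is an assumption made to
keep it short. It only illustrates how leaving non-leaf shadow pages behind
when a slot is deleted produces entries whose slot lookup returns NULL at
teardown, which is the kind of stale state Sean describes as leading to NULL
pointer derefs when tearing down the VM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_ENTRIES 8

/* Hypothetical stand-in for a memslot: a gfn range plus per-slot accounting. */
struct toy_memslot {
	unsigned long base_gfn;
	unsigned long npages;
	int tracked;		/* entries currently accounted against this slot */
	bool valid;		/* false once the slot has been deleted */
};

/* Hypothetical stand-in for something the MMU created for a gfn. */
struct toy_entry {
	unsigned long gfn;
	bool is_shadow_page;	/* true: non-leaf shadow page, false: leaf mapping */
	bool in_use;
};

struct toy_vm {
	struct toy_memslot slot;	/* assumption: one slot keeps the model small */
	struct toy_entry entries[TOY_MAX_ENTRIES];
};

/* Return the slot covering gfn, or NULL if it was deleted or never covered gfn. */
static struct toy_memslot *toy_gfn_to_memslot(struct toy_vm *vm, unsigned long gfn)
{
	struct toy_memslot *s = &vm->slot;

	if (s->valid && gfn >= s->base_gfn && gfn < s->base_gfn + s->npages)
		return s;
	return NULL;
}

/* Create an entry and account it against the slot that covers its gfn. */
static bool toy_track(struct toy_vm *vm, unsigned long gfn, bool is_shadow_page)
{
	struct toy_memslot *s = toy_gfn_to_memslot(vm, gfn);

	if (!s)
		return false;
	for (size_t i = 0; i < TOY_MAX_ENTRIES; i++) {
		if (!vm->entries[i].in_use) {
			vm->entries[i] = (struct toy_entry){ gfn, is_shadow_page, true };
			s->tracked++;
			return true;
		}
	}
	return false;
}

/* Delete the slot, zapping either every entry in its range or only the leaf ones. */
static void toy_delete_slot(struct toy_vm *vm, bool leaf_only)
{
	for (size_t i = 0; i < TOY_MAX_ENTRIES; i++) {
		struct toy_entry *e = &vm->entries[i];

		if (!e->in_use || toy_gfn_to_memslot(vm, e->gfn) == NULL)
			continue;
		if (leaf_only && e->is_shadow_page)
			continue;	/* left behind, still tied to the slot */
		e->in_use = false;
		vm->slot.tracked--;
	}
	vm->slot.valid = false;
}

/* At teardown, count entries whose slot lookup now fails. */
static int toy_count_orphans(struct toy_vm *vm)
{
	int orphans = 0;

	for (size_t i = 0; i < TOY_MAX_ENTRIES; i++) {
		if (vm->entries[i].in_use &&
		    !toy_gfn_to_memslot(vm, vm->entries[i].gfn))
			orphans++;
	}
	return orphans;
}

/* Build one slot with one leaf mapping and one shadow page, delete it, count orphans. */
int toy_demo(bool leaf_only)
{
	struct toy_vm vm = {
		.slot = { .base_gfn = 0x100, .npages = 0x10, .valid = true },
	};
	bool ok;

	ok = toy_track(&vm, 0x100, false);	/* leaf mapping */
	assert(ok);
	ok = toy_track(&vm, 0x100, true);	/* non-leaf shadow page */
	assert(ok);
	toy_delete_slot(&vm, leaf_only);
	return toy_count_orphans(&vm);
}
```

Under these assumptions, toy_demo(true) (leaf-only zap) returns 1 because the
shadow page outlives its slot, while toy_demo(false) (zap everything in the
slot's range) returns 0.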