/cast <thread necromancy>

On Tue, Aug 20, 2019 at 01:03:19PM -0700, Sean Christopherson wrote:
> On Mon, Aug 19, 2019 at 06:03:05PM +0200, Paolo Bonzini wrote:
> > It seems like the problem occurs when the sp->gfn you "continue over"
> > includes a userspace-MMIO gfn.  But since I have no better ideas right
> > now, I'm going to apply the revert (we don't know for sure that it only
> > happens with assigned devices).
>
> After many hours of staring, I've come to the conclusion that
> kvm_mmu_invalidate_zap_pages_in_memslot() is completely broken, i.e.
> a revert is definitely in order for 5.3 and stable.

Whelp, past me was wrong.  The patch wasn't completely broken, as the
rmap zapping piece of kvm_mmu_invalidate_zap_pages_in_memslot() was
sufficient to satisfy removal of the memslot.  I.e. zapping leaf PTEs
(including large pages) should prevent referencing the old memslot; the
fact that zapping upper level shadow pages was broken was irrelevant
because there should be no need to zap non-leaf PTEs.

The one thing that sticks out as potentially concerning is passing
%false for @lock_flush_tlb.  Dropping mmu_lock during
slot_handle_level_range() without flushing would allow a vCPU to create
and use a new entry while a different vCPU has the old entry cached in
its TLB.  I think that could even happen with a single vCPU if the
memslot is deleted by a helper task, and the zapped page was a large
page that was fractured into small pages when inserted into the TLB.

TL;DR: Assuming no other bugs in the kernel, this should be sufficient
if the goal is simply to prevent usage of a memslot:

static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
			struct kvm_memory_slot *slot,
			struct kvm_page_track_notifier_node *node)
{
	bool flush;

	spin_lock(&kvm->mmu_lock);
	flush = slot_handle_all_level(kvm, slot, kvm_zap_rmapp, true);
	if (flush)
		kvm_flush_remote_tlbs(kvm);
	spin_unlock(&kvm->mmu_lock);
}

> mmu_page_hash is indexed by the gfn of the shadow pages, not the gfn of
> the shadow ptes, e.g. for_each_valid_sp() will find page tables, page
> directories, etc...  Passing in the raw gfns of the memslot doesn't work
> because the gfn isn't properly adjusted/aligned to match how KVM tracks
> gfns for shadow pages, e.g. removal of the companion audio memslot that
> occupies gfns 0xc1080 - 0xc1083 would need to query gfn 0xc1000 to find
> the shadow page table containing the relevant sptes.
>
> This is why Paolo's suggestion to remove slot_handle_all_level() on
> kvm_zap_rmapp() caused a BUG(), as zapping the rmaps cleaned up KVM's
> accounting without actually zapping the relevant sptes.

This is just straight up wrong: zapping the rmaps does zap the leaf
sptes.  The BUG() occurs because gfn_to_rmap() works on the _new_
memslots instance, and if a memslot is being deleted there is no
memslot for the gfn, hence the NULL pointer bug when mmu_page_zap_pte()
attempts to zap a PTE.  Zapping the rmaps (leaf/last PTEs) first
"fixes" the issue by making it so that mmu_page_zap_pte() will never
see a present PTE for a non-existent memslot.

I don't think any of this explains the pass-through GPU issue.  But, we
have a few use cases where zapping the entire MMU is undesirable, so
I'm going to retry upstreaming this patch with a per-VM opt-in.  I
wanted to set the record straight for posterity before doing so.
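
To illustrate the gfn alignment point above, here's a rough sketch of
the math, assuming a direct map with 512 entries (9 bits of gfn) per
table level; base_gfn_for_level() is a made-up helper for illustration,
not an actual KVM function:

/*
 * Hypothetical helper (not in KVM): compute the gfn that keys
 * mmu_page_hash for the shadow page at @level containing @gfn.  A
 * shadow page covers 9 bits of gfn per level, so its gfn is the base
 * of the region it maps, not the gfn of any individual spte.
 */
static gfn_t base_gfn_for_level(gfn_t gfn, int level)
{
	return gfn & ~((1ULL << (level * 9)) - 1);
}

/*
 * base_gfn_for_level(0xc1080, 1) == 0xc1000, i.e. finding the level 1
 * page table that maps gfns 0xc1080 - 0xc1083 requires querying gfn
 * 0xc1000; looking up the raw memslot gfns misses the shadow page.
 */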
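
And for the gfn_to_rmap() explanation, a simplified sketch (not
verbatim kernel code) of how the NULL pointer bug manifests during
memslot deletion:

/*
 * Simplified sketch: by the time the zap runs, the active memslots
 * instance has already been swapped to one without the slot being
 * deleted, so the lookup below returns NULL and the subsequent rmap
 * access in mmu_page_zap_pte() -> rmap_remove() derefs a NULL slot.
 */
static unsigned long *gfn_to_rmap_sketch(struct kvm *kvm, gfn_t gfn,
					 int level)
{
	struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);

	/* slot == NULL for a gfn in the just-deleted memslot... */
	return __gfn_to_rmap(gfn, level, slot);	/* ...hence the BUG() */
}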