/cast <thread necromancy>

On Tue, Aug 20, 2019 at 01:03:19PM -0700, Sean Christopherson wrote:
> On Mon, Aug 19, 2019 at 06:03:05PM +0200, Paolo Bonzini wrote:
> > It seems like the problem occurs when the sp->gfn you "continue over"
> > includes a userspace-MMIO gfn.  But since I have no better ideas right
> > now, I'm going to apply the revert (we don't know for sure that it only
> > happens with assigned devices).
>
> After many hours of staring, I've come to the conclusion that
> kvm_mmu_invalidate_zap_pages_in_memslot() is completely broken, i.e.
> a revert is definitely in order for 5.3 and stable.

Whelp, past me was wrong.  The patch wasn't completely broken, as the
rmap zapping piece of kvm_mmu_invalidate_zap_pages_in_memslot() was
sufficient to satisfy removal of the memslot.  I.e. zapping leaf PTEs
(including large pages) should prevent referencing the old memslot; the
fact that zapping upper level shadow pages was broken was irrelevant
because there should be no need to zap non-leaf PTEs.

The one thing that sticks out as potentially concerning is passing
%false for @lock_flush_tlb.  Dropping mmu_lock during
slot_handle_level_range() without flushing would allow a vCPU to create
and use a new entry while a different vCPU has the old entry cached in
its TLB.  I think that could even happen with a single vCPU if the
memslot is deleted by a helper task, and the zapped page was a large
page that was fractured into small pages when inserted into the TLB.

TL;DR: Assuming no other bugs in the kernel, this should be sufficient
if the goal is simply to prevent usage of a memslot:

static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
			struct kvm_memory_slot *slot,
			struct kvm_page_track_notifier_node *node)
{
	bool flush;

	spin_lock(&kvm->mmu_lock);
	flush = slot_handle_all_level(kvm, slot, kvm_zap_rmapp, true);
	if (flush)
		kvm_flush_remote_tlbs(kvm);
	spin_unlock(&kvm->mmu_lock);
}

> mmu_page_hash is indexed by the gfn of the shadow pages, not the gfn of
> the shadow ptes, e.g. for_each_valid_sp() will find page tables, page
> directories, etc...  Passing in the raw gfns of the memslot doesn't work
> because the gfn isn't properly adjusted/aligned to match how KVM tracks
> gfns for shadow pages, e.g. removal of the companion audio memslot that
> occupies gfns 0xc1080 - 0xc1083 would need to query gfn 0xc1000 to find
> the shadow page table containing the relevant sptes.
>
> This is why Paolo's suggestion to remove slot_handle_all_level() on
> kvm_zap_rmapp() caused a BUG(), as zapping the rmaps cleaned up KVM's
> accounting without actually zapping the relevant sptes.

This is just straight up wrong: zapping the rmaps does zap the leaf
sptes.  The BUG() occurs because gfn_to_rmap() works on the _new_
memslots instance, and if a memslot is being deleted there is no
memslot for the gfn, hence the NULL pointer bug when mmu_page_zap_pte()
attempts to zap a PTE.  Zapping the rmaps (leaf/last PTEs) first
"fixes" the issue by making it so that mmu_page_zap_pte() will never
see a present PTE for a non-existent memslot.

I don't think any of this explains the pass-through GPU issue.  But, we
have a few use cases where zapping the entire MMU is undesirable, so
I'm going to retry upstreaming this patch with a per-VM opt-in.  I
wanted to set the record straight for posterity before doing so.
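
To illustrate the gfn alignment point above, here's a rough sketch of
the math, assuming a direct map with 512 entries (9 bits of gfn) per
table level; base_gfn_for_level() is a made-up helper for illustration,
not an actual KVM function:

/*
 * Hypothetical helper (not in KVM): compute the gfn that keys
 * mmu_page_hash for the shadow page at @level containing @gfn.  A
 * shadow page covers 9 bits of gfn per level, so its gfn is the base
 * of the region it maps, not the gfn of any individual spte.
 */
static gfn_t base_gfn_for_level(gfn_t gfn, int level)
{
	return gfn & ~((1ULL << (level * 9)) - 1);
}

/*
 * base_gfn_for_level(0xc1080, 1) == 0xc1000, i.e. finding the level 1
 * page table that maps gfns 0xc1080 - 0xc1083 requires querying gfn
 * 0xc1000; looking up the raw memslot gfns misses the shadow page.
 */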
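
And for the gfn_to_rmap() explanation, a simplified sketch (not
verbatim kernel code) of how the NULL pointer bug manifests during
memslot deletion:

/*
 * Simplified sketch: by the time the zap runs, the active memslots
 * instance has already been swapped to one without the slot being
 * deleted, so the lookup below returns NULL and the subsequent rmap
 * access in mmu_page_zap_pte() -> rmap_remove() derefs a NULL slot.
 */
static unsigned long *gfn_to_rmap_sketch(struct kvm *kvm, gfn_t gfn,
					 int level)
{
	struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);

	/* slot == NULL for a gfn in the just-deleted memslot... */
	return __gfn_to_rmap(gfn, level, slot);	/* ...hence the BUG() */
}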