On Tue, Sep 24, 2024, Yan Zhao wrote:
> On Mon, Sep 23, 2024 at 11:37:14AM -0700, Sean Christopherson wrote:
> > > +static void kvm_mmu_zap_memslot_leafs(struct kvm *kvm, struct kvm_memory_slot *slot)
> > > +{
> > > +	struct kvm_gfn_range range = {
> > > +		.slot = slot,
> > > +		.start = slot->base_gfn,
> > > +		.end = slot->base_gfn + slot->npages,
> > > +		.may_block = true,
> > > +	};
> > > +	bool flush = false;
> > > +
> > > +	write_lock(&kvm->mmu_lock);
> > > +
> > > +	if (kvm_memslots_have_rmaps(kvm))
> > > +		flush = kvm_handle_gfn_range(kvm, &range, kvm_zap_rmap);
> >
> > This, and Paolo's merged variant, break shadow paging.  As was tried in commit
> > 4e103134b862 ("KVM: x86/mmu: Zap only the relevant pages when removing a memslot"),
> > all shadow pages, i.e. non-leaf SPTEs, need to be zapped.  All of the accounting
> > for a shadow page is tied to the memslot, i.e. the shadow page holds a reference
> > to the memslot, for all intents and purposes.  Deleting the memslot without
> > removing all relevant shadow pages results in NULL pointer derefs when tearing
> > down the VM.
> >
> > Note, that commit is/was buggy, and I suspect my follow-up attempt[*] was as well.
> >
> > https://lore.kernel.org/all/20190820200318.GA15808@xxxxxxxxxxxxxxx
> >
> > Rather than trying to get this functional for shadow paging (which includes nested
> > TDP), I think we should scrap the quirk idea and simply make this the behavior for
> > S-EPT and nothing else.
> Ok. Thanks for identifying this error. Will change code to this way.

For now, I think a full revert of the entire series makes sense.

Irrespective of this bug, I don't think KVM should commit to a specific implementation
behavior, i.e. KVM shouldn't explicitly say that only leaf SPTEs are zapped.  The quirk
docs should instead say that if the quirk is disabled, KVM only guarantees that the
deleted memslot will be inaccessible.  That way, KVM can still do a fast zap when it
makes sense, e.g. do a complete fast zap for large memslots, and for small memslots
do a targeted zap of the TDP MMU but a fast zap of any shadow MMUs.

> BTW: update some findings regarding to the previous bug with Nvidia GPU
> assignment:
> I found that after v5.19-rc1+, even with nx_huge_pages=N, the bug is not
> reproducible when only leaf entries of memslot are zapped.
> (no more detailed info due to limited time to debug).

Heh, I've given up hope on ever finding a root cause for that issue.
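
For completeness, a rough sketch of what that "only guarantee the deleted memslot is
inaccessible" policy could look like in mmu.c.  This is illustrative only, not a real
patch: the helper name and the npages threshold are made up, and the shadow-MMU case
is simplified to a full fast zap rather than a mixed targeted/fast zap.

/* Hypothetical threshold, purely for illustration. */
#define ZAP_MEMSLOT_FAST_NPAGES_THRESHOLD	(512ul * 1024)

static void kvm_mmu_zap_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
{
	struct kvm_gfn_range range = {
		.slot = slot,
		.start = slot->base_gfn,
		.end = slot->base_gfn + slot->npages,
		.may_block = true,
	};
	bool flush = false;

	/*
	 * If shadow MMUs (including nested TDP) are in use, or the memslot is
	 * large enough that a targeted zap isn't worth it, fall back to a
	 * complete fast zap so that the shadow pages that reference the
	 * memslot are torn down as well.
	 */
	if (kvm_memslots_have_rmaps(kvm) ||
	    slot->npages > ZAP_MEMSLOT_FAST_NPAGES_THRESHOLD) {
		kvm_mmu_zap_all_fast(kvm);
		return;
	}

	/*
	 * No rmaps means the TDP MMU is enabled and no shadow roots have been
	 * allocated, so a targeted zap of the TDP MMU is sufficient to make
	 * the memslot inaccessible.
	 */
	write_lock(&kvm->mmu_lock);
	flush = kvm_tdp_mmu_unmap_gfn_range(kvm, &range, flush);
	if (flush)
		kvm_flush_remote_tlbs_memslot(kvm, slot);
	write_unlock(&kvm->mmu_lock);
}

The split on kvm_memslots_have_rmaps() is what keeps the sketch honest with respect to
the bug above: any configuration that can have shadow pages tied to the memslot goes
through the fast zap instead of a leaf-only zap.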