Re: [PATCH] KVM: move memslot invalidation later than possible failures

Yan Zhao <yan.y.zhao@xxxxxxxxx> · Thu, 10 Nov 2022 08:50:10 +0800



On Thu, Nov 10, 2022 at 12:33:25AM +0000, Sean Christopherson wrote:
> On Wed, Nov 09, 2022, Yan Zhao wrote:
> > On Tue, Nov 08, 2022 at 05:32:50PM +0000, Sean Christopherson wrote:
> > > On Tue, Nov 08, 2022, Yan Zhao wrote:
> > > > For memslot delete and move, kvm_invalidate_memslot() is required before
> > > > the real changes committed.
> > > > Besides swapping to an inactive slot, kvm_invalidate_memslot() will call
> > > > kvm_arch_flush_shadow_memslot() and further kvm_page_track_flush_slot() in
> > > > arch x86.
> > > > And according to the definition in kvm_page_track_notifier_node, users can
> > > > drop write-protection for the pages in the memory slot on receiving
> > > > .track_flush_slot.
> > > 
> > > Ugh, that's a terrible API.  The entire page track API is a mess, e.g. KVMGT is
> > > forced to grab its own references to KVM and also needs to manually acquire/release
> > > mmu_lock in some flows but not others.
> > > 
> > > Anyways, this is a flaw in the page track API that should be fixed.  Flushing a
> > > slot should not be overloaded to imply "this slot is gone", it should be a flush
> > > command, no more, no less.
> > hmm, but KVM also registers to the page track
> > "node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;"
> > And in kvm_mmu_invalidate_zap_pages_in_memslot, memslot (actaully all the shadow
> > page tables) is zapped. And during the zap, unaccount_shadowed() will drop the
> > write protection. But KVM is ok because the dropped write-protection can be
> > rebuilt during rebuilding the shadow page table.
> > So, for .track_flush_slot, it's expected that "users can drop write-protection
> > for the pages in the memory slot", right?
> 
> No.  KVM isn't actually dropping write-protection, because for the internal KVM
> case, KVM obliterates all of its page tables.
> 
> > > AFAICT, KVMGT never flushes anything, so fixing the bug should be a simple matter
> > > of adding another hook that's invoked when the memory region change is committed.
> > >
> > Do you mean adding a new hook in page track, e.g. .track_slot_change?
> > Then right before committing slot changes, call this interface to notify slot
> > DELETE/MOVE?
> > 
> > Not only KVMGT, KVM can also be affected by this failure to MOVE if it wants to
> > support below usecase:
> > 1. KVM pre-populated a memslot
> > 2. memslot MOVE happened
> > 3. KVM pre-populates the new memslot (MOVE succeeds) or the old one (MOVE fails).
> > So also add a new interface for the MOVE failure?
> > 
> > > That would allow KVMGT to fix another bug.  If a memory region is moved and the
> > > new region partially overlaps the old region, KVMGT technically probably wants to
> > > retain its write-protection scheme.  Though that's probably not worth supporting,
> > > might be better to say "don't move memory regions if KVMGT is enabled", because
> > > AFAIK there is no VMM that actually moves memory regions (KVM's MOVE support was
> > > broken for years and no one noticed).
> > >
> > So, could we disable MOVE in KVM at all?
> 
> Ideally, yes.  Practically?  Dunno.  It's difficult to know whether or not there
> are users out there.
> 
> > > Actually, given that MOVE + KVMGT probably doesn't work regardless of where the
> > > region is moved to, maybe we can get away with rejecting MOVE if an external
> > > page tracker cares about the slot in question.
> > > 
> > > > However, if kvm_prepare_memory_region() fails, the later
> > > > kvm_activate_memslot() will only swap back the original slot, leaving
> > > > previous write protection not recovered.
> > > > 
> > > > This may not be a problem for kvm itself as a page tracker user, but may
> > > > cause problem to other page tracker users, e.g. kvmgt, whose
> > > > write-protected pages are removed from the write-protected list and not
> > > > added back.
> > > > 
> > > > So call kvm_prepare_memory_region first for meta data preparation before
> > > > the slot invalidation so as to avoid failure and recovery.
> > > 
> > > IIUC, this is purely a theoretical bug and practically speaking can't be a problem
> > > in practice, at least not without completely breaking KVMGT.
> > > 
> > > On DELETE, kvm_prepare_memory_region() will never fail on x86 (s390's ultravisor
> > > protected VM case is the only scenario where DELETE can fail on any architecture).
> > > The common code in kvm_prepare_memory_region() does nothing for DELETE, ditto for
> > > kvm_arch_prepare_memory_region().
> > But as long as with current code sequence, we can't relying on that
> > kvm_prepare_memory_region() will never fail for DELETE.
> > Or, we need to call kvm_prepare_memory_region() only for !DELETE to avoid future
> > possible broken.
> 
> Agreed, I just don't want to touch the common memslots code unless it's necessary.
Ok. Let me prepare a patch for it.

> 
> > > And for MOVE, it can only fail in two scenarios: (1) the end gfn is beyond the
> > > max gfn, i.e. userspace gave bad input or (2) the new memslot is unaligned but
> > > the old memslot was not, and so KVM needs to allocate new metadata due to the new
> > > memslot needed larger arrays.
> > kvm_prepare_memory_region() can also fail for MOVE during live migration when
> > memslot->dirty_bitmap allocation is failed in kvm_alloc_dirty_bitmap().
> > and in x86, kvm_arch_prepare_memory_region() can also fail for MOVE if
> > kvm_alloc_memslot_metadata() fails due to -ENOMEM. 
> > BTW, I don't find the "(2) the new memslot is unaligned but the old memslot was not,
> > and so KVM needs to allocate new metadata due to the new memslot needed
> > larger arrays". 
> > > 
> > > As above MOVE is not handled correctly by KVMGT, so I highly doubt there is a VMM
> > > that supports KVMGT and moves relevant memory regions, let alove does something
> > > that can result in MOVE failing _and_ moves the memslot that KVMGT is shadowing.
> > > 
> > > Heh, and MOVE + KVM_MEM_LOG_DIRTY_PAGES is also arguably broken, as KVM reuses
> > > the existing dirty bitmap but doesn't shift the dirty bits.  This is likely
> > Do you mean, for the new slot in MOVE, the new dirty bitmap should be
> > cleared? Otherwise, why shift is required, given mem->userspace_addr and npages
> > are not changed and dirty_bitmap is indexed using rel_gfn 
> > (rel_gfn = gfn - memslot->base_gfn) and both QEMU/KVM aligns the bitmap
> > size to BITS_PER_LONG.
> 
> Oh, you're right.  I forgot that userspace would also be doing a gfn-relative
> calculation in the ridiculously theoretically scenario that a memslot is moved
> while it's being dirty-logged.
> 
> > > another combination that KVM can simply reject.
> > > 
> > > Assuming this is indeed purely theoretical, that should be called out in the
> > > changelog.  Or as above, simply disallow MOVE in this case, which probably has
> > > my vote.
> > >
> > Yes, currently it's a purely theoretical bug, as I'm not seeing MOVE is triggered
> > in upstream QEMU.
> > 
> > <...>
> > 
> > > I'm not 100% sure that simply moving kvm_invalidate_memslot() is functionally
> > > correct.  It might be, but there are so many subtleties in this code that I don't
> > > want to change this ordering unless absolutely necessary, or at least not without
> > > an in-depth analysis and a pile of tests.  E.g. it's possible one or more
> > > architectures relies on unmapping, flushing, and invalidating the old region
> > > prior to preparing the new region, e.g. to reuse arch memslot data.
> > yes. what about just moving kvm_arch_flush_shadow_memslot() and
> > kvm_arch_guest_memory_reclaimed() to later than kvm_arch_prepare_memory_region()?
> 
> I'm not dead set against tweaking the memslots flow, but I don't want to do so to
> fix this extremely theoretical bug.  In other words, I want to fix this by giving
> KVM-GT a more appropriate hook, not by shuffling common KVM code to fudge around
> a poor KVM x86 API.
> 
> And if KVM-GT moves away from track_flush_slot(), we can delete the hook entirely
> after cleaning up clean up another pile of ugly: KVM always registers a page-track
> notifier because it relies on the track_flush_slot() call to invoke
> kvm_mmu_invalidate_zap_pages_in_memslot(), even when KVM isn't tracking anything.
> I'll send patches for this; if/when both land, track_flush_slot() can be deleted
> on top.
Ok. thanks. I'll watch and decide what to do based on your changes.

Thanks
Yan