On Tue, Nov 08, 2022 at 05:32:50PM +0000, Sean Christopherson wrote: > On Tue, Nov 08, 2022, Yan Zhao wrote: > > For memslot delete and move, kvm_invalidate_memslot() is required before > > the real changes committed. > > Besides swapping to an inactive slot, kvm_invalidate_memslot() will call > > kvm_arch_flush_shadow_memslot() and further kvm_page_track_flush_slot() in > > arch x86. > > And according to the definition in kvm_page_track_notifier_node, users can > > drop write-protection for the pages in the memory slot on receiving > > .track_flush_slot. > > Ugh, that's a terrible API. The entire page track API is a mess, e.g. KVMGT is > forced to grab its own references to KVM and also needs to manually acquire/release > mmu_lock in some flows but not others. > > Anyways, this is a flaw in the page track API that should be fixed. Flushing a > slot should not be overloaded to imply "this slot is gone", it should be a flush > command, no more, no less. hmm, but KVM also registers to the page track "node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;" And in kvm_mmu_invalidate_zap_pages_in_memslot, memslot (actaully all the shadow page tables) is zapped. And during the zap, unaccount_shadowed() will drop the write protection. But KVM is ok because the dropped write-protection can be rebuilt during rebuilding the shadow page table. So, for .track_flush_slot, it's expected that "users can drop write-protection for the pages in the memory slot", right? > AFAICT, KVMGT never flushes anything, so fixing the bug should be a simple matter > of adding another hook that's invoked when the memory region change is committed. > Do you mean adding a new hook in page track, e.g. .track_slot_change? Then right before committing slot changes, call this interface to notify slot DELETE/MOVE? Not only KVMGT, KVM can also be affected by this failure to MOVE if it wants to support below usecase: 1. KVM pre-populated a memslot 2. memslot MOVE happened 3. KVM pre-populates the new memslot (MOVE succeeds) or the old one (MOVE fails). So also add a new interface for the MOVE failure? > That would allow KVMGT to fix another bug. If a memory region is moved and the > new region partially overlaps the old region, KVMGT technically probably wants to > retain its write-protection scheme. Though that's probably not worth supporting, > might be better to say "don't move memory regions if KVMGT is enabled", because > AFAIK there is no VMM that actually moves memory regions (KVM's MOVE support was > broken for years and no one noticed). > So, could we disable MOVE in KVM at all? > Actually, given that MOVE + KVMGT probably doesn't work regardless of where the > region is moved to, maybe we can get away with rejecting MOVE if an external > page tracker cares about the slot in question. > > > However, if kvm_prepare_memory_region() fails, the later > > kvm_activate_memslot() will only swap back the original slot, leaving > > previous write protection not recovered. > > > > This may not be a problem for kvm itself as a page tracker user, but may > > cause problem to other page tracker users, e.g. kvmgt, whose > > write-protected pages are removed from the write-protected list and not > > added back. > > > > So call kvm_prepare_memory_region first for meta data preparation before > > the slot invalidation so as to avoid failure and recovery. > > IIUC, this is purely a theoretical bug and practically speaking can't be a problem > in practice, at least not without completely breaking KVMGT. > > On DELETE, kvm_prepare_memory_region() will never fail on x86 (s390's ultravisor > protected VM case is the only scenario where DELETE can fail on any architecture). > The common code in kvm_prepare_memory_region() does nothing for DELETE, ditto for > kvm_arch_prepare_memory_region(). But as long as with current code sequence, we can't relying on that kvm_prepare_memory_region() will never fail for DELETE. Or, we need to call kvm_prepare_memory_region() only for !DELETE to avoid future possible broken. > > And for MOVE, it can only fail in two scenarios: (1) the end gfn is beyond the > max gfn, i.e. userspace gave bad input or (2) the new memslot is unaligned but > the old memslot was not, and so KVM needs to allocate new metadata due to the new > memslot needed larger arrays. kvm_prepare_memory_region() can also fail for MOVE during live migration when memslot->dirty_bitmap allocation is failed in kvm_alloc_dirty_bitmap(). and in x86, kvm_arch_prepare_memory_region() can also fail for MOVE if kvm_alloc_memslot_metadata() fails due to -ENOMEM. BTW, I don't find the "(2) the new memslot is unaligned but the old memslot was not, and so KVM needs to allocate new metadata due to the new memslot needed larger arrays". > > As above MOVE is not handled correctly by KVMGT, so I highly doubt there is a VMM > that supports KVMGT and moves relevant memory regions, let alove does something > that can result in MOVE failing _and_ moves the memslot that KVMGT is shadowing. > > Heh, and MOVE + KVM_MEM_LOG_DIRTY_PAGES is also arguably broken, as KVM reuses > the existing dirty bitmap but doesn't shift the dirty bits. This is likely Do you mean, for the new slot in MOVE, the new dirty bitmap should be cleared? Otherwise, why shift is required, given mem->userspace_addr and npages are not changed and dirty_bitmap is indexed using rel_gfn (rel_gfn = gfn - memslot->base_gfn) and both QEMU/KVM aligns the bitmap size to BITS_PER_LONG. > another combination that KVM can simply reject. > > Assuming this is indeed purely theoretical, that should be called out in the > changelog. Or as above, simply disallow MOVE in this case, which probably has > my vote. > Yes, currently it's a purely theoretical bug, as I'm not seeing MOVE is triggered in upstream QEMU. <...> > I'm not 100% sure that simply moving kvm_invalidate_memslot() is functionally > correct. It might be, but there are so many subtleties in this code that I don't > want to change this ordering unless absolutely necessary, or at least not without > an in-depth analysis and a pile of tests. E.g. it's possible one or more > architectures relies on unmapping, flushing, and invalidating the old region > prior to preparing the new region, e.g. to reuse arch memslot data. yes. what about just moving kvm_arch_flush_shadow_memslot() and kvm_arch_guest_memory_reclaimed() to later than kvm_arch_prepare_memory_region()? diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 25d7872b29c1..f9d93be2ead2 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -1716,6 +1716,19 @@ static void kvm_copy_memslot(struct kvm_memory_slot *dest, } static void kvm_invalidate_memslot(struct kvm *kvm, + struct kvm_memory_slot *old) +{ + /* + * From this point no new shadow pages pointing to a deleted, or moved, + * memslot will be created. Validation of sp->gfn happens in: + * - gfn_to_hva (kvm_read_guest, gfn_to_pfn) + * - kvm_is_visible_gfn (mmu_check_root) + */ + kvm_arch_flush_shadow_memslot(kvm, old); + kvm_arch_guest_memory_reclaimed(kvm); +} + +static void kvm_deactive_memslot(struct kvm *kvm, struct kvm_memory_slot *old, struct kvm_memory_slot *invalid_slot) { @@ -1735,15 +1748,6 @@ static void kvm_invalidate_memslot(struct kvm *kvm, */ kvm_swap_active_memslots(kvm, old->as_id); - /* - * From this point no new shadow pages pointing to a deleted, or moved, - * memslot will be created. Validation of sp->gfn happens in: - * - gfn_to_hva (kvm_read_guest, gfn_to_pfn) - * - kvm_is_visible_gfn (mmu_check_root) - */ - kvm_arch_flush_shadow_memslot(kvm, old); - kvm_arch_guest_memory_reclaimed(kvm); - /* Was released by kvm_swap_active_memslots, reacquire. */ mutex_lock(&kvm->slots_arch_lock); @@ -1846,7 +1850,7 @@ static int kvm_set_memslot(struct kvm *kvm, mutex_unlock(&kvm->slots_arch_lock); return -ENOMEM; } - kvm_invalidate_memslot(kvm, old, invalid_slot); + kvm_deactive_memslot(kvm, old, invalid_slot); } r = kvm_prepare_memory_region(kvm, old, new, change); @@ -1866,6 +1870,9 @@ static int kvm_set_memslot(struct kvm *kvm, return r; } + if (change == KVM_MR_DELETE || change == KVM_MR_MOVE) + kvm_invalidate_memslot(kvm, old); + /* * For DELETE and MOVE, the working slot is now active as the INVALID * version of the old slot. MOVE is particularly special as it reuses Thanks Yan