On Thu, Nov 10, 2022 at 12:33:25AM +0000, Sean Christopherson wrote: > On Wed, Nov 09, 2022, Yan Zhao wrote: > > On Tue, Nov 08, 2022 at 05:32:50PM +0000, Sean Christopherson wrote: > > > On Tue, Nov 08, 2022, Yan Zhao wrote: > > > > For memslot delete and move, kvm_invalidate_memslot() is required before > > > > the real changes committed. > > > > Besides swapping to an inactive slot, kvm_invalidate_memslot() will call > > > > kvm_arch_flush_shadow_memslot() and further kvm_page_track_flush_slot() in > > > > arch x86. > > > > And according to the definition in kvm_page_track_notifier_node, users can > > > > drop write-protection for the pages in the memory slot on receiving > > > > .track_flush_slot. > > > > > > Ugh, that's a terrible API. The entire page track API is a mess, e.g. KVMGT is > > > forced to grab its own references to KVM and also needs to manually acquire/release > > > mmu_lock in some flows but not others. > > > > > > Anyways, this is a flaw in the page track API that should be fixed. Flushing a > > > slot should not be overloaded to imply "this slot is gone", it should be a flush > > > command, no more, no less. > > hmm, but KVM also registers to the page track > > "node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;" > > And in kvm_mmu_invalidate_zap_pages_in_memslot, memslot (actaully all the shadow > > page tables) is zapped. And during the zap, unaccount_shadowed() will drop the > > write protection. But KVM is ok because the dropped write-protection can be > > rebuilt during rebuilding the shadow page table. > > So, for .track_flush_slot, it's expected that "users can drop write-protection > > for the pages in the memory slot", right? > > No. KVM isn't actually dropping write-protection, because for the internal KVM > case, KVM obliterates all of its page tables. > > > > AFAICT, KVMGT never flushes anything, so fixing the bug should be a simple matter > > > of adding another hook that's invoked when the memory region change is committed. > > > > > Do you mean adding a new hook in page track, e.g. .track_slot_change? > > Then right before committing slot changes, call this interface to notify slot > > DELETE/MOVE? > > > > Not only KVMGT, KVM can also be affected by this failure to MOVE if it wants to > > support below usecase: > > 1. KVM pre-populated a memslot > > 2. memslot MOVE happened > > 3. KVM pre-populates the new memslot (MOVE succeeds) or the old one (MOVE fails). > > So also add a new interface for the MOVE failure? > > > > > That would allow KVMGT to fix another bug. If a memory region is moved and the > > > new region partially overlaps the old region, KVMGT technically probably wants to > > > retain its write-protection scheme. Though that's probably not worth supporting, > > > might be better to say "don't move memory regions if KVMGT is enabled", because > > > AFAIK there is no VMM that actually moves memory regions (KVM's MOVE support was > > > broken for years and no one noticed). > > > > > So, could we disable MOVE in KVM at all? > > Ideally, yes. Practically? Dunno. It's difficult to know whether or not there > are users out there. > > > > Actually, given that MOVE + KVMGT probably doesn't work regardless of where the > > > region is moved to, maybe we can get away with rejecting MOVE if an external > > > page tracker cares about the slot in question. > > > > > > > However, if kvm_prepare_memory_region() fails, the later > > > > kvm_activate_memslot() will only swap back the original slot, leaving > > > > previous write protection not recovered. > > > > > > > > This may not be a problem for kvm itself as a page tracker user, but may > > > > cause problem to other page tracker users, e.g. kvmgt, whose > > > > write-protected pages are removed from the write-protected list and not > > > > added back. > > > > > > > > So call kvm_prepare_memory_region first for meta data preparation before > > > > the slot invalidation so as to avoid failure and recovery. > > > > > > IIUC, this is purely a theoretical bug and practically speaking can't be a problem > > > in practice, at least not without completely breaking KVMGT. > > > > > > On DELETE, kvm_prepare_memory_region() will never fail on x86 (s390's ultravisor > > > protected VM case is the only scenario where DELETE can fail on any architecture). > > > The common code in kvm_prepare_memory_region() does nothing for DELETE, ditto for > > > kvm_arch_prepare_memory_region(). > > But as long as with current code sequence, we can't relying on that > > kvm_prepare_memory_region() will never fail for DELETE. > > Or, we need to call kvm_prepare_memory_region() only for !DELETE to avoid future > > possible broken. > > Agreed, I just don't want to touch the common memslots code unless it's necessary. Ok. Let me prepare a patch for it. > > > > And for MOVE, it can only fail in two scenarios: (1) the end gfn is beyond the > > > max gfn, i.e. userspace gave bad input or (2) the new memslot is unaligned but > > > the old memslot was not, and so KVM needs to allocate new metadata due to the new > > > memslot needed larger arrays. > > kvm_prepare_memory_region() can also fail for MOVE during live migration when > > memslot->dirty_bitmap allocation is failed in kvm_alloc_dirty_bitmap(). > > and in x86, kvm_arch_prepare_memory_region() can also fail for MOVE if > > kvm_alloc_memslot_metadata() fails due to -ENOMEM. > > BTW, I don't find the "(2) the new memslot is unaligned but the old memslot was not, > > and so KVM needs to allocate new metadata due to the new memslot needed > > larger arrays". > > > > > > As above MOVE is not handled correctly by KVMGT, so I highly doubt there is a VMM > > > that supports KVMGT and moves relevant memory regions, let alove does something > > > that can result in MOVE failing _and_ moves the memslot that KVMGT is shadowing. > > > > > > Heh, and MOVE + KVM_MEM_LOG_DIRTY_PAGES is also arguably broken, as KVM reuses > > > the existing dirty bitmap but doesn't shift the dirty bits. This is likely > > Do you mean, for the new slot in MOVE, the new dirty bitmap should be > > cleared? Otherwise, why shift is required, given mem->userspace_addr and npages > > are not changed and dirty_bitmap is indexed using rel_gfn > > (rel_gfn = gfn - memslot->base_gfn) and both QEMU/KVM aligns the bitmap > > size to BITS_PER_LONG. > > Oh, you're right. I forgot that userspace would also be doing a gfn-relative > calculation in the ridiculously theoretically scenario that a memslot is moved > while it's being dirty-logged. > > > > another combination that KVM can simply reject. > > > > > > Assuming this is indeed purely theoretical, that should be called out in the > > > changelog. Or as above, simply disallow MOVE in this case, which probably has > > > my vote. > > > > > Yes, currently it's a purely theoretical bug, as I'm not seeing MOVE is triggered > > in upstream QEMU. > > > > <...> > > > > > I'm not 100% sure that simply moving kvm_invalidate_memslot() is functionally > > > correct. It might be, but there are so many subtleties in this code that I don't > > > want to change this ordering unless absolutely necessary, or at least not without > > > an in-depth analysis and a pile of tests. E.g. it's possible one or more > > > architectures relies on unmapping, flushing, and invalidating the old region > > > prior to preparing the new region, e.g. to reuse arch memslot data. > > yes. what about just moving kvm_arch_flush_shadow_memslot() and > > kvm_arch_guest_memory_reclaimed() to later than kvm_arch_prepare_memory_region()? > > I'm not dead set against tweaking the memslots flow, but I don't want to do so to > fix this extremely theoretical bug. In other words, I want to fix this by giving > KVM-GT a more appropriate hook, not by shuffling common KVM code to fudge around > a poor KVM x86 API. > > And if KVM-GT moves away from track_flush_slot(), we can delete the hook entirely > after cleaning up clean up another pile of ugly: KVM always registers a page-track > notifier because it relies on the track_flush_slot() call to invoke > kvm_mmu_invalidate_zap_pages_in_memslot(), even when KVM isn't tracking anything. > I'll send patches for this; if/when both land, track_flush_slot() can be deleted > on top. Ok. thanks. I'll watch and decide what to do based on your changes. Thanks Yan