On Tue, Nov 08, 2022, Yan Zhao wrote:
> For memslot delete and move, kvm_invalidate_memslot() is required before
> the real changes are committed.
> Besides swapping to an inactive slot, kvm_invalidate_memslot() will call
> kvm_arch_flush_shadow_memslot() and further kvm_page_track_flush_slot() in
> arch x86.
> And according to the definition in kvm_page_track_notifier_node, users can
> drop write-protection for the pages in the memory slot on receiving
> .track_flush_slot.

Ugh, that's a terrible API.  The entire page track API is a mess, e.g. KVMGT
is forced to grab its own references to KVM and also needs to manually
acquire/release mmu_lock in some flows but not others.

Anyways, this is a flaw in the page track API that should be fixed.  Flushing
a slot should not be overloaded to imply "this slot is gone", it should be a
flush command, no more, no less.  AFAICT, KVMGT never flushes anything, so
fixing the bug should be a simple matter of adding another hook that's invoked
when the memory region change is committed (rough sketch of what that could
look like at the bottom of this mail).

That would also allow KVMGT to fix another bug.  If a memory region is moved
and the new region partially overlaps the old region, KVMGT technically
probably wants to retain its write-protection scheme.  Though that's probably
not worth supporting, it might be better to say "don't move memory regions if
KVMGT is enabled", because AFAIK there is no VMM that actually moves memory
regions (KVM's MOVE support was broken for years and no one noticed).

Actually, given that MOVE + KVMGT probably doesn't work regardless of where
the region is moved to, maybe we can get away with rejecting MOVE if an
external page tracker cares about the slot in question.

> However, if kvm_prepare_memory_region() fails, the later
> kvm_activate_memslot() will only swap back the original slot, leaving
> the previous write protection not recovered.
>
> This may not be a problem for kvm itself as a page tracker user, but may
> cause problems for other page tracker users, e.g. kvmgt, whose
> write-protected pages are removed from the write-protected list and not
> added back.
>
> So call kvm_prepare_memory_region() first for metadata preparation before
> the slot invalidation so as to avoid the failure and recovery.

IIUC, this is a purely theoretical bug that can't be hit in practice, at
least not without completely breaking KVMGT.

On DELETE, kvm_prepare_memory_region() will never fail on x86 (s390's
ultravisor protected VM case is the only scenario where DELETE can fail on
any architecture).  The common code in kvm_prepare_memory_region() does
nothing for DELETE, ditto for kvm_arch_prepare_memory_region().

And for MOVE, it can only fail in two scenarios: (1) the end gfn is beyond
the max gfn, i.e. userspace gave bad input, or (2) the new memslot is
unaligned but the old memslot was not, and so KVM needs to allocate new
metadata because the new memslot needs larger arrays.

As above, MOVE is not handled correctly by KVMGT, so I highly doubt there is
a VMM that supports KVMGT and moves relevant memory regions, let alone does
something that can result in MOVE failing _and_ moves the memslot that KVMGT
is shadowing.

Heh, and MOVE + KVM_MEM_LOG_DIRTY_PAGES is also arguably broken, as KVM
reuses the existing dirty bitmap but doesn't shift the dirty bits.  This is
likely another combination that KVM can simply reject.

Assuming this is indeed purely theoretical, that should be called out in the
changelog.  Or as above, simply disallow MOVE in this case, which probably
has my vote.
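Completely untested, and kvm_page_track_has_external_user() is a made-up
helper (it would need to track whether anything other than KVM's own
mmu_sp_tracker node has registered a page-track notifier), but rejecting
MOVE could be as simple as the below; everything other than the new check
is just the existing x86 code:

int kvm_arch_prepare_memory_region(struct kvm *kvm,
				   const struct kvm_memory_slot *old,
				   struct kvm_memory_slot *new,
				   enum kvm_mr_change change)
{
	/*
	 * An external write-protection user (KVMGT) can't follow the slot to
	 * its new base gfn, so reject MOVE outright if anything other than
	 * KVM's own MMU is tracking pages in this VM.  (Hypothetical helper,
	 * doesn't exist today.)
	 */
	if (change == KVM_MR_MOVE && kvm_page_track_has_external_user(kvm))
		return -EINVAL;

	if (change == KVM_MR_CREATE || change == KVM_MR_MOVE) {
		if ((new->base_gfn + new->npages - 1) > kvm_mmu_max_gfn())
			return -EINVAL;

		return kvm_alloc_memslot_metadata(kvm, new);
	}

	if (change == KVM_MR_FLAGS_ONLY)
		memcpy(&new->arch, &old->arch, sizeof(old->arch));
	else if (WARN_ON_ONCE(change != KVM_MR_DELETE))
		return -EIO;

	return 0;
}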
> Signed-off-by: Yan Zhao <yan.y.zhao@xxxxxxxxx>
> ---
>  virt/kvm/kvm_main.c | 40 +++++++++++++++-------------------------
>  1 file changed, 15 insertions(+), 25 deletions(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 25d7872b29c1..5f29011f432d 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1827,45 +1827,35 @@ static int kvm_set_memslot(struct kvm *kvm,
>  	 */
>  	mutex_lock(&kvm->slots_arch_lock);
>
> -	/*
> -	 * Invalidate the old slot if it's being deleted or moved.  This is
> -	 * done prior to actually deleting/moving the memslot to allow vCPUs to
> -	 * continue running by ensuring there are no mappings or shadow pages
> -	 * for the memslot when it is deleted/moved.  Without pre-invalidation
> -	 * (and without a lock), a window would exist between effecting the
> -	 * delete/move and committing the changes in arch code where KVM or a
> -	 * guest could access a non-existent memslot.
> -	 *
> -	 * Modifications are done on a temporary, unreachable slot.  The old
> -	 * slot needs to be preserved in case a later step fails and the
> -	 * invalidation needs to be reverted.
> -	 */
>  	if (change == KVM_MR_DELETE || change == KVM_MR_MOVE) {
>  		invalid_slot = kzalloc(sizeof(*invalid_slot), GFP_KERNEL_ACCOUNT);
>  		if (!invalid_slot) {
>  			mutex_unlock(&kvm->slots_arch_lock);
>  			return -ENOMEM;
>  		}
> -		kvm_invalidate_memslot(kvm, old, invalid_slot);
>  	}
>
>  	r = kvm_prepare_memory_region(kvm, old, new, change);
>  	if (r) {
> -		/*
> -		 * For DELETE/MOVE, revert the above INVALID change.  No
> -		 * modifications required since the original slot was preserved
> -		 * in the inactive slots.  Changing the active memslots also
> -		 * release slots_arch_lock.
> -		 */
> -		if (change == KVM_MR_DELETE || change == KVM_MR_MOVE) {
> -			kvm_activate_memslot(kvm, invalid_slot, old);
> +		if (change == KVM_MR_DELETE || change == KVM_MR_MOVE)
>  			kfree(invalid_slot);
> -		} else {
> -			mutex_unlock(&kvm->slots_arch_lock);
> -		}
> +
> +		mutex_unlock(&kvm->slots_arch_lock);
>  		return r;
>  	}
>
> +	/*
> +	 * Invalidate the old slot if it's being deleted or moved.  This is
> +	 * done prior to actually deleting/moving the memslot to allow vCPUs to
> +	 * continue running by ensuring there are no mappings or shadow pages
> +	 * for the memslot when it is deleted/moved.  Without pre-invalidation
> +	 * (and without a lock), a window would exist between effecting the
> +	 * delete/move and committing the changes in arch code where KVM or a
> +	 * guest could access a non-existent memslot.
> +	 */
> +	if (change == KVM_MR_DELETE || change == KVM_MR_MOVE)
> +		kvm_invalidate_memslot(kvm, old, invalid_slot);

I'm not 100% sure that simply moving kvm_invalidate_memslot() is functionally
correct.  It might be, but there are so many subtleties in this code that I
don't want to change this ordering unless absolutely necessary, or at least
not without an in-depth analysis and a pile of tests.

E.g. it's possible one or more architectures rely on unmapping, flushing, and
invalidating the old region prior to preparing the new region, e.g. to reuse
arch memslot data.

> +
>  	/*
>  	 * For DELETE and MOVE, the working slot is now active as the INVALID
>  	 * version of the old slot.  MOVE is particularly special as it reuses
> --
> 2.17.1
>
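As for adding another hook that fires when the change is committed: completely
untested, the name and plumbing are made up, but I'm thinking of a callback in
kvm_page_track_notifier_node that mirrors .track_flush_slot, plus a notifier
walk that mirrors kvm_page_track_flush_slot(), e.g.:

/*
 * Sketch only.  New member in struct kvm_page_track_notifier_node, alongside
 * .track_write and .track_flush_slot, invoked only once a DELETE/MOVE has
 * actually been committed:
 *
 *	void (*track_remove_region)(struct kvm *kvm,
 *				    struct kvm_memory_slot *slot,
 *				    struct kvm_page_track_notifier_node *node);
 */

void kvm_page_track_remove_region(struct kvm *kvm, struct kvm_memory_slot *slot)
{
	struct kvm_page_track_notifier_head *head;
	struct kvm_page_track_notifier_node *n;
	int idx;

	head = &kvm->arch.track_notifier_head;

	if (hlist_empty(&head->track_notifier_list))
		return;

	idx = srcu_read_lock(&head->track_srcu);
	hlist_for_each_entry_srcu(n, &head->track_notifier_list, node,
				  srcu_read_lock_held(&head->track_srcu))
		if (n->track_remove_region)
			n->track_remove_region(kvm, slot, n);
	srcu_read_unlock(&head->track_srcu, idx);
}

The call would come from the commit side, e.g. kvm_arch_commit_memory_region()
when change == KVM_MR_DELETE || change == KVM_MR_MOVE, at which point KVMGT
could drop its write-protection bookkeeping for the old slot instead of doing
so from the flush path.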