On Wed, Sep 28, 2022, Paolo Bonzini wrote:
> On 9/27/22 17:58, Sean Christopherson wrote:
> > I'm pretty sure this patch will Just Work for QEMU, because QEMU simply resumes
> > the vCPU if mmio.len==0.  It's a bit of a hack, but I don't think it violates KVM's
> > ABI in any way, and it can even become "official" behavior since KVM x86 doesn't
> > otherwise exit with mmio.len==0.
>
> I think this patch is not a good idea for two reasons:
>
> 1) we don't know how userspace behaves if mmio.len is zero.  It is of course
> reasonable to do nothing, but an assertion failure is also a valid behavior

Except that KVM currently does neither.  If the fetch happens at CPL>0 and/or in L2,
KVM injects #UD, which is flat out architecturally invalid.  If it's a sticking
point, the mmio.len==0 hack can be avoided by defining a new exit reason.

> 2) more important, there is no way to distinguish a failure due to the guest
> going in the weeds (and then KVM_EXIT_INTERNAL_ERROR is fine) from one due
> to the KVM_SET_USER_MEMORY_REGION race condition.  So this will cause a
> guest that correctly caused an internal error to loop forever.

Userspace has the GPA and absolutely should be able to detect if the MMIO may have
been due to its own memslot manipulation versus the guest jumping into the weeds.

> While the former could be handled in a "wait and see" manner, the latter in
> particular is part of the KVM_RUN contract.  Of course it is possible for a
> guest to just loop forever, but in general all of KVM, QEMU and upper
> userspace layers want a crashed guest to be detected and stopped forever.
>
> Yes, QEMU could loop only if memslot updates are in progress, but honestly
> all the alternatives I have seen to atomic memslot updates are really
> *awful*.  David's patches even invent a new kind of mutex for which I have
> absolutely no idea what kind of deadlocks one should worry about and why
> they should not exist; QEMU's locking is already pretty crappy, it's
> certainly not on my wishlist to make it worse!
>
> This is clearly a deficiency in the KVM kernel API, and (thanks to SRCU) the
> kernel is the only place where you can have a *good* fix.  It should have
> been fixed years ago.

I don't disagree that the memslots API is lacking, but IMO that is somewhat
orthogonal to fixing KVM x86's "code fetch to MMIO" mess.  Such a massive new API
should be viewed and prioritized as a new feature, not as a bug fix, e.g. I'd like
to have the luxury of being able to explore ideas beyond "let userspace batch
memslot updates", and I really don't want to feel pressured to get this code
reviewed and merged.

E.g. why do a batch update and not provide KVM_SET_ALL_USER_MEMORY_REGIONS to do
a wholesale replacement?  That seems like it would be vastly simpler to handle on
KVM's end.  Or maybe there's a solution in the opposite direction, e.g. an API
that allows 1->N or N->1 conversions but not arbitrary batching.

And just because QEMU's locking is "already pretty crappy", that's not a good
reason to drag KVM down into the mud.  E.g. taking a lock and conditionally
releasing it...  I get that this is an RFC, but IMO anything that requires such
shenanigans simply isn't acceptable.

  /*
   * Takes kvm->slots_arch_lock, and releases it only if
   * invalid_slot allocation, kvm_prepare_memory_region failed
   * or batch->is_move_delete is true.
   */
  static int kvm_prepare_memslot(struct kvm *kvm,
                                 struct kvm_internal_memory_region_list *batch)
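
For illustration, a minimal sketch of what the userspace side of the "userspace has
the GPA" argument could look like, assuming a hypothetical memslot_update_in_flight()
helper that tracks the VMM's own in-flight KVM_SET_USER_MEMORY_REGION updates (this
is not QEMU code, just a sketch of how a zero-length MMIO exit could be retried only
when userspace itself punched the hole):

  #include <linux/kvm.h>
  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <sys/ioctl.h>

  /*
   * Hypothetical helper: returns true if the VMM is currently updating a
   * memslot that covers @gpa (i.e. the access hit a transient hole).
   */
  extern bool memslot_update_in_flight(uint64_t gpa);

  static int run_vcpu(int vcpu_fd, struct kvm_run *run)
  {
          for (;;) {
                  if (ioctl(vcpu_fd, KVM_RUN, 0) < 0) {
                          perror("KVM_RUN");
                          return -1;
                  }

                  switch (run->exit_reason) {
                  case KVM_EXIT_MMIO:
                          if (!run->mmio.len) {
                                  /*
                                   * Zero-length MMIO: the access hit a hole in
                                   * the memslots.  Retry only if the hole was
                                   * created by our own memslot manipulation,
                                   * otherwise the guest went into the weeds.
                                   */
                                  if (memslot_update_in_flight(run->mmio.phys_addr))
                                          continue;
                                  fprintf(stderr,
                                          "guest accessed unmapped GPA 0x%llx\n",
                                          (unsigned long long)run->mmio.phys_addr);
                                  return -1;
                          }
                          /* Normal MMIO emulation would go here. */
                          continue;
                  default:
                          /* Let the caller deal with other exit reasons. */
                          return 0;
                  }
          }
  }

The point being that userspace isn't forced into a retry-forever loop: the GPA check
bounds the retry to exactly the window where userspace itself is rearranging memslots.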