On 9/28/22 17:58, Sean Christopherson wrote:
On Wed, Sep 28, 2022, Paolo Bonzini wrote:
On 9/27/22 17:58, Sean Christopherson wrote:
I'm pretty sure this patch will Just Work for QEMU, because QEMU simply resumes
the vCPU if mmio.len==0. It's a bit of a hack, but I don't think it violates KVM's
ABI in any way, and it can even become "official" behavior since KVM x86 doesn't
otherwise exit with mmio.len==0.
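(For illustration only: a minimal userspace run loop along those lines might look like the sketch below. It is not QEMU's actual code, and it assumes the vCPU fd and the mmap()ed kvm_run structure were set up elsewhere.)

#include <linux/kvm.h>
#include <stdio.h>
#include <sys/ioctl.h>

static int run_vcpu(int vcpu_fd, struct kvm_run *run)
{
    for (;;) {
        if (ioctl(vcpu_fd, KVM_RUN, 0) < 0) {
            perror("KVM_RUN");
            return -1;
        }

        switch (run->exit_reason) {
        case KVM_EXIT_MMIO:
            if (run->mmio.len == 0) {
                /*
                 * Zero-length MMIO exit: nothing to emulate, just
                 * re-enter the guest and let it retry the fetch.
                 */
                continue;
            }
            /* ... normal MMIO read/write emulation goes here ... */
            continue;
        case KVM_EXIT_INTERNAL_ERROR:
            fprintf(stderr, "internal error, suberror %u\n",
                    run->internal.suberror);
            return -1;
        default:
            fprintf(stderr, "unhandled exit %u\n", run->exit_reason);
            return -1;
        }
    }
}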
I think this patch is not a good idea for two reasons:
1) we don't know how userspace behaves if mmio.len is zero. It is of course
reasonable to do nothing, but an assertion failure is also a valid behavior.
Except that KVM currently does neither. If the fetch happens at CPL>0 and/or in
L2, KVM injects #UD. That's flat out architecturally invalid. If it's a sticking
point, the mmio.len==0 hack can be avoided by defining a new exit reason.
I agree that doing this at CPL>0 or in L2 is invalid and makes little
sense (because either way the invalid address cannot be reached without
help from the supervisor or L1's page tables).
2) more important, there is no way to distinguish a failure due to the guest
going in the weeds (and then KVM_EXIT_INTERNAL_ERROR is fine) from one due
to the KVM_SET_USER_MEMORY_REGION race condition. So this will cause a
guest that correctly caused an internal error to loop forever.
Userspace has the GPA and absolutely should be able to detect if the MMIO may have
been due to its memslot manipulation versus the guest jumping into the weeds.
While the former could be handled in a "wait and see" manner, the latter in
particular is part of the KVM_RUN contract. Of course it is possible for a
guest to just loop forever, but in general all of KVM, QEMU and upper
userspace layers want a crashed guest to be detected and stopped forever.
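(Purely as a sketch of that GPA-based check, with all names invented for illustration: userspace could track the GPA ranges whose memslots are currently being rewritten and only retry exits that fall inside them, e.g.:)

#include <stdbool.h>
#include <stdint.h>

struct gpa_range {
    uint64_t start;
    uint64_t len;
};

/* GPA ranges covered by an in-progress memslot update (hypothetical). */
static struct gpa_range inflight[16];
static int nr_inflight;

static bool gpa_update_in_progress(uint64_t gpa)
{
    for (int i = 0; i < nr_inflight; i++) {
        if (gpa >= inflight[i].start &&
            gpa - inflight[i].start < inflight[i].len)
            return true;
    }
    return false;
}

/*
 * Called on a zero-length MMIO exit with run->mmio.phys_addr:
 * true  -> the fetch raced with a memslot update, resume and retry;
 * false -> the guest jumped into the weeds, report it as crashed.
 */
static bool should_retry_stale_fetch(uint64_t gpa)
{
    return gpa_update_in_progress(gpa);
}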
Yes, QEMU could loop only if memslot updates are in progress, but honestly
all the alternatives I have seen to atomic memslot updates are really
*awful*. David's patches even invent a new kind of mutex for which I have
absolutely no idea what kind of deadlocks one should worry about and why
they should not exist; QEMU's locking is already pretty crappy, and it's
certainly not on my wishlist to make it worse!
This is clearly a deficiency in the KVM kernel API, and (thanks to SRCU) the
kernel is the only place where you can have a *good* fix. It should have
been fixed years ago.
I don't disagree that the memslots API is lacking, but IMO that is somewhat
orthogonal to fixing KVM x86's "code fetch to MMIO" mess. Such a massive new API
should be viewed and prioritized as a new feature, not as a bug fix, e.g. I'd
like to have the luxury of being able to explore ideas beyond "let userspace
batch memslot updates", and I really don't want to feel pressured to get this
code reviewed and merged.
I absolutely agree that this is not a bugfix. Most new features for KVM
can be seen as bug fixes if you squint hard enough, but they're still
features.
E.g. why do a batch update and not provide KVM_SET_ALL_USER_MEMORY_REGIONS to
do wholesale replacement? That seems like it would be vastly simpler to handle
on KVM's end. Or maybe there's a solution in the opposite direction, e.g. an
API that allows 1->N or N->1 conversions but not arbitrary batching.
Wholesale replacement was my first idea when I looked at the issue, I
think at the end of 2020. I never got to a full implementation, but my
impression was that allocating/deallocating dirty bitmaps, rmaps etc.
would not make it any easier than arbitrary batch updates.
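(To make the wholesale-replacement idea concrete, the uapi could look roughly like the sketch below. KVM_SET_ALL_USER_MEMORY_REGIONS does not exist today and the struct is invented for illustration; the catch, as noted above, is that every slot in the new layout would still need its dirty bitmaps, rmaps etc. allocated before the old layout can be torn down.)

#include <linux/kvm.h>      /* struct kvm_userspace_memory_region, KVMIO */
#include <linux/types.h>

/* Hypothetical: replace the entire memslot layout in one atomic ioctl. */
struct kvm_all_memory_regions {
    __u32 nr_regions;       /* number of entries in regions[] */
    __u32 pad;
    struct kvm_userspace_memory_region regions[];
};

/*
 * Hypothetical ioctl number, for illustration only:
 * #define KVM_SET_ALL_USER_MEMORY_REGIONS \
 *         _IOW(KVMIO, 0xff, struct kvm_all_memory_regions)
 */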
And just because QEMU's locking is "already pretty crappy", that's not a good
reason to drag KVM down into the mud. E.g. taking a lock and conditionally
releasing it... I get that this is an RFC, but IMO anything that requires such
shenanigans simply isn't acceptable.
  /*
   * Takes kvm->slots_arch_lock, and releases it only if
   * invalid_slot allocation, kvm_prepare_memory_region failed
   * or batch->is_move_delete is true.
   */
  static int kvm_prepare_memslot(struct kvm *kvm,
                                 struct kvm_internal_memory_region_list *batch)
No objection about that. :)
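(For contrast, the locking discipline being asked for might look like the hypothetical sketch below: the lock is taken and released at a single level, and the helper never drops a lock it did not acquire. kvm_update_memslot() and kvm_prepare_memslot_locked() are invented names, not the actual series; struct kvm_internal_memory_region_list is from the quoted RFC.)

#include <linux/kvm_host.h>

static int kvm_prepare_memslot_locked(struct kvm *kvm,
                                      struct kvm_internal_memory_region_list *batch)
{
    lockdep_assert_held(&kvm->slots_arch_lock);
    /* ... allocate the invalid slot, call kvm_prepare_memory_region(), ... */
    return 0;
}

static int kvm_update_memslot(struct kvm *kvm,
                              struct kvm_internal_memory_region_list *batch)
{
    int r;

    mutex_lock(&kvm->slots_arch_lock);
    r = kvm_prepare_memslot_locked(kvm, batch);
    mutex_unlock(&kvm->slots_arch_lock);

    return r;
}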
Paolo