On Wed, Oct 26, 2022, Paolo Bonzini wrote: > On 10/26/22 01:07, Sean Christopherson wrote: > > > > - to stop anything else in the system that consumes KVM memslots, e.g. KVM GT > > > > > > Is this true if you only look at the KVM_GET_DIRTY_LOG case and consider it > > > a guest bug to access the memory (i.e. ignore the strange read-only changes > > > which only happen at boot, and which I agree are QEMU-specific)? > > > > Yes? I don't know exactly what "the KVM_GET_DIRTY_LOG case" is. > > It is not possible to atomically read the dirty bitmap and delete a memslot. > When you delete a memslot, the bitmap is gone. In this case however memory > accesses to the deleted memslot are a guest bug, so stopping KVM-GT would > not be necessary. If accesses to the deleted memslot are a guest bug, why do you care about pausing vCPUs? I don't mean to be beligerent, I'm genuinely confused. > So while I'm being slowly convinced that QEMU should find a way to pause its > vCPUs around memslot changes, I'm not sure that pausing everything is needed > in general. > > > > > And because of the nature of KVM, to support this API on all architectures, KVM > > > > needs to make change on all architectures, whereas userspace should be able to > > > > implement a generic solution. > > > > > > Yes, I agree that this is essentially just a more efficient kill(). > > > Emanuele, perhaps you can put together a patch to x86/vmexit.c in > > > kvm-unit-tests, where CPU0 keeps changing memslots and the other CPUs are in > > > a for(;;) busy wait, to measure the various ways to do it? > > > > I'm a bit confused. Is the goal of this to simplify QEMU, dedup VMM code, provide > > a more performant solution, something else entirely? > > Well, a bit of all of them and perhaps that's the problem. And while the > issues at hand *are* self-inflicted wounds on part of QEMU, it seems to me > that the underlying issues are general. > > For example, Alex Graf and I looked back at your proposal of a userspace > exit for "bad" accesses to memory, wondering if it could help with Hyper-V > VTLs too. To recap, the "higher privileged" code at VTL1 can set up VM-wide > restrictions on access to some pages through a hypercall > (HvModifyVtlProtectionMask). After the hypercall, VTL0 would not be able to > access those pages. The hypercall would be handled in userspace and would > invoke a KVM_SET_MEMORY_REGION_PERM ioctl to restrict the RWX permissions, > and this ioctl would set up a VM-wide permission bitmap that would be used > when building page tables. > > Using such a bitmap instead of memslots makes it possible to cause userspace > vmexits on VTL mapping violations with efficient data structures. And it > would also be possible to use this mechanism around KVM_GET_DIRTY_LOG, to > read the KVM dirty bitmap just before removing a memslot. What exactly is the behavior you're trying to achieve for KVM_GET_DIRTY_LOG => delete? If KVM provides KVM_EXIT_MEMORY_FAULT, can you not achieve the desired behavior by doing mprotect(PROT_NONE) => KVM_GET_DIRTY_LOG => delete? If PROT_NONE causes the memory to be freed, won't mprotect(PROT_READ) do what you want even without KVM_EXIT_MEMORY_FAULT? > However, external accesses to the regions (ITS, Xen, KVM-GT, non KVM_RUN > ioctls) would not be blocked, due to the lack of a way to report the exit. Aren't all of those out of scope? E.g. in a very hypothetical world where XEN's event channel is being used with VTLs, if VTL1 makes the event channel inaccessible, that's a guest and/or userspace configuration issue and the guest is hosed no matter what KVM does. Ditto for these case where KVM-GT's buffer is blocked. I'm guessing the ITS is similar? > The intersection of these features with VTLs should be very small (sometimes > zero since VTLs are x86 only), but the ioctls would be a problem so I'm > wondering what your thoughts are on this. How do the ioctls() map to VTLs? I.e. are they considered VTL0, VTL1, out-of-band? > Also, while the exit API could be the same, it is not clear to me that the > permission bitmap would be a good match for entirely "void" memslots used to > work around non-atomic memslot changes. So for now let's leave this aside > and only consider the KVM_GET_DIRTY_LOG case. As above, can't userspace just mprotect() the entire memslot to prevent writes between getting the dirty log and deleting the memslot?