On Thu, Sep 29, 2022, Paolo Bonzini wrote: > > On 9/28/22 22:41, Sean Christopherson wrote: > If you think we should overhaul it even more than just providing atomic > batched updates, that's fine. Or maybe solve the problem(s) with more precision? What I don't like about the batching is that it's not "atomic", and AFAICT doesn't have line of sight to truly becoming atomic. When I hear "atomic" updates, I think "all transactions are commited at once". That is not what this proposed API does, and due to the TLB flushing requirements, I don't see how it can be implemented. > But still, the impossibility to perform > atomic updates in batches *is* a suboptimal part of the KVM API. I don't disagree, but I honestly don't see the difference between "batching" and providing a KVM_MEM_USER_EXIT flag. The batch API will provide better performance, but I would be surprised if it's significant enough to matter. I'm not saying that KVM_MEM_USER_EXIT provides novel functionality, e.g. I think we're in agreement that handling memslot metadata will suck regardless. My point is that it's simpler to implement in KVM, much easier to document, and for all intents and purposes provides the same desired functionality: vCPUs that access in-flux memslots are stalled and so userspace doesn't have to pause and rendezvous all vCPUs to provide "atomic" updates. > The main cases are: > > - for the boot case, splitting and merging existing memslots. QEMU likes to > merge adjacent memory regions into a single memslot, so if something goes > from read-write to read-only it has to be split and vice versa. I guess a > "don't merge this memory region" flag would be the less hideous way to solve > it in userspace. > > - however, there is also the case of resizing an existing memslot which is > what David would like to have for virtio-mem. This is not really fixable > because part of the appeal of virtio-mem is to have a single huge memslot > instead of many smaller ones, in order to reduce the granularity of > add/remove (David, correct me if I'm wrong). > > (The latter _would_ be needed by other VMMs). If we really want to provide a better experience for userspace, why not provide more primitives to address those specific use cases? E.g. fix KVM's historic wart of disallowing toggling of KVM_MEM_READONLY, and then add one or more ioctls to: - Merge 2+ memory regions that are contiguous in GPA and HVA - Split a memory region into 2+ memory regions - Truncate a memory region - Grow a memory region That would probably require more KVM code overall, but each operation would be more tightly bounded and thus simpler to define. And I think more precise APIs would provide other benefits, e.g. growing a region wouldn't require first deleting the current region, and so could avoid zapping SPTEs and destroying metadata. Merge, split, and truncate would have similar benefits, though they might be more difficult to realize in practice. What I really don't like about the "batch" API, aside from the "atomic" name, is that it's too open ended and creates conflicts. E.g. to allow "atomic" merging after converting from RO=>RW, KVM needs to allow DELETE=>DELETE=>CREATE, and that implies that each operation in the batch operates on the working state created by the previous operations. But if userspace provides a batch that does DELETE=>CREATE=>DELETE, naively hoisting all "invalidations" to the front half of the "atomic" operations breaks the ordering. And there's a ridiculous amount of potential weirdness with MOVE=>MOVE operations. KVM could solve those issues by disallowing MOVE and disallowing DELETE after CREATE or FLAGS_ONLY, but then userspace is effectively restricted to operating on a single contiguous chunk of memslots. At that point, it seems that providing primitives to do each macro operation is an even better experience for userspace. In other words, make memslots a bit more CISC ;-)