Am 13/10/2022 um 10:44 schrieb David Hildenbrand: > On 13.10.22 09:43, Emanuele Giuseppe Esposito wrote: >> >> >> Am 29/09/2022 um 23:39 schrieb Sean Christopherson: >>> If we really want to provide a better experience for userspace, why >>> not provide >>> more primitives to address those specific use cases? E.g. fix KVM's >>> historic wart >>> of disallowing toggling of KVM_MEM_READONLY, and then add one or more >>> ioctls to: >>> >>> - Merge 2+ memory regions that are contiguous in GPA and HVA >>> - Split a memory region into 2+ memory regions >>> - Truncate a memory region >>> - Grow a memory region >> >> I looked again at the code and specifically the use case that triggers >> the crash in bugzilla. I *think* (correct me if I am wrong), that the >> only operation that QEMU performs right now is "grow/shrink". > > I remember that there were BUG reports where we'd actually split and run > into that problem. Just don't have them at hand. I think they happened > during early boot when the OS re-configured some PCI thingies. If you could point me where this is happening, it would be nice. So far I could not find or see any split/merge operation. > >> >> So *if* we want to go this way, we could start with a simple grow/shrink >> API. >> >> Even though we need to consider that this could bring additional >> complexity in QEMU. Currently, DELETE+CREATE (grow/shrink) is not >> performed one after the other (in my innocent fantasy I was expecting to >> find 2 subsequent ioctls in the code), but in QEMU's >> address_space_set_flatview(), which seems to first remove all regions >> and then add them when changing flatviews. >> >> address_space_update_topology_pass(as, old_view2, new_view, >> adding=false); >> address_space_update_topology_pass(as, old_view2, new_view, adding=true); >> >> I don't think we can change this, as other listeners also rely on such >> ordering, but we can still batch all callback requests like I currently >> do and process them in kvm_commit(), figuring there which operation is >> which. >> >> In other words, we would have something like: >> >> region_del() --> DELETE memslot X -> add it to the queue of operations >> region_del() --> DELETE memslot Y -> add it to the queue of operations >> region_add() --> CREATE memslot X (size doubled) -> add it to the queue >> of operations >> region_add() --> CREATE memslot Y (size halved) -> add it to the queue >> of operations >> ... >> commit() --> scan QUEUE and figure what to do -> GROW X (+size), SHRINK >> Y (-size) -> issue 2 ioctls, GROW and SHRINK. > > I communicated resizes (region_resize()) to the notifier in patch #3 of > https://lore.kernel.org/qemu-devel/20200312161217.3590-1-david@xxxxxxxxxx/ > > You could use that independently of the remainder. It should be less > controversial ;) > > But I think it only tackles part of the more generic problem (split/merge). Yes, very useful! Using that patch we would have a single place where to issue grow/shrink ioctls. I don't think we need to inhibit ioctls, since the operation will be atomic (this time in the true meaning, since we don't need INVALIDATE. > >> >>> That would probably require more KVM code overall, but each operation >>> would be more >>> tightly bounded and thus simpler to define. And I think more precise >>> APIs would >>> provide other benefits, e.g. growing a region wouldn't require first >>> deleting the >>> current region, and so could avoid zapping SPTEs and destroying >>> metadata. Merge, >>> split, and truncate would have similar benefits, though they might be >>> more >>> difficult to realize in practice. >> >> So essentially grow would not require INVALIDATE. Makes sense, but would >> it work also with shrink? I guess so, as the memslot is still present >> (but shrinked) right? >> >> Paolo, would you be ok with this smaller API? Probably just starting >> with grow and shrink first. >> >> I am not against any of the two approaches: >> - my approach has the disadvantage that the list could be arbitrarily >> long, and it is difficult to rollback the intermediate changes if >> something breaks during the request processing (but could be simplified >> by making kvm exit or crash). >> >> - Sean approach could potentially provide more burden to the userspace, >> as we need to figure which operation is which. Also from my >> understanding split and merge won't be really straightforward to >> implement, especially in userspace. >> >> David, any concern from userspace prospective on this "CISC" approach? > > In contrast to resizes in QEMU that only affect a single memory > region/slot, splitting/merging is harder to factor out and communicate > to a notifier. As an alternative, we could handle it in the commit stage > in the notifier itself, similar to what my prototype does, and figure > out what needs to be done in there and how to call the proper KVM > interface (and which interface to call). > > With virtio-mem (in the future) we might see merges of 2 slots into a > single one, by closing a gap in-between them. In "theory" we could > combine some updates into a single transaction. But it won't be 100s ... > > I think I'd prefer an API that doesn't require excessive ad-hoc > extensions later once something in QEMU changes. > > > I think in essence, all we care about is performing atomic changes that > *have to be atomic*, because something we add during a transaction > overlaps with something we remove during a transaction. Not necessarily > all updates in a transaction! > > My prototype essentially detects that scenario, but doesn't call new KVM > interfaces to deal with these. With "prototype" I assume you mean the patch linked above (region_resize), not the userspace-only proposal you sent initially right? > > I assume once we take that into consideration, we can mostly assume that > any such atomic updates (only of the regions that really have to be part > of an atomic update) won't involve more than a handful of memory > regions. We could add a sane KVM API limit. > > And if we run into that API limit in QEMU, we can print a warning and do > it non-atomically. > If we implement single operations (split/merge/grow/shrink), we don't even need that limit. Except for merge, maybe. Ok, if it'ok for you all I can try to use David patch and implement some simple grow/shrink. Then we need to figure where and when exactly QEMU performs split and merge operations, and maybe implement something similar to what you did in your proposal? Thank you, Emanuele