On 27.08.22 22:58, Peter Xu wrote:
> Hi, Emanuele,
>
> On Fri, Aug 26, 2022 at 04:07:01PM +0200, Emanuele Giuseppe Esposito wrote:
>>
>>
>> Am 22/08/2022 um 16:10 schrieb Peter Xu:
>>> On Thu, Aug 18, 2022 at 09:55:20PM -0300, Leonardo Bras Soares Passos wrote:
>>>> On Thu, Aug 18, 2022 at 5:05 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
>>>>>
>>>>> On Tue, Aug 16, 2022 at 06:12:50AM -0400, Emanuele Giuseppe Esposito wrote:
>>>>>> +static void kvm_memory_region_node_add(KVMMemoryListener *kml,
>>>>>> +                                       struct kvm_userspace_memory_region *mem)
>>>>>> +{
>>>>>> +    MemoryRegionNode *node;
>>>>>> +
>>>>>> +    node = g_malloc(sizeof(MemoryRegionNode));
>>>>>> +    *node = (MemoryRegionNode) {
>>>>>> +        .mem = mem,
>>>>>> +    };
>>>>>
>>>>> Nit: direct assignment of a struct looks okay, but maybe pointer
>>>>> assignment is clearer (with g_malloc0?  Or iirc we're suggested to
>>>>> always use g_new0):
>>>>>
>>>>>   node = g_new0(MemoryRegionNode, 1);
>>>>>   node->mem = mem;
>>>>>
>>>>> [...]
>>
>> Makes sense.
>>
>>>>>
>>>>>> +/* for KVM_SET_USER_MEMORY_REGION_LIST */
>>>>>> +struct kvm_userspace_memory_region_list {
>>>>>> +    __u32 nent;
>>>>>> +    __u32 flags;
>>>>>> +    struct kvm_userspace_memory_region entries[0];
>>>>>> +};
>>>>>> +
>>>>>>  /*
>>>>>>   * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
>>>>>>   * other bits are reserved for kvm internal use which are defined in
>>>>>> @@ -1426,6 +1433,8 @@ struct kvm_vfio_spapr_tce {
>>>>>>                                          struct kvm_userspace_memory_region)
>>>>>>  #define KVM_SET_TSS_ADDR          _IO(KVMIO,   0x47)
>>>>>>  #define KVM_SET_IDENTITY_MAP_ADDR _IOW(KVMIO,  0x48, __u64)
>>>>>> +#define KVM_SET_USER_MEMORY_REGION_LIST _IOW(KVMIO, 0x49, \
>>>>>> +                                        struct kvm_userspace_memory_region_list)
>>>>>
>>>>> I think this is probably good enough, but just to provide the other
>>>>> small (but maybe not important) piece of the puzzle here.  I wanted to
>>>>> think it through to understand better, but I never did..
>>>>>
>>>>> For a quick look, please read the comment in kvm_set_phys_mem().
>>>>>
>>>>>     /*
>>>>>      * NOTE: We should be aware of the fact that here we're only
>>>>>      * doing a best effort to sync dirty bits.  No matter whether
>>>>>      * we're using dirty log or dirty ring, we ignored two facts:
>>>>>      *
>>>>>      * (1) dirty bits can reside in hardware buffers (PML)
>>>>>      *
>>>>>      * (2) after we collected dirty bits here, pages can be dirtied
>>>>>      * again before we do the final KVM_SET_USER_MEMORY_REGION to
>>>>>      * remove the slot.
>>>>>      *
>>>>>      * Not easy.  Let's cross the fingers until it's fixed.
>>>>>      */
>>>>>
>>>>> One example: if we have 16G of mem, we enable dirty tracking and we
>>>>> punch a hole of 1G at offset 1G, it'll change from this:
>>>>>
>>>>>                      (a)
>>>>>   |----------------- 16G -------------------|
>>>>>
>>>>> to this:
>>>>>
>>>>>     (b)    (c)              (d)
>>>>>   |--1G--|XXXXXX|------------14G------------|
>>>>>
>>>>> Here (c) will be a 1G hole.
>>>>>
>>>>> With the current code, the hole punching will delete region (a) and
>>>>> add back regions (b) and (d).  After the new _LIST ioctl it'll be
>>>>> atomic and nicer.
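>>>>> For concreteness, the transition above could then become one single
>>>>> call along these lines (untested sketch; vm_fd, the hva base and the
>>>>> slot numbers are made up for illustration, headers are omitted, slot
>>>>> 0 is assumed to hold (a), and it assumes the _LIST layout quoted
>>>>> above):
>>>>>
>>>>>   /* delete (a) and add back (b)+(d) in one atomic go */
>>>>>   static int punch_hole_atomically(int vm_fd, __u64 hva)
>>>>>   {
>>>>>       struct kvm_userspace_memory_region_list *list;
>>>>>       size_t sz = sizeof(*list) + 3 * sizeof(list->entries[0]);
>>>>>       int ret;
>>>>>
>>>>>       list = calloc(1, sz);
>>>>>       list->nent = 3;
>>>>>       /* delete (a): memory_size == 0 means DELETE */
>>>>>       list->entries[0] = (struct kvm_userspace_memory_region) {
>>>>>           .slot = 0,
>>>>>       };
>>>>>       /* add back (b): the 1G below the hole, with a fresh slot id */
>>>>>       list->entries[1] = (struct kvm_userspace_memory_region) {
>>>>>           .slot = 1, .guest_phys_addr = 0,
>>>>>           .memory_size = 1ULL << 30, .userspace_addr = hva,
>>>>>       };
>>>>>       /* add back (d): the 14G above the hole ending at 2G */
>>>>>       list->entries[2] = (struct kvm_userspace_memory_region) {
>>>>>           .slot = 2, .guest_phys_addr = 2ULL << 30,
>>>>>           .memory_size = 14ULL << 30,
>>>>>           .userspace_addr = hva + (2ULL << 30),
>>>>>       };
>>>>>       ret = ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION_LIST, list);
>>>>>       free(list);
>>>>>       return ret;
>>>>>   }
>>>>>
>>>>> Giving (b) and (d) fresh slot ids avoids reusing the id of the slot
>>>>> being deleted within the same batch.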
>>>>> Here the question is that with dirty tracking enabled, each region
>>>>> has a dirty bitmap.  Currently we do the best effort with the below
>>>>> sequence:
>>>>>
>>>>>   (1) fetch dirty bmap of (a)
>>>>>   (2) delete region (a)
>>>>>   (3) add regions (b) (d)
>>>>>
>>>>> Here (a)'s dirty bmap is mostly kept as best effort, but we'll still
>>>>> lose dirty pages written between steps (1) and (2) (and actually if
>>>>> the write comes between (2) and (3) I think it'll crash qemu, and
>>>>> iiuc that's what we're going to fix..).
>>>>>
>>>>> So ideally the atomic op could be:
>>>>>
>>>>>   "atomically fetch dirty bmap for removed regions, remove regions,
>>>>>    and add new regions"
>>>>>
>>>>> rather than only:
>>>>>
>>>>>   "atomically remove regions, and add new regions"
>>>>>
>>>>> which is what the new _LIST ioctl does.
>>>>>
>>>>> But... maybe that's not a real problem; at least I don't know of any
>>>>> report showing an issue with the current code caused by losing dirty
>>>>> bits between steps (1) and (2).  Neither do I know how to trigger
>>>>> such an issue.
>>>>>
>>>>> I'm just providing this information so that you're aware of this
>>>>> problem too.  Meanwhile, when proposing the new ioctl change for qemu
>>>>> we should also keep in mind that we shouldn't easily lose the dirty
>>>>> bmap of (a) here, and I think this patch does the right thing there.
>>>>
>>>> Thanks for bringing up these details, Peter!
>>>>
>>>> What do you think of adding:
>>>>
>>>>   (4) copy the corresponding parts of (a)'s dirty bitmap to (b)'s and
>>>>       (d)'s dirty bitmaps
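>>>> Just to show the idea, something like this on the QEMU side (sketch
>>>> only; it assumes the split points are page aligned, that the old
>>>> KVMSlot for (a) and the new ones for (b)/(d) are all at hand, and
>>>> iirc we already have bitmap_copy_with_src_offset() in util/bitmap.c):
>>>>
>>>>     /* (b) covers the first 1G of (a), so it starts at bit 0 */
>>>>     bitmap_copy_with_src_offset(b->dirty_bmap, a->dirty_bmap, 0,
>>>>                                 (1ULL << 30) >> TARGET_PAGE_BITS);
>>>>     /* (d) covers (a) from 2G upwards */
>>>>     bitmap_copy_with_src_offset(d->dirty_bmap, a->dirty_bmap,
>>>>                                 (2ULL << 30) >> TARGET_PAGE_BITS,
>>>>                                 (14ULL << 30) >> TARGET_PAGE_BITS);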
>>> Sounds good to me, but it may not cover dirty ring?  Maybe we could
>>> move on with the simple but clean scheme first and think about a
>>> comprehensive option only if really necessary.  The worst case is we
>>> need one more kvm cap, but we should still have enough.
>>
>> Ok then, I will not consider this for now.
>>
>> Might or might not be relevant, but I was also considering
>> pre-processing the list of memslots passed to the ioctl and merging
>> operations where possible, to avoid unnecessary work.
>>
>> For example, if we are creating an area and then punching a hole in it
>> (assuming this is a valid example), we would have:
>>
>>   transaction_begin()
>>   CREATE(offset=0, memory area)
>>   DELETE(memory area)
>>   CREATE(offset=0, memory area / 2 - 1)
>>   CREATE(offset=memory_area/2, memory area / 2)
>>   transaction_commit()
>>
>> Instead, if we pre-process the memory regions and detect overlaps with
>> an interval tree, we could simplify the above to:
>>
>>   CREATE(offset=0, memory area / 2 - 1)
>>   CREATE(offset=memory_area/2, memory area / 2)
>
> As I replied in the private email, I don't think the pre-processing here
> is needed, because afaict flat views already handle that.
>
> See generate_memory_topology() and especially render_memory_region().
>
> In the above example, the 1st CREATE + DELETE shouldn't reach any of the
> memory listeners, including the KVM one, because the flatview only
> contains the final layout of the address space when commit() happens.
>
>>
>> In addition, I was thinking of also providing the operation type (the
>> enum kvm_mr_change) from userspace directly, rather than letting KVM
>> "figure" it out itself.
>>
>> The reason for these two changes comes from the KVM implementation.  I
>> know this is not the place for KVM internals, but a background
>> explanation might be necessary.
>> Basically KVM 1) figures out the request type by looking at the fields
>> passed by userspace (for example, mem_size == 0 means DELETE) and at
>> the current status of the memslot (id not found means CREATE, id found
>> means MOVE/UPDATE_FLAGS), and 2) keeps 2 memslot lists per address
>> space: the active (visible) one and the inactive one.  Any operation is
>> performed on the inactive list, then we "swap" the two so that the
>> change becomes visible.
>>
>> The "atomic" goal of this series just means that we want to apply
>> multiple memslot operations and then perform a single "swap".
>> The problem is that each DELETE and MOVE request performs 2 swaps:
>> first substitute the current memslot with an invalid one (so that page
>> faults are retried), and then remove the invalid one (or, in case of
>> MOVE, substitute it with the new memslot).
>>
>> Therefore:
>> - KVM should ideally pre-process all DELETE/MOVE memslots and perform a
>>   first swap (or swaps) to replace the current memslots with invalid
>>   ones, and then perform all other operations in order (deletion/move
>>   included).
>
> This sounds correct.
>
>> This is why QEMU should just send pre-merged MRs, so that we don't get
>> CREATE(x) - DELETE(x).  Otherwise we would process a DELETE on an MR
>> that doesn't exist yet.
>
> As discussed above, IMHO the pre-merge isn't an issue here and I think
> the sequence in your example won't happen.  But you raised a great
> question on whether we should allow the new ioctl (if there's going to
> be one) to do whatever it wants by batching any sequence of memslot
> operations.
>
> More below.
>
>>
>> - Doing the above is still not enough, as KVM figures out what
>>   operation to perform based on the current state of the memslots.
>>   Assume we have an already existing MR y, and now we get the list
>>   DELETE(y) CREATE(y/2) (i.e. reducing y to half its size).
>>   In this case the interval tree can't do anything, and it's very hard
>>   to see that the second request in the list is a CREATE, because when
>>   KVM performs the check to figure out which type of operation it is
>>   (before doing any modification at all to the memslot lists), it finds
>>   that y (a memslot with the same id) exists, but with a different size
>>   than what userspace provided, so it can either fail or misinterpret
>>   it as another operation (currently it fails with -EINVAL).
>
> Another good question..  I think that can be partly solved by not
> allowing ID duplication in the same batched ioctl, or maybe we can just
> fail in that case.  From the qemu perspective, we may need to change the
> memslot id allocation to not reuse IDs of being-deleted memslots until
> the batch is processed.
>
> Something like adding similar INVALID tags to kvm memslots in QEMU when
> we are preparing the batch; then we should only reset memory_size == 0
> and clear the INVALID tag after the ioctl returns.  During the batch,
> any new slot to be added via kvm_get_free_slot() will then never
> duplicate a slot that is being deleted.
>
>> If we instead already provide the labels, then we can (rough sketch
>> below):
>> 1. substitute the memslots pointed to by DELETE/MOVE with invalid ones
>>    and swap (so it is visible and non-atomic, but we don't care, as we
>>    are not deleting anything yet)
>> 2. delete the invalid memslots (in the inactive memslot list)
>> 3. process the other requests (in the inactive memslot list)
>> 4. a single, atomic swap (steps 2 and 3 now become visible)
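>> Roughly, assuming the entries carry a userspace-provided change label
>> (entry_change(), kvm_invalidate_entry() and kvm_prepare_entry() below
>> are made-up helpers; kvm_swap_active_memslots() is the existing one in
>> kvm_main.c iirc, and error handling plus the as_id plumbing are
>> omitted):
>>
>>     static int kvm_set_memory_region_list(struct kvm *kvm,
>>                     struct kvm_userspace_memory_region_list *list)
>>     {
>>         int as_id = 0;  /* assume one address space in this sketch */
>>         u32 i;
>>
>>         /* step 1: swap in an invalid slot for every DELETE/MOVE
>>          * target; visible, but nothing is deleted yet */
>>         for (i = 0; i < list->nent; i++) {
>>             enum kvm_mr_change change = entry_change(list, i);
>>
>>             if (change == KVM_MR_DELETE || change == KVM_MR_MOVE)
>>                 kvm_invalidate_entry(kvm, &list->entries[i]);
>>         }
>>
>>         /* steps 2+3: apply all requests, DELETE/MOVE included, on
>>          * the inactive memslot list only */
>>         for (i = 0; i < list->nent; i++)
>>             kvm_prepare_entry(kvm, &list->entries[i]);
>>
>>         /* step 4: one single swap publishes everything atomically */
>>         kvm_swap_active_memslots(kvm, as_id);
>>         return 0;
>>     }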
>>
>> What do you think?
>
> Adding some limitations to the new ioctl makes sense to me, especially
> moving the DELETE/MOVE handling earlier, rather than allowing a complete
> mixture of all the ops.
>
> I'm wondering whether the batch ioctl can be laid out separately by
> operation; then the sequence in which each op is handled can be clearly
> documented:
>
>   struct kvm_userspace_memory_region_batch {
>       struct kvm_userspace_memory_region deletes[];
>       struct kvm_userspace_memory_region moves[];
>       struct kvm_userspace_memory_region creates[];
>       struct kvm_userspace_memory_region flags_only[];
>   };
>
> So that the ordering is very obvious.  But I didn't really think any
> deeper than that.
>
> Side note: do we use MOVE at all in QEMU?
>
>>
>> Bonus question: with this atomic operation, do we really need the
>> invalid memslot logic for MOVE/DELETE?
>
> I think so.  Not an expert on that, but iiuc that's to make sure we zap
> all the shadow page tables that map to the slots being deleted/moved.
>
> Paolo can always help to keep me honest on the above.
>
> One thing I forgot to ask: iirc we used to have a workaround to kick all
> vcpus out, update the memory slots, then continue all vcpus.  Would that
> work for the problem you're working on too?

As a reference, here is one such approach, for region resizes only:

https://lore.kernel.org/qemu-devel/20200312161217.3590-1-david@xxxxxxxxxx/

which notes:

"Instead of inhibiting during the region_resize(), we could inhibit for
the whole memory transaction (from begin() to commit()).  This could be
nice, because also splitting of memory regions would be atomic (I
remember there was a BUG report regarding that), however, I am not sure
if that might impact any RT users."

-- 
Thanks,

David / dhildenb