On Thu, Aug 18, 2022 at 09:55:20PM -0300, Leonardo Bras Soares Passos wrote:
> On Thu, Aug 18, 2022 at 5:05 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
> >
> > On Tue, Aug 16, 2022 at 06:12:50AM -0400, Emanuele Giuseppe Esposito wrote:
> > > +static void kvm_memory_region_node_add(KVMMemoryListener *kml,
> > > +                                       struct kvm_userspace_memory_region *mem)
> > > +{
> > > +    MemoryRegionNode *node;
> > > +
> > > +    node = g_malloc(sizeof(MemoryRegionNode));
> > > +    *node = (MemoryRegionNode) {
> > > +        .mem = mem,
> > > +    };
> >
> > Nit: direct assignment of the struct looks okay, but pointer assignment
> > may be clearer (with g_malloc0?  Or IIRC we're suggested to always use
> > g_new0):
> >
> >   node = g_new0(MemoryRegionNode, 1);
> >   node->mem = mem;
> >
> > [...]
> >
> > > +/* for KVM_SET_USER_MEMORY_REGION_LIST */
> > > +struct kvm_userspace_memory_region_list {
> > > +	__u32 nent;
> > > +	__u32 flags;
> > > +	struct kvm_userspace_memory_region entries[0];
> > > +};
> > > +
> > >  /*
> > >   * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
> > >   * other bits are reserved for kvm internal use which are defined in
> > > @@ -1426,6 +1433,8 @@ struct kvm_vfio_spapr_tce {
> > >  					struct kvm_userspace_memory_region)
> > >  #define KVM_SET_TSS_ADDR          _IO(KVMIO,   0x47)
> > >  #define KVM_SET_IDENTITY_MAP_ADDR _IOW(KVMIO,  0x48, __u64)
> > > +#define KVM_SET_USER_MEMORY_REGION_LIST _IOW(KVMIO, 0x49, \
> > > +					struct kvm_userspace_memory_region_list)
> >
> > I think this is probably good enough, but just to provide the other
> > small (and maybe unimportant) piece of the puzzle here.  I wanted to
> > think it through to understand better, but I never did..
> >
> > For a quick look, please read the comment in kvm_set_phys_mem().
> >
> >     /*
> >      * NOTE: We should be aware of the fact that here we're only
> >      * doing a best effort to sync dirty bits.  No matter whether
> >      * we're using dirty log or dirty ring, we ignored two facts:
> >      *
> >      * (1) dirty bits can reside in hardware buffers (PML)
> >      *
> >      * (2) after we collected dirty bits here, pages can be dirtied
> >      * again before we do the final KVM_SET_USER_MEMORY_REGION to
> >      * remove the slot.
> >      *
> >      * Not easy.  Let's cross the fingers until it's fixed.
> >      */
> >
> > One example: if we have 16G of mem, dirty tracking is enabled, and we
> > punch a 1G hole at offset 1G, the layout changes from this:
> >
> >                      (a)
> >   |----------------- 16G -------------------|
> >
> > To this:
> >
> >     (b)     (c)              (d)
> >   |--1G--|XXXXXX|------------14G------------|
> >
> > Here (c) will be a 1G hole.
> >
> > With the current code, the hole punching will delete region (a) and add
> > back regions (b) and (d).  After the new _LIST ioctl it'll be atomic
> > and nicer.
> >
> > The catch is that with dirty tracking enabled, each region has a dirty
> > bitmap.  Currently we make a best effort with the sequence below:
> >
> >   (1) fetch the dirty bitmap of (a)
> >   (2) delete region (a)
> >   (3) add regions (b) and (d)
> >
> > Here (a)'s dirty bitmap is mostly kept as a best effort, but we'll
> > still lose dirty pages written between steps (1) and (2) (and actually,
> > if a write comes between (2) and (3) I think it'll crash qemu, and IIUC
> > that's what we're going to fix..).
> >
> > So ideally the atomic op could be:
> >
> >   "atomically fetch dirty bitmaps for removed regions, remove regions,
> >    and add new regions"
> >
> > Rather than only:
> >
> >   "atomically remove regions, and add new regions"
> >
> > which is what the new _LIST ioctl does.
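
(To make that concrete: from userspace I'd picture the hole punch above
looking roughly like below with the new ioctl.  This is only a sketch
against the struct quoted above; I'm assuming an entry with
memory_size == 0 still means "delete this slot" as with the old ioctl,
and vm_fd, hva and the slot numbers are made up.)

    /* sketch; needs <stdlib.h>, <sys/ioctl.h> and the new linux/kvm.h */
    static int punch_hole_atomic(int vm_fd, __u64 hva)
    {
        struct kvm_userspace_memory_region_list *list;
        size_t sz = sizeof(*list) + 3 * sizeof(list->entries[0]);
        int ret;

        list = calloc(1, sz);
        list->nent = 3;
        /* entry 0: delete (a); memory_size is left as 0 */
        list->entries[0] = (struct kvm_userspace_memory_region) {
            .slot = 0,
        };
        /* entry 1: add (b), 1G at gpa 0 */
        list->entries[1] = (struct kvm_userspace_memory_region) {
            .slot = 1,
            .guest_phys_addr = 0,
            .memory_size = 1ULL << 30,
            .userspace_addr = hva,
        };
        /* entry 2: add (d), 14G starting at gpa 2G */
        list->entries[2] = (struct kvm_userspace_memory_region) {
            .slot = 2,
            .guest_phys_addr = 2ULL << 30,
            .memory_size = 14ULL << 30,
            .userspace_addr = hva + (2ULL << 30),
        };
        ret = ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION_LIST, list);
        free(list);
        return ret;
    }

Either the whole list applies or none of it, so a guest write can no
longer hit the window where (a) is gone but (b)/(d) aren't there yet;
the dirty bitmap fetch, however, is still a separate step before this
call.
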
> > But maybe that's not a real problem; at least I don't know of any
> > report showing an issue with the current code caused by losing dirty
> > bits between steps (1) and (2), nor do I know how to trigger one.
> >
> > I'm just providing this information so that you're aware of this
> > problem too.  In the meantime, when proposing the new ioctl change for
> > qemu, we should keep in mind not to easily lose the dirty bitmap of
> > (a) here, and I think this patch does the right thing.
>
> Thanks for bringing these details, Peter!
>
> What do you think of adding?
>
> (4) Copy the corresponding parts of (a)'s dirty bitmap to (b)'s and
>     (d)'s dirty bitmaps.

Sounds good to me, but it may not cover dirty ring?

Maybe we could move on with the simple but clean scheme first, and think
about a comprehensive option only if really necessary.  The worst case
is that we need one more kvm cap, but we should still have enough.

Thanks,

-- 
Peter Xu
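
PS: For (4), under dirty-log mode the copy itself could be as simple as
something like this (sketch only: bmap_a/bmap_b/bmap_d stand for the
per-slot bitmaps, 4K pages assumed so the 2G bit offset stays
word-aligned, bitmap_copy() as in qemu's include/qemu/bitmap.h):

    unsigned long pages_b = (1ULL << 30) >> 12;   /* (b) covers 1G    */
    unsigned long pages_d = (14ULL << 30) >> 12;  /* (d) covers 14G   */
    unsigned long off_d   = (2ULL << 30) >> 12;   /* (d) starts at 2G */

    /* (b) maps to the first 1G of (a)'s range */
    bitmap_copy(bmap_b, bmap_a, pages_b);
    /* (d) maps to (a)'s range starting at the 2G mark */
    bitmap_copy(bmap_d, bmap_a + off_d / BITS_PER_LONG, pages_d);

Dirty ring has no per-slot bitmap to copy from, though, hence my
question above.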