Re: [RFC PATCH 2/2] kvm/kvm-all.c: listener should delay kvm_vm_ioctl to the commit phase

Peter Xu <peterx@xxxxxxxxxx> · Sat, 27 Aug 2022 16:58:39 -0400

Hi, Emanuele,

On Fri, Aug 26, 2022 at 04:07:01PM +0200, Emanuele Giuseppe Esposito wrote:
> 
> 
> Am 22/08/2022 um 16:10 schrieb Peter Xu:
> > On Thu, Aug 18, 2022 at 09:55:20PM -0300, Leonardo Bras Soares Passos wrote:
> >> On Thu, Aug 18, 2022 at 5:05 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
> >>>
> >>> On Tue, Aug 16, 2022 at 06:12:50AM -0400, Emanuele Giuseppe Esposito wrote:
> >>>> +static void kvm_memory_region_node_add(KVMMemoryListener *kml,
> >>>> +                                       struct kvm_userspace_memory_region *mem)
> >>>> +{
> >>>> +    MemoryRegionNode *node;
> >>>> +
> >>>> +    node = g_malloc(sizeof(MemoryRegionNode));
> >>>> +    *node = (MemoryRegionNode) {
> >>>> +        .mem = mem,
> >>>> +    };
> >>>
> >>> Nit: direct assignment of struct looks okay, but maybe pointer assignment
> >>> is clearer (with g_malloc0?  Or iirc we're suggested to always use g_new0):
> >>>
> >>>   node = g_new0(MemoryRegionNode, 1);
> >>>   node->mem = mem;
> >>>
> >>> [...]
> 
> Makes sense
> 
> >>>
> >>>> +/* for KVM_SET_USER_MEMORY_REGION_LIST */
> >>>> +struct kvm_userspace_memory_region_list {
> >>>> +     __u32 nent;
> >>>> +     __u32 flags;
> >>>> +     struct kvm_userspace_memory_region entries[0];
> >>>> +};
> >>>> +
> >>>>  /*
> >>>>   * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
> >>>>   * other bits are reserved for kvm internal use which are defined in
> >>>> @@ -1426,6 +1433,8 @@ struct kvm_vfio_spapr_tce {
> >>>>                                       struct kvm_userspace_memory_region)
> >>>>  #define KVM_SET_TSS_ADDR          _IO(KVMIO,   0x47)
> >>>>  #define KVM_SET_IDENTITY_MAP_ADDR _IOW(KVMIO,  0x48, __u64)
> >>>> +#define KVM_SET_USER_MEMORY_REGION_LIST _IOW(KVMIO, 0x49, \
> >>>> +                                     struct kvm_userspace_memory_region_list)
> >>>
> >>> I think this is probably good enough, but just to provide the other small
> >>> (but may not be important) piece of puzzle here.  I wanted to think through
> >>> to understand better but I never did..
> >>>
> >>> For a quick look, please read the comment in kvm_set_phys_mem().
> >>>
> >>>                 /*
> >>>                  * NOTE: We should be aware of the fact that here we're only
> >>>                  * doing a best effort to sync dirty bits.  No matter whether
> >>>                  * we're using dirty log or dirty ring, we ignored two facts:
> >>>                  *
> >>>                  * (1) dirty bits can reside in hardware buffers (PML)
> >>>                  *
> >>>                  * (2) after we collected dirty bits here, pages can be dirtied
> >>>                  * again before we do the final KVM_SET_USER_MEMORY_REGION to
> >>>                  * remove the slot.
> >>>                  *
> >>>                  * Not easy.  Let's cross the fingers until it's fixed.
> >>>                  */
> >>>
> >>> One example is if we have 16G mem, we enable dirty tracking and we punch a
> >>> hole of 1G at offset 1G, it'll change from this:
> >>>
> >>>                      (a)
> >>>   |----------------- 16G -------------------|
> >>>
> >>> To this:
> >>>
> >>>      (b)    (c)              (d)
> >>>   |--1G--|XXXXXX|------------14G------------|
> >>>
> >>> Here (c) will be a 1G hole.
> >>>
> >>> With current code, the hole punching will del region (a) and add back
> >>> region (b) and (d).  After the new _LIST ioctl it'll be atomic and nicer.
> >>>
> >>> Here the question is if we're with dirty tracking it means for each region
> >>> we have a dirty bitmap.  Currently we do the best effort of doing below
> >>> sequence:
> >>>
> >>>   (1) fetching dirty bmap of (a)
> >>>   (2) delete region (a)
> >>>   (3) add region (b) (d)
> >>>
> >>> Here (a)'s dirty bmap is mostly kept as best effort, but still we'll lose
> >>> dirty pages written between step (1) and (2) (and actually if the write
> >>> comes within (2) and (3) I think it'll crash qemu, and iiuc that's what
> >>> we're going to fix..).
> >>>
> >>> So ideally the atomic op can be:
> >>>
> >>>   "atomically fetch dirty bmap for removed regions, remove regions, and add
> >>>    new regions"
> >>>
> >>> Rather than only:
> >>>
> >>>   "atomically remove regions, and add new regions"
> >>>
> >>> as what the new _LIST ioctl do.
> >>>
> >>> But... maybe that's not a real problem, at least I didn't know any report
> >>> showing issue with current code yet caused by losing of dirty bits during
> >>> step (1) and (2).  Neither do I know how to trigger an issue with it.
> >>>
> >>> I'm just trying to still provide this information so that you should be
> >>> aware of this problem too, at the meantime when proposing the new ioctl
> >>> change for qemu we should also keep in mind that we won't easily lose the
> >>> dirty bmap of (a) here, which I think this patch does the right thing.
> >>>
> >>
> >> Thanks for bringing these details Peter!
> >>
> >> What do you think of adding?
> >> (4) Copy the corresponding part of (a)'s dirty bitmap to (b) and (d)'s
> >> dirty bitmaps.
> > 
> > Sounds good to me, but may not cover dirty ring?  Maybe we could move on
> > with the simple but clean scheme first and think about a comprehensive
> > option only if very necessary.  The worst case is we need one more kvm cap
> > but we should still have enough.
> 
> Ok then, I will not consider this for now.
> 
> Might or might not be relevant, but I was also considering to
> pre-process the list of memslots passed to the ioctl and merge
> operations when necessary, to avoid unnecessary operations.
> 
> For example, if we are creating an area and punching a hole (assuming
> this is a valid example), we would have
> 
> transaction_begin()
> CREATE(offset=0, memory area)
> DELETE(memory area)
> CREATE(offset=0, memory area / 2 - 1)
> CREATE(offset=memory_area/2, memory area / 2)
> transaction_commmit()
> 
> Instead, if we pre-process the memory regions and detect overlaps with
> an interval tree, we could simplify the above with:
> CREATE(offset=0, memory area / 2 - 1)
> CREATE(offset=memory_area/2, memory area / 2)

As I replied in the private email, I don't think the pre-process here is
needed, because afaict flat views already handle that.

See generate_memory_topology() and especially render_memory_region().

In above example, the 1st CREATE + DELETE shouldn't reach any of the memory
listners, including the KVM one, because the flatview only contains the
final layout of the address space when commit() happens.

> 
> In addition, I was thinking to also provide the operation type (called
> enum kvm_mr_change) from userspace directly, and not "figure" it
> ourselves in KVM.
> 
> The reason for these two changes come from KVM implementation. I know
> this is no KVM place, but a background explanation might be necessary.
> Basically KVM 1) figures the request type by looking at the fields
> passed by userspace (for example mem_size == 0 means DELETE) and the
> current status of the memslot (id not found means CREATE, found means
> MOVE/UPDATE_FLAGS), and 2) has 2 memslot lists per address space: the
> active (visible) and inactive. Any operation is performed in the
> inactive list, then we "swap" the two so that the change is visible.
> 
> The "atomic" goal of this serie just means that we want to apply
> multiple memslot operation and then perform a single "swap".
> The problem is that each DELETE and MOVE request perform 2 swaps: first
> substitute current memslot with an invalid one (so that page faults are
> retried), and then remove the invalid one (in case of MOVE, substitute
> with new memslot).
> 
> Therefore:
> - KVM should ideally pre-process all DELETE/MOVE memslots and perform a
> first swap(s) to replace the current memslot with an invalid one, and
> then perform all other operations in order (deletion/move included).

This sounds correct.

> This is why QEMU should just send pre-merged MR, so that we don't have
> CREATE(x)- DELETE(x). Otherwise we would process a DELETE on a MR that
> doesn't exist yet.

As discussed above, IMHO pre-merge isn't an issue here and I think the
sequence in your example won't happen.  But you raised a great question on
whether we should allow the new ioctl (if there's going to be one) to do
whatever it wants by batching any sequence of memslot operations.

More below.

> 
> - Doing the above is still not enough, as KVM figures what operation to
> do depending on the current state of the memslots.
> Assuming we already have an already existing MR y, and now we get the
> list DELETE(y) CREATE(y/2) (ie reducing y to half its size).
> In this case the interval tree can't do anything, but it's very hard to
> understand that the second request in the list is a CREATE, because when
> KVM performs the check to understand which type of operation it is
> (before doing any modification at all in the memslot list), it finds
> that y (memslot with same id) exist, but has a different size than what
> provided from userspace, therefore it could either fail, or misinterpret
> it as another operation (currently fails -EINVALID).

Another good question..  I think that can be partly solved by not allowing
ID duplication in the same batched ioctl, or maybe we can fail it.  From
qemu perspective, we may need to change the memslot id allocation to not
reuse IDs of being-deleted memslots until the batch is processed.

Something like adding similar INVALID tags to kvm memslots in QEMU when we
are preparing the batch, then we should only reset memory_size==0 and clear
INVALID tag after the ioctl returns.  Then during the batch, any new slots
to be added from kvm_get_free_slot() will not return any duplication of a
deleting one.

> If we instead already provide the labels, then we can:
> 	1. substitute the memslots pointed by DELETE/MOVE with invalid & swap
> (so it is visible, non-atomic. But we don't care, as we are not deleting
> anything)
> 	2. delete the invalid memslot (in the inactive memslot list)
> 	3. process the other requests (in the inactive memslot list)
> 	4. single and atomic swap (step 2 and 3 are now visible).
> 
> What do you think?

Adding some limitation to the new ioctl makes sense to me, especially
moving the DELETE/MOVE to be handled earlier, rather than a complete
mixture of all of them.

I'm wondering whether the batch ioctl can be layed out separately on the
operations, then it can be clearly documented on the sequence of handling
each op:

  struct kvm_userspace_memory_region_batch {
         struct kvm_userspace_memory_region deletes[];
         struct kvm_userspace_memory_region moves[];
         struct kvm_userspace_memory_region creates[];
         struct kvm_userspace_memory_region flags_only[];
  };

So that the ordering can be very obvious.  But I didn't really think deeper
than that.

Side note: do we use MOVE at all in QEMU?

> 
> Bonus question: with this atomic operation, do we really need the
> invalid memslot logic for MOVE/DELETE?

I think so.  Not an expert on that, but iiuc that's to make sure we'll zap
all shadow paging that maps to the slots being deleted/moved.

Paolo can always help to keep me honest above.

One thing I forgot to ask: iirc we used to have a workaround to kick all
vcpus out, update memory slots, then continue all vcpus.  Would that work
for us too for the problem you're working on?

Thanks,

-- 
Peter Xu