Re: [RFC PATCH 2/2] kvm/kvm-all.c: listener should delay kvm_vm_ioctl to the commit phase

Peter Xu <peterx@xxxxxxxxxx> · Thu, 18 Aug 2022 16:04:56 -0400

On Tue, Aug 16, 2022 at 06:12:50AM -0400, Emanuele Giuseppe Esposito wrote:
> +static void kvm_memory_region_node_add(KVMMemoryListener *kml,
> +                                       struct kvm_userspace_memory_region *mem)
> +{
> +    MemoryRegionNode *node;
> +
> +    node = g_malloc(sizeof(MemoryRegionNode));
> +    *node = (MemoryRegionNode) {
> +        .mem = mem,
> +    };

Nit: direct assignment of struct looks okay, but maybe pointer assignment
is clearer (with g_malloc0?  Or iirc we're suggested to always use g_new0):

  node = g_new0(MemoryRegionNode, 1);
  node->mem = mem;

[...]

> +/* for KVM_SET_USER_MEMORY_REGION_LIST */
> +struct kvm_userspace_memory_region_list {
> +	__u32 nent;
> +	__u32 flags;
> +	struct kvm_userspace_memory_region entries[0];
> +};
> +
>  /*
>   * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
>   * other bits are reserved for kvm internal use which are defined in
> @@ -1426,6 +1433,8 @@ struct kvm_vfio_spapr_tce {
>  					struct kvm_userspace_memory_region)
>  #define KVM_SET_TSS_ADDR          _IO(KVMIO,   0x47)
>  #define KVM_SET_IDENTITY_MAP_ADDR _IOW(KVMIO,  0x48, __u64)
> +#define KVM_SET_USER_MEMORY_REGION_LIST _IOW(KVMIO, 0x49, \
> +					struct kvm_userspace_memory_region_list)

I think this is probably good enough, but just to provide the other small
(but may not be important) piece of puzzle here.  I wanted to think through
to understand better but I never did..

For a quick look, please read the comment in kvm_set_phys_mem().

                /*
                 * NOTE: We should be aware of the fact that here we're only
                 * doing a best effort to sync dirty bits.  No matter whether
                 * we're using dirty log or dirty ring, we ignored two facts:
                 *
                 * (1) dirty bits can reside in hardware buffers (PML)
                 *
                 * (2) after we collected dirty bits here, pages can be dirtied
                 * again before we do the final KVM_SET_USER_MEMORY_REGION to
                 * remove the slot.
                 *
                 * Not easy.  Let's cross the fingers until it's fixed.
                 */

One example is if we have 16G mem, we enable dirty tracking and we punch a
hole of 1G at offset 1G, it'll change from this:

                     (a)
  |----------------- 16G -------------------|

To this:

     (b)    (c)              (d)
  |--1G--|XXXXXX|------------14G------------|

Here (c) will be a 1G hole.

With current code, the hole punching will del region (a) and add back
region (b) and (d).  After the new _LIST ioctl it'll be atomic and nicer.

Here the question is if we're with dirty tracking it means for each region
we have a dirty bitmap.  Currently we do the best effort of doing below
sequence:

  (1) fetching dirty bmap of (a)
  (2) delete region (a)
  (3) add region (b) (d)

Here (a)'s dirty bmap is mostly kept as best effort, but still we'll lose
dirty pages written between step (1) and (2) (and actually if the write
comes within (2) and (3) I think it'll crash qemu, and iiuc that's what
we're going to fix..).

So ideally the atomic op can be:

  "atomically fetch dirty bmap for removed regions, remove regions, and add
   new regions"

Rather than only:

  "atomically remove regions, and add new regions"

as what the new _LIST ioctl do.

But... maybe that's not a real problem, at least I didn't know any report
showing issue with current code yet caused by losing of dirty bits during
step (1) and (2).  Neither do I know how to trigger an issue with it.

I'm just trying to still provide this information so that you should be
aware of this problem too, at the meantime when proposing the new ioctl
change for qemu we should also keep in mind that we won't easily lose the
dirty bmap of (a) here, which I think this patch does the right thing.

Thanks!

--
Peter Xu