Re: [PATCH v6 3/8] KVM: Add support for using dirty ring in conjunction with bitmap

Gavin Shan <gshan@xxxxxxxxxx> · Sat, 22 Oct 2022 16:27:41 +0800




Hi Sean,

On 10/22/22 12:05 AM, Sean Christopherson wrote:
On Fri, Oct 21, 2022, Marc Zyngier wrote:
On Fri, 21 Oct 2022 00:44:51 +0100,
Sean Christopherson <seanjc@xxxxxxxxxx> wrote:

On Tue, Oct 11, 2022, Gavin Shan wrote:
Some architectures (such as arm64) need to dirty memory outside of the
context of a vCPU. Of course, this simply doesn't fit with the UAPI of
KVM's per-vCPU dirty ring.

What is the point of using the dirty ring in this case?  KVM still
burns a pile of memory for the bitmap.  Is the benefit that
userspace can get away with scanning the bitmap fewer times,
e.g. scan it once just before blackout under the assumption that
very few pages will dirty the bitmap?

Apparently, the throttling effect of the ring makes it easier to
converge. Someone who actually uses the feature should be able to
tell you. But that's a policy decision, and I don't see why we should
be prescriptive.

I wasn't suggesting we be prescriptive, it was an honest question.

Why not add a global ring to @kvm?  I assume thread safety is a
problem, but the memory overhead of the dirty_bitmap also seems like
a fairly big problem.

Because we already have a stupidly bloated API surface, and that we
could do without yet another one based on a sample of *one*?

But we're adding a new API regardless.  A per-VM ring would definitely be a bigger
addition, but if using the dirty_bitmap won't actually meet the needs of userspace,
then we'll have added a new API and still not have solved the problem.  That's why
I was asking why/when userspace would want to use dirty_ring+dirty_bitmap.


Bitmap can help to solve the issue, but the extra memory consumption due to
the bitmap is a concern, as you mentioned previously. More information about
the issue can be found here [1]. On ARM64, multiple guest's physical pages are
used by VGIC/ITS to store its states during migration or system shutdown.

[1] https://lore.kernel.org/kvmarm/320005d1-fe88-fd6a-be91-ddb56f1aa80f@xxxxxxxxxx/

Because dirtying memory outside of a vcpu context makes it incredibly awkward
to handle a "ring full" condition?

Kicking all vCPUs with the soft-full request isn't _that_ awkward.  It's certainly
sub-optimal, but if inserting into the per-VM ring is relatively rare, then in
practice it's unlikely to impact guest performance.


It's still possible the per-vcpu-ring becomes hard full before it can be
kicked off. per-vm-ring has other issues, one of which is synchronization
between kvm and userspace to avoid overrunning per-kvm-ring. bitmap was
selected due to its simplicity.

Introduce a new flavor of dirty ring that requires the use of both vCPU
dirty rings and a dirty bitmap. The expectation is that for non-vCPU
sources of dirty memory (such as the GIC ITS on arm64), KVM writes to
the dirty bitmap. Userspace should scan the dirty bitmap before
migrating the VM to the target.

Use an additional capability to advertize this behavior and require
explicit opt-in to avoid breaking the existing dirty ring ABI. And yes,
you can use this with your preferred flavor of DIRTY_RING[_ACQ_REL]. Do
not allow userspace to enable dirty ring if it hasn't also enabled the
ring && bitmap capability, as a VM is likely DOA without the pages
marked in the bitmap.

This is wrong. The *only* case this is useful is when there is an
in-kernel producer of data outside of the context of a vcpu, which is
so far only the ITS save mechanism. No ITS? No need for this.

How large is the ITS?  If it's a fixed, small size, could we treat the ITS as a
one-off case for now?  E.g. do something gross like shove entries into vcpu0's
dirty ring?


There are several VGIC/ITS tables involved in the issue. I checked the
specification and the implementation. As the device ID is 16-bits, so
the maximal devices can be 0x10000. Each device has its ITT (Interrupt
Translation Table), looked by a 32-bits event ID. The memory used for
ITT can be large enough in theory.

    Register       Description           Max-size   Entry-size  Max-entries
    -----------------------------------------------------------------------
    GITS_BASER0    ITS Device Table      512KB      8-bytes     0x10000
    GITS_BASER1    ITS Collection Table  512KB      8-bytes     0x10000
    GITS_BASER2    (GICv4) ITS VPE Table 512KB      8-bytes(?)  0x10000

    max-devices * (1UL << event_id_shift) * entry_size =
    0x10000 * (1UL << 32) * 8                          = 1PB

Userspace knows what it has created the first place, and should be in
charge of it (i.e. I want to be able to migrate my GICv2 and
GICv3-without-ITS VMs with the rings only).

Ah, so enabling the dirty bitmap isn't strictly required.  That means this patch
is wrong, and it also means that we need to figure out how we want to handle the
case where mark_page_dirty_in_slot() is invoked without a running vCPU on a memslot
without a dirty_bitmap.

I.e. what's an appropriate action in the below sequence:

void mark_page_dirty_in_slot(struct kvm *kvm,
			     const struct kvm_memory_slot *memslot,
		 	     gfn_t gfn)
{
	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();

#ifdef CONFIG_HAVE_KVM_DIRTY_RING
	if (WARN_ON_ONCE(vcpu && vcpu->kvm != kvm))
		return;

#ifndef CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP
	if (WARN_ON_ONCE(!vcpu))
		return;
#endif
#endif

	if (memslot && kvm_slot_dirty_track_enabled(memslot)) {
		unsigned long rel_gfn = gfn - memslot->base_gfn;
		u32 slot = (memslot->as_id << 16) | memslot->id;

		if (vcpu && kvm->dirty_ring_size)
			kvm_dirty_ring_push(&vcpu->dirty_ring,
					    slot, rel_gfn);
		else if (memslot->dirty_bitmap)
			set_bit_le(rel_gfn, memslot->dirty_bitmap);
		else
			???? <=================================================
	}
}


Would it be possible to require a dirty bitmap when an ITS is created?  That would
allow treating the above condition as a KVM bug.


According to the above calculation, it's impossible to determine the memory size for
the bitmap in advance. The memory used by ITE (Interrupt Translation Entry) tables
can be huge enough to use all guest's system memory in theory. ITE tables are scattered
in guest's system memory, but we don't know its location in advance. ITE tables are
created dynamically on requests from guest.

However, I think it's a good idea to enable the bitmap only when "arm-its-kvm" is
really used in userspace (QEMU). For example, the machine and (kvm) accelerator are
initialized like below. It's unknown if "arm-its-kvm" is used until (c). So we can
enable KVM_CAP_DIRTY_RING_WITH_BITMAP in (d) and the bitmap is created in (e) by
KVM.

  main
    qemu_init
      qemu_create_machine                   (a) machine instance is created
      configure_accelerators
        do_configure_accelerator
          accel_init_machine
            kvm_init                        (b) KVM is initialized
      :
      qmp_x_exit_preconfig
        qemu_init_board
          machine_run_board_init            (c) The board is initialized
      :
      accel_setup_post                      (d) KVM is post initialized
      :
      <migration>                           (e) Migration starts

In order to record if the bitmap is really needed, "struct kvm::dirty_ring_with_bitmap"
is still needed.

   - KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP is advertised when CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP
     is selected.

   - KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP is enabled in (d) only when "arm-its-kvm"
     is used in QEMU. After the capability is enabled, "struct kvm::dirty_ring_with_bitmap"
     is set to 1.

   - The bitmap is created by KVM in (e).

If the above analysis makes sense, I don't see there is anything missed from the patch
Of course, KVM_CAP_DIRTY_LOG_RING_{ACQ_REL, WITH_BITMAP} needs to be enabled separately
and don't depend on each other. the description added to "Documentation/virt/kvm/abi.rst"
need to be improved as Peter and Oliver suggested. kvm_dirty_ring_exclusive() needs to be
renamed to kvm_use_dirty_bitmap() and "#ifdef" needs to be cut down as Sean suggested.


@@ -4499,6 +4507,11 @@ static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u32 size)
  {
  	int r;
  
+#ifdef CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP
+	if (!kvm->dirty_ring_with_bitmap)
+		return -EINVAL;
+#endif

This one at least is prettier with IS_ENABLED

	if (IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP) &&
	    !kvm->dirty_ring_with_bitmap)
		return -EINVAL;

But dirty_ring_with_bitmap really shouldn't need to exist.  It's
mandatory for architectures that have
HAVE_KVM_DIRTY_RING_WITH_BITMAP, and unsupported for architectures
that don't.  In other words, the API for enabling the dirty ring is
a bit ugly.

Rather than add KVM_CAP_DIRTY_LOG_RING_ACQ_REL, which hasn't been
officially released yet, and then KVM_CAP_DIRTY_LOG_ING_WITH_BITMAP
on top, what about usurping bits 63:32 of cap->args[0] for flags?
E.g.

For posterity, filling in my missing idea...

Since the size is restricted to be well below a 32-bit value, and it's unlikely
that KVM will ever support 4GiB per-vCPU rings, we could usurp the upper bits for
flags:

   static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u64 arg0)
   {
	u32 flags = arg0 >> 32;
	u32 size = arg0;

However, since it sounds like enabling dirty_bitmap isn't strictly required, I
have no objection to enabling KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP, my objection
was purely that KVM was adding a per-VM flag just to sanity check the configuration.


If KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP is enabled for "arm-its-kvm", it'd better
to allow enabling those two capability (ACQ_REL and WITH_BITMAP) separately, as I
explained above. userspace (QEMU) will gain flexibility if these two capabilities
can be enabled separately.

To QEMU, KVM_CAP_DIRTY_LOG_RING and KVM_CAP_DIRTY_LOG_RING_ACQ_REL are accelerator's
properties. KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP is board's property. Relaxing their
dependency will give flexibility to QEMU.


[...]

Thanks,
Gavin