Re: [PATCH v6 3/8] KVM: Add support for using dirty ring in conjunction with bitmap

On Sat, 22 Oct 2022 09:27:41 +0100,
Gavin Shan <gshan@xxxxxxxxxx> wrote:
> 
> Hi Sean,
> 
> On 10/22/22 12:05 AM, Sean Christopherson wrote:
> > On Fri, Oct 21, 2022, Marc Zyngier wrote:
> >> On Fri, 21 Oct 2022 00:44:51 +0100,
> >> Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> >>> 
> >>> On Tue, Oct 11, 2022, Gavin Shan wrote:
> >>>> Some architectures (such as arm64) need to dirty memory outside of the
> >>>> context of a vCPU. Of course, this simply doesn't fit with the UAPI of
> >>>> KVM's per-vCPU dirty ring.
> >>> 
> >>> What is the point of using the dirty ring in this case?  KVM still
> >>> burns a pile of memory for the bitmap.  Is the benefit that
> >>> userspace can get away with scanning the bitmap fewer times,
> >>> e.g. scan it once just before blackout under the assumption that
> >>> very few pages will dirty the bitmap?
> >> 
> >> Apparently, the throttling effect of the ring makes it easier to
> >> converge. Someone who actually uses the feature should be able to
> >> tell you. But that's a policy decision, and I don't see why we should
> >> be prescriptive.
> > 
> > I wasn't suggesting we be prescriptive, it was an honest question.
> > 
> >>> Why not add a global ring to @kvm?  I assume thread safety is a
> >>> problem, but the memory overhead of the dirty_bitmap also seems like
> >>> a fairly big problem.
> >> 
> >> Because we already have a stupidly bloated API surface, and that we
> >> could do without yet another one based on a sample of *one*?
> > 
> > But we're adding a new API regardless.  A per-VM ring would definitely be a bigger
> > addition, but if using the dirty_bitmap won't actually meet the needs of userspace,
> > then we'll have added a new API and still not have solved the problem.  That's why
> > I was asking why/when userspace would want to use dirty_ring+dirty_bitmap.
> > 
> 
> The bitmap can help to solve the issue, but the extra memory consumption due
> to the bitmap is a concern, as you mentioned previously. More information
> about the issue can be found in [1]. On arm64, multiple guest physical pages
> are used by the VGIC/ITS to store its state during migration or system shutdown.
> 
> [1] https://lore.kernel.org/kvmarm/320005d1-fe88-fd6a-be91-ddb56f1aa80f@xxxxxxxxxx/
> 
> >> Because dirtying memory outside of a vcpu context makes it incredibly awkward
> >> to handle a "ring full" condition?
> > 
> > Kicking all vCPUs with the soft-full request isn't _that_ awkward.  It's certainly
> > sub-optimal, but if inserting into the per-VM ring is relatively rare, then in
> > practice it's unlikely to impact guest performance.
> > 
> 
> It's still possible for the per-vCPU ring to become hard-full before the
> vCPU can be kicked. A per-VM ring has other issues, one of which is the
> synchronization between KVM and userspace needed to avoid overrunning it.
> The bitmap was selected due to its simplicity.

Exactly. And once you overflow a ring because the device generates too
much data, what do you do? Return an error to the device?

> 
> >>>> Introduce a new flavor of dirty ring that requires the use of both vCPU
> >>>> dirty rings and a dirty bitmap. The expectation is that for non-vCPU
> >>>> sources of dirty memory (such as the GIC ITS on arm64), KVM writes to
> >>>> the dirty bitmap. Userspace should scan the dirty bitmap before
> >>>> migrating the VM to the target.
> >>>> 
> >>>> Use an additional capability to advertise this behavior and require
> >>>> explicit opt-in to avoid breaking the existing dirty ring ABI. And yes,
> >>>> you can use this with your preferred flavor of DIRTY_RING[_ACQ_REL]. Do
> >>>> not allow userspace to enable dirty ring if it hasn't also enabled the
> >>>> ring && bitmap capability, as a VM is likely DOA without the pages
> >>>> marked in the bitmap.
> >> 
> >> This is wrong. The *only* case this is useful is when there is an
> >> in-kernel producer of data outside of the context of a vcpu, which is
> >> so far only the ITS save mechanism. No ITS? No need for this.
> > 
> > How large is the ITS?  If it's a fixed, small size, could we treat the ITS as a
> > one-off case for now?  E.g. do something gross like shove entries into vcpu0's
> > dirty ring?
> > 
> 
> There are several VGIC/ITS tables involved in the issue. I checked the
> specification and the implementation. As the device ID is 16 bits wide,
> there can be up to 0x10000 devices. Each device has its own ITT (Interrupt
> Translation Table), indexed by a 32-bit event ID. The memory used for the
> ITTs can therefore be very large in theory.
> 
>     Register       Description           Max-size   Entry-size  Max-entries
>     -----------------------------------------------------------------------
>     GITS_BASER0    ITS Device Table      512KB      8-bytes     0x10000
>     GITS_BASER1    ITS Collection Table  512KB      8-bytes     0x10000

Both can be two levels. So you can multiply the max size by 64K. The
entry size also depends on the revision of the ABI and can be changed
anytime we see fit.

>     GITS_BASER2    (GICv4) ITS VPE Table 512KB      8-bytes(?)  0x10000

We don't virtualise GICv4. We use GICv4 to virtualise a GICv3. So this
table will never be saved (the guest never sees it, and only KVM
manages it).

>     max-devices * (1UL << event_id_shift) * entry_size =
>     0x10000 * (1UL << 32) * 8                          = 2PiB
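
Expanding the arithmetic, as a sanity check of that theoretical upper
bound:

	0x10000 * (1UL << 32) * 8 bytes
		= 2^16 * 2^32 * 2^3 bytes
		= 2^51 bytes
		= 2PiB
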
> 
> >> Userspace knows what it has created in the first place, and should be in
> >> charge of it (i.e. I want to be able to migrate my GICv2 and
> >> GICv3-without-ITS VMs with the rings only).
> > 
> > Ah, so enabling the dirty bitmap isn't strictly required.  That means this patch
> > is wrong, and it also means that we need to figure out how we want to handle the
> > case where mark_page_dirty_in_slot() is invoked without a running vCPU on a memslot
> > without a dirty_bitmap.
> > 
> > I.e. what's an appropriate action in the below sequence:
> > 
> > void mark_page_dirty_in_slot(struct kvm *kvm,
> > 			     const struct kvm_memory_slot *memslot,
> > 		 	     gfn_t gfn)
> > {
> > 	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
> > 
> > #ifdef CONFIG_HAVE_KVM_DIRTY_RING
> > 	if (WARN_ON_ONCE(vcpu && vcpu->kvm != kvm))
> > 		return;
> > 
> > #ifndef CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP
> > 	if (WARN_ON_ONCE(!vcpu))
> > 		return;
> > #endif
> > #endif
> > 
> > 	if (memslot && kvm_slot_dirty_track_enabled(memslot)) {
> > 		unsigned long rel_gfn = gfn - memslot->base_gfn;
> > 		u32 slot = (memslot->as_id << 16) | memslot->id;
> > 
> > 		if (vcpu && kvm->dirty_ring_size)
> > 			kvm_dirty_ring_push(&vcpu->dirty_ring,
> > 					    slot, rel_gfn);
> > 		else if (memslot->dirty_bitmap)
> > 			set_bit_le(rel_gfn, memslot->dirty_bitmap);
> > 		else
> > 			???? <=================================================
> > 	}
> > }
> > 
> > 
> > Would it be possible to require a dirty bitmap when an ITS is
> > created?  That would allow treating the above condition as a KVM
> > bug.
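
If we go down that road, the "????" above presumably collapses into
something like this (purely illustrative, not part of this series):

		if (vcpu && kvm->dirty_ring_size)
			kvm_dirty_ring_push(&vcpu->dirty_ring, slot, rel_gfn);
		else if (memslot->dirty_bitmap)
			set_bit_le(rel_gfn, memslot->dirty_bitmap);
		else
			/*
			 * No running vCPU and no dirty bitmap on the slot:
			 * only reachable if the "bitmap is mandatory once an
			 * ITS exists" rule was violated, so scream instead
			 * of silently losing the dirty page.
			 */
			WARN_ON_ONCE(1);
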
> > 
> 
> According to the above calculation, it's impossible to determine in
> advance which memory the bitmap would need to cover. The memory used by
> the ITE (Interrupt Translation Entry) tables can, in theory, span all of
> the guest's system memory. The ITE tables are scattered across guest
> memory, their locations aren't known in advance, and they are created
> dynamically on request from the guest.
> 
> However, I think it's a good idea to enable the bitmap only when
> "arm-its-kvm" is actually used by userspace (QEMU). For example, the
> machine and (KVM) accelerator are initialized as below. It isn't known
> whether "arm-its-kvm" is in use until (c). So we can enable
> KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP in (d) and the bitmap is created by
> KVM in (e).
> 
>   main
>     qemu_init
>       qemu_create_machine                   (a) machine instance is created
>       configure_accelerators
>         do_configure_accelerator
>           accel_init_machine
>             kvm_init                        (b) KVM is initialized
>       :
>       qmp_x_exit_preconfig
>         qemu_init_board
>           machine_run_board_init            (c) The board is initialized
>       :
>       accel_setup_post                      (d) KVM is post initialized
>       :
>       <migration>                           (e) Migration starts
> 
> In order to record whether the bitmap is really needed, "struct
> kvm::dirty_ring_with_bitmap" is still required.
> 
>    - KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP is advertised when
>      CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP is selected.
> 
>    - KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP is enabled in (d) only when
>      "arm-its-kvm" is used in QEMU. After the capability is enabled,
>      "struct kvm::dirty_ring_with_bitmap" is set to 1.
> 
>    - The bitmap is created by KVM in (e).
> 
> If the above analysis makes sense, I don't see anything missing from
> the patch. Of course, KVM_CAP_DIRTY_LOG_RING_{ACQ_REL, WITH_BITMAP}
> need to be enabled separately and shouldn't depend on each other. The
> description added to "Documentation/virt/kvm/api.rst" needs to be
> improved as Peter and Oliver suggested, kvm_dirty_ring_exclusive()
> needs to be renamed to kvm_use_dirty_bitmap(), and the "#ifdef"s need
> to be cut down as Sean suggested.

Frankly, I really hate the "mayo and ketchup" approach. The two dirty
tracking approaches serve different purposes, and I really don't see
the point in merging them behind a single cap. Userspace should be
able to choose if and when it wants to use one logging method or the
other. We should document how they interact, but that's about it.
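
To be concrete, the interaction I'd want documented amounts to no more
than the sketch below (a minimal userspace sketch, assuming the cap
names from this series; vm_fd and vm_has_its are placeholders, and the
rings still have to be enabled before any vCPU is created):

	#include <stdbool.h>
	#include <string.h>
	#include <sys/ioctl.h>
	#include <err.h>
	#include <linux/kvm.h>

	static void enable_dirty_logging(int vm_fd, bool vm_has_its)
	{
		struct kvm_enable_cap cap;

		/* Per-vCPU rings, 64KiB each, with acquire/release ordering. */
		memset(&cap, 0, sizeof(cap));
		cap.cap = KVM_CAP_DIRTY_LOG_RING_ACQ_REL;
		cap.args[0] = 64 * 1024;
		if (ioctl(vm_fd, KVM_ENABLE_CAP, &cap))
			err(1, "KVM_CAP_DIRTY_LOG_RING_ACQ_REL");

		/*
		 * Only a board with an in-kernel ITS needs the bitmap. It is
		 * then scanned once, via KVM_GET_DIRTY_LOG, right before the
		 * final switchover, after the ITS tables have been saved.
		 */
		if (vm_has_its) {
			memset(&cap, 0, sizeof(cap));
			cap.cap = KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP;
			if (ioctl(vm_fd, KVM_ENABLE_CAP, &cap))
				err(1, "KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP");
		}
	}
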


> 
> 
> >>>> @@ -4499,6 +4507,11 @@ static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u32 size)
> >>>>   {
> >>>>   	int r;
> >>>>   +#ifdef CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP
> >>>> +	if (!kvm->dirty_ring_with_bitmap)
> >>>> +		return -EINVAL;
> >>>> +#endif
> >>> 
> >>> This one at least is prettier with IS_ENABLED
> >>> 
> >>> 	if (IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP) &&
> >>> 	    !kvm->dirty_ring_with_bitmap)
> >>> 		return -EINVAL;
> >>> 
> >>> But dirty_ring_with_bitmap really shouldn't need to exist.  It's
> >>> mandatory for architectures that have
> >>> HAVE_KVM_DIRTY_RING_WITH_BITMAP, and unsupported for architectures
> >>> that don't.  In other words, the API for enabling the dirty ring is
> >>> a bit ugly.
> >>> 
> >>> Rather than add KVM_CAP_DIRTY_LOG_RING_ACQ_REL, which hasn't been
> >>> officially released yet, and then KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP
> >>> on top, what about usurping bits 63:32 of cap->args[0] for flags?
> >>> E.g.
> > 
> > For posterity, filling in my missing idea...
> > 
> > Since the size is restricted to be well below a 32-bit value, and it's unlikely
> > that KVM will ever support 4GiB per-vCPU rings, we could usurp the upper bits for
> > flags:
> > 
> >    static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u64 arg0)
> >    {
> > 	u32 flags = arg0 >> 32;
> > 	u32 size = arg0;
> > 
> > However, since it sounds like enabling dirty_bitmap isn't strictly
> > required, I have no objection to enabling
> > KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP; my objection was purely that
> > KVM was adding a per-VM flag just to sanity check the
> > configuration.
> > 
> 
> If KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP is enabled for "arm-its-kvm",
> it'd be better to allow enabling the two capabilities (ACQ_REL and
> WITH_BITMAP) separately, as I explained above. Userspace (QEMU) gains
> flexibility if the two capabilities can be enabled independently.
> 
> To QEMU, KVM_CAP_DIRTY_LOG_RING and KVM_CAP_DIRTY_LOG_RING_ACQ_REL
> are accelerator properties, while KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP
> is a board property. Relaxing their dependency will give QEMU more
> flexibility.

That.
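
And on the QEMU side, once the board has decided to instantiate an
in-kernel ITS, I'd expect nothing more than something along these lines
(hand-wavy sketch, function names purely indicative):

	if (!kvm_check_extension(kvm_state, KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP) ||
	    kvm_vm_enable_cap(kvm_state, KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP, 0)) {
		error_report("KVM: cannot enable dirty ring with bitmap for the ITS");
		exit(1);
	}
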

	M.

-- 
Without deviation from the norm, progress is not possible.
_______________________________________________
kvmarm mailing list
kvmarm@xxxxxxxxxxxxxxxxxxxxx
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


