On Fri, 21 Oct 2022 00:44:51 +0100, Sean Christopherson <seanjc@xxxxxxxxxx> wrote: > > On Tue, Oct 11, 2022, Gavin Shan wrote: > > Some architectures (such as arm64) need to dirty memory outside of the > > context of a vCPU. Of course, this simply doesn't fit with the UAPI of > > KVM's per-vCPU dirty ring. > > What is the point of using the dirty ring in this case? KVM still > burns a pile of memory for the bitmap. Is the benefit that > userspace can get away with scanning the bitmap fewer times, > e.g. scan it once just before blackout under the assumption that > very few pages will dirty the bitmap? Apparently, the throttling effect of the ring makes it easier to converge. Someone who actually uses the feature should be able to tell you. But that's a policy decision, and I don't see why we should be prescriptive. > Why not add a global ring to @kvm? I assume thread safety is a > problem, but the memory overhead of the dirty_bitmap also seems like > a fairly big problem. Because we already have a stupidly bloated API surface, and that we could do without yet another one based on a sample of *one*? Because dirtying memory outside of a vcpu context makes it incredibly awkward to handle a "ring full" condition? > > > Introduce a new flavor of dirty ring that requires the use of both vCPU > > dirty rings and a dirty bitmap. The expectation is that for non-vCPU > > sources of dirty memory (such as the GIC ITS on arm64), KVM writes to > > the dirty bitmap. Userspace should scan the dirty bitmap before > > migrating the VM to the target. > > > > Use an additional capability to advertize this behavior and require > > explicit opt-in to avoid breaking the existing dirty ring ABI. And yes, > > you can use this with your preferred flavor of DIRTY_RING[_ACQ_REL]. Do > > not allow userspace to enable dirty ring if it hasn't also enabled the > > ring && bitmap capability, as a VM is likely DOA without the pages > > marked in the bitmap. This is wrong. The *only* case this is useful is when there is an in-kernel producer of data outside of the context of a vcpu, which is so far only the ITS save mechanism. No ITS? No need for this. Userspace knows what it has created the first place, and should be in charge of it (i.e. I want to be able to migrate my GICv2 and GICv3-without-ITS VMs with the rings only). > > > > Suggested-by: Marc Zyngier <maz@xxxxxxxxxx> > > Suggested-by: Peter Xu <peterx@xxxxxxxxxx> > > Co-developed-by: Oliver Upton <oliver.upton@xxxxxxxxx> > > Co-developed-by needs Oliver's SoB. > > > #ifndef CONFIG_HAVE_KVM_DIRTY_RING > > +static inline bool kvm_dirty_ring_exclusive(struct kvm *kvm) > > What about inverting the naming to better capture that this is about the dirty > bitmap, and less so about the dirty ring? It's not obvious what "exclusive" > means, e.g. I saw this stub before reading the changelog and assumed it was > making a dirty ring exclusive to something. > > Something like this? > > bool kvm_use_dirty_bitmap(struct kvm *kvm) > { > return !kvm->dirty_ring_size || kvm->dirty_ring_with_bitmap; > } > > > @@ -3305,15 +3305,20 @@ void mark_page_dirty_in_slot(struct kvm *kvm, > > struct kvm_vcpu *vcpu = kvm_get_running_vcpu(); > > > > #ifdef CONFIG_HAVE_KVM_DIRTY_RING > > - if (WARN_ON_ONCE(!vcpu) || WARN_ON_ONCE(vcpu->kvm != kvm)) > > + if (WARN_ON_ONCE(vcpu && vcpu->kvm != kvm)) > > return; > > + > > +#ifndef CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP > > + if (WARN_ON_ONCE(!vcpu)) > > To cut down on the #ifdefs, this can be: > > if (WARN_ON_ONCE(!IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP) && !vcpu) > > though that's arguably even harder to read. Blech. > > > + return; > > +#endif > > #endif > > > > if (memslot && kvm_slot_dirty_track_enabled(memslot)) { > > unsigned long rel_gfn = gfn - memslot->base_gfn; > > u32 slot = (memslot->as_id << 16) | memslot->id; > > > > - if (kvm->dirty_ring_size) > > + if (vcpu && kvm->dirty_ring_size) > > kvm_dirty_ring_push(&vcpu->dirty_ring, > > slot, rel_gfn); > > else > > @@ -4485,6 +4490,9 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg) > > return KVM_DIRTY_RING_MAX_ENTRIES * sizeof(struct kvm_dirty_gfn); > > #else > > return 0; > > +#endif > > +#ifdef CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP > > + case KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP: > > #endif > > case KVM_CAP_BINARY_STATS_FD: > > case KVM_CAP_SYSTEM_EVENT_DATA: > > @@ -4499,6 +4507,11 @@ static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u32 size) > > { > > int r; > > > > +#ifdef CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP > > + if (!kvm->dirty_ring_with_bitmap) > > + return -EINVAL; > > +#endif > > This one at least is prettier with IS_ENABLED > > if (IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP) && > !kvm->dirty_ring_with_bitmap) > return -EINVAL; > > But dirty_ring_with_bitmap really shouldn't need to exist. It's > mandatory for architectures that have > HAVE_KVM_DIRTY_RING_WITH_BITMAP, and unsupported for architectures > that don't. In other words, the API for enabling the dirty ring is > a bit ugly. > > Rather than add KVM_CAP_DIRTY_LOG_RING_ACQ_REL, which hasn't been > officially released yet, and then KVM_CAP_DIRTY_LOG_ING_WITH_BITMAP > on top, what about usurping bits 63:32 of cap->args[0] for flags? > E.g. > > Ideally we'd use cap->flags directly, but we screwed up with > KVM_CAP_DIRTY_LOG_RING and didn't require flags to be zero :-( > > Actually, what's the point of allowing > KVM_CAP_DIRTY_LOG_RING_ACQ_REL to be enabled? I get why KVM would > enumerate this info, i.e. allowing checking, but I don't seen any > value in supporting a second method for enabling the dirty ring. > > The acquire-release thing is irrelevant for x86, and no other > architecture supports the dirty ring until this series, i.e. there's > no need for KVM to detect that userspace has been updated to gain > acquire-release semantics, because the fact that userspace is > enabling the dirty ring on arm64 means userspace has been updated. Do we really need to make the API more awkward? There is an established pattern of "enable what is advertised". Some level of uniformity wouldn't hurt, really. M. -- Without deviation from the norm, progress is not possible.