On Fri, 21 Oct 2022 17:05:26 +0100, Sean Christopherson <seanjc@xxxxxxxxxx> wrote: > > On Fri, Oct 21, 2022, Marc Zyngier wrote: > > On Fri, 21 Oct 2022 00:44:51 +0100, > > Sean Christopherson <seanjc@xxxxxxxxxx> wrote: > > > > > > On Tue, Oct 11, 2022, Gavin Shan wrote: > > > > Some architectures (such as arm64) need to dirty memory outside of the > > > > context of a vCPU. Of course, this simply doesn't fit with the UAPI of > > > > KVM's per-vCPU dirty ring. > > > > > > What is the point of using the dirty ring in this case? KVM still > > > burns a pile of memory for the bitmap. Is the benefit that > > > userspace can get away with scanning the bitmap fewer times, > > > e.g. scan it once just before blackout under the assumption that > > > very few pages will dirty the bitmap? > > > > Apparently, the throttling effect of the ring makes it easier to > > converge. Someone who actually uses the feature should be able to > > tell you. But that's a policy decision, and I don't see why we should > > be prescriptive. > > I wasn't suggesting we be prescriptive, it was an honest question. > > > > Why not add a global ring to @kvm? I assume thread safety is a > > > problem, but the memory overhead of the dirty_bitmap also seems like > > > a fairly big problem. > > > > Because we already have a stupidly bloated API surface, and that we > > could do without yet another one based on a sample of *one*? > > But we're adding a new API regardless. A per-VM ring would > definitely be a bigger addition, but if using the dirty_bitmap won't > actually meet the needs of userspace, then we'll have added a new > API and still not have solved the problem. That's why I was asking > why/when userspace would want to use dirty_ring+dirty_bitmap. Whenever dirty pages can be generated outside of the context of a vcpu running. And that's anything that comes from *devices*. > > > Because dirtying memory outside of a vcpu context makes it > > incredibly awkward to handle a "ring full" condition? > > Kicking all vCPUs with the soft-full request isn't _that_ awkward. > It's certainly sub-optimal, but if inserting into the per-VM ring is > relatively rare, then in practice it's unlikely to impact guest > performance. But there is *nothing* to kick here. The kernel is dirtying pages, devices are dirtying pages (DMA), and there is no context associated with that. Which is why a finite ring is the wrong abstraction. > > > > > Introduce a new flavor of dirty ring that requires the use of both vCPU > > > > dirty rings and a dirty bitmap. The expectation is that for non-vCPU > > > > sources of dirty memory (such as the GIC ITS on arm64), KVM writes to > > > > the dirty bitmap. Userspace should scan the dirty bitmap before > > > > migrating the VM to the target. > > > > > > > > Use an additional capability to advertize this behavior and require > > > > explicit opt-in to avoid breaking the existing dirty ring ABI. And yes, > > > > you can use this with your preferred flavor of DIRTY_RING[_ACQ_REL]. Do > > > > not allow userspace to enable dirty ring if it hasn't also enabled the > > > > ring && bitmap capability, as a VM is likely DOA without the pages > > > > marked in the bitmap. > > > > This is wrong. The *only* case this is useful is when there is an > > in-kernel producer of data outside of the context of a vcpu, which is > > so far only the ITS save mechanism. No ITS? No need for this. > > How large is the ITS? If it's a fixed, small size, could we treat > the ITS as a one-off case for now? E.g. do something gross like > shove entries into vcpu0's dirty ring? The tables can be arbitrarily large, sparse, and are under control of the guest anyway. And no, I'm not entertaining anything that gross. I'm actually quite happy with not supporting the dirty ring and stick with the bitmap that doesn't have any of these problems. > > > Userspace knows what it has created the first place, and should be in > > charge of it (i.e. I want to be able to migrate my GICv2 and > > GICv3-without-ITS VMs with the rings only). > > Ah, so enabling the dirty bitmap isn't strictly required. That > means this patch is wrong, and it also means that we need to figure > out how we want to handle the case where mark_page_dirty_in_slot() > is invoked without a running vCPU on a memslot without a > dirty_bitmap. > > I.e. what's an appropriate action in the below sequence: > > void mark_page_dirty_in_slot(struct kvm *kvm, > const struct kvm_memory_slot *memslot, > gfn_t gfn) > { > struct kvm_vcpu *vcpu = kvm_get_running_vcpu(); > > #ifdef CONFIG_HAVE_KVM_DIRTY_RING > if (WARN_ON_ONCE(vcpu && vcpu->kvm != kvm)) > return; > > #ifndef CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP > if (WARN_ON_ONCE(!vcpu)) > return; > #endif > #endif > > if (memslot && kvm_slot_dirty_track_enabled(memslot)) { > unsigned long rel_gfn = gfn - memslot->base_gfn; > u32 slot = (memslot->as_id << 16) | memslot->id; > > if (vcpu && kvm->dirty_ring_size) > kvm_dirty_ring_push(&vcpu->dirty_ring, > slot, rel_gfn); > else if (memslot->dirty_bitmap) > set_bit_le(rel_gfn, memslot->dirty_bitmap); > else > ???? <================================================= > } > } > > > Would it be possible to require a dirty bitmap when an ITS is > created? That would allow treating the above condition as a KVM > bug. No. This should be optional. Everything about migration should be absolutely optional (I run plenty of concurrent VMs on sub-2GB systems). You want to migrate a VM that has an ITS or will collect dirty bits originating from a SMMU with HTTU, you enable the dirty bitmap. You want to have *vcpu* based dirty rings, you enable them. In short, there shouldn't be any reason for the two are either mandatory or conflated. Both should be optional, independent, because they cover completely disjoined use cases. *userspace* should be in charge of deciding this. > > > > > @@ -4499,6 +4507,11 @@ static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u32 size) > > > > { > > > > int r; > > > > > > > > +#ifdef CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP > > > > + if (!kvm->dirty_ring_with_bitmap) > > > > + return -EINVAL; > > > > +#endif > > > > > > This one at least is prettier with IS_ENABLED > > > > > > if (IS_ENABLED(CONFIG_HAVE_KVM_DIRTY_RING_WITH_BITMAP) && > > > !kvm->dirty_ring_with_bitmap) > > > return -EINVAL; > > > > > > But dirty_ring_with_bitmap really shouldn't need to exist. It's > > > mandatory for architectures that have > > > HAVE_KVM_DIRTY_RING_WITH_BITMAP, and unsupported for architectures > > > that don't. In other words, the API for enabling the dirty ring is > > > a bit ugly. > > > > > > Rather than add KVM_CAP_DIRTY_LOG_RING_ACQ_REL, which hasn't been > > > officially released yet, and then KVM_CAP_DIRTY_LOG_ING_WITH_BITMAP > > > on top, what about usurping bits 63:32 of cap->args[0] for flags? > > > E.g. > > For posterity, filling in my missing idea... > > Since the size is restricted to be well below a 32-bit value, and it's unlikely > that KVM will ever support 4GiB per-vCPU rings, we could usurp the upper bits for > flags: > > static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u64 arg0) > { > u32 flags = arg0 >> 32; > u32 size = arg0; > > However, since it sounds like enabling dirty_bitmap isn't strictly required, I > have no objection to enabling KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP, my objection > was purely that KVM was adding a per-VM flag just to sanity check the configuration. > > > > Ideally we'd use cap->flags directly, but we screwed up with > > > KVM_CAP_DIRTY_LOG_RING and didn't require flags to be zero :-( > > > > > > Actually, what's the point of allowing > > > KVM_CAP_DIRTY_LOG_RING_ACQ_REL to be enabled? I get why KVM would > > > enumerate this info, i.e. allowing checking, but I don't seen any > > > value in supporting a second method for enabling the dirty ring. > > > > > > The acquire-release thing is irrelevant for x86, and no other > > > architecture supports the dirty ring until this series, i.e. there's > > > no need for KVM to detect that userspace has been updated to gain > > > acquire-release semantics, because the fact that userspace is > > > enabling the dirty ring on arm64 means userspace has been updated. > > > > Do we really need to make the API more awkward? There is an > > established pattern of "enable what is advertised". Some level of > > uniformity wouldn't hurt, really. > > I agree that uniformity would be nice, but for capabilities I don't > think that's ever going to happen. I'm pretty sure supporting > enabling is actually in the minority. E.g. of the 20 capabilities > handled in kvm_vm_ioctl_check_extension_generic(), I believe only 3 > support enabling (KVM_CAP_HALT_POLL, KVM_CAP_DIRTY_LOG_RING, and > KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2). I understood that you were advocating that a check for KVM_CAP_FOO could result in enabling KVM_CAP_BAR. That I definitely object to. M. -- Without deviation from the norm, progress is not possible.