On Sun, 06 Nov 2022 21:40:49 +0000, Gavin Shan <gshan@xxxxxxxxxx> wrote: > > Hi Marc, > > On 11/6/22 11:43 PM, Marc Zyngier wrote: > > On Fri, 04 Nov 2022 23:40:45 +0000, > > Gavin Shan <gshan@xxxxxxxxxx> wrote: > >> > >> ARM64 needs to dirty memory outside of a VCPU context when VGIC/ITS is > >> enabled. It's conflicting with that ring-based dirty page tracking always > >> requires a running VCPU context. > >> > >> Introduce a new flavor of dirty ring that requires the use of both VCPU > >> dirty rings and a dirty bitmap. The expectation is that for non-VCPU > >> sources of dirty memory (such as the VGIC/ITS on arm64), KVM writes to > >> the dirty bitmap. Userspace should scan the dirty bitmap before migrating > >> the VM to the target. > >> > >> Use an additional capability to advertise this behavior. The newly added > >> capability (KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP) can't be enabled before > >> KVM_CAP_DIRTY_LOG_RING_ACQ_REL on ARM64. In this way, the newly added > >> capability is treated as an extension of KVM_CAP_DIRTY_LOG_RING_ACQ_REL. > >> > >> Suggested-by: Marc Zyngier <maz@xxxxxxxxxx> > >> Suggested-by: Peter Xu <peterx@xxxxxxxxxx> > >> Co-developed-by: Oliver Upton <oliver.upton@xxxxxxxxx> > >> Signed-off-by: Oliver Upton <oliver.upton@xxxxxxxxx> > >> Signed-off-by: Gavin Shan <gshan@xxxxxxxxxx> > >> Acked-by: Peter Xu <peterx@xxxxxxxxxx> > >> --- > >> Documentation/virt/kvm/api.rst | 33 ++++++++++++++++++----- > >> include/linux/kvm_dirty_ring.h | 7 +++++ > >> include/linux/kvm_host.h | 1 + > >> include/uapi/linux/kvm.h | 1 + > >> virt/kvm/Kconfig | 8 ++++++ > >> virt/kvm/dirty_ring.c | 10 +++++++ > >> virt/kvm/kvm_main.c | 49 +++++++++++++++++++++++++++------- > >> 7 files changed, 93 insertions(+), 16 deletions(-) > >> > >> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst > >> index eee9f857a986..2ec32bd41792 100644 > >> --- a/Documentation/virt/kvm/api.rst > >> +++ b/Documentation/virt/kvm/api.rst > >> @@ -8003,13 +8003,6 @@ flushing is done by the KVM_GET_DIRTY_LOG ioctl). To achieve that, one > >> needs to kick the vcpu out of KVM_RUN using a signal. The resulting > >> vmexit ensures that all dirty GFNs are flushed to the dirty rings. > >> -NOTE: the capability KVM_CAP_DIRTY_LOG_RING and the > >> corresponding > >> -ioctl KVM_RESET_DIRTY_RINGS are mutual exclusive to the existing ioctls > >> -KVM_GET_DIRTY_LOG and KVM_CLEAR_DIRTY_LOG. After enabling > >> -KVM_CAP_DIRTY_LOG_RING with an acceptable dirty ring size, the virtual > >> -machine will switch to ring-buffer dirty page tracking and further > >> -KVM_GET_DIRTY_LOG or KVM_CLEAR_DIRTY_LOG ioctls will fail. > >> - > >> NOTE: KVM_CAP_DIRTY_LOG_RING_ACQ_REL is the only capability that > >> should be exposed by weakly ordered architecture, in order to indicate > >> the additional memory ordering requirements imposed on userspace when > >> @@ -8018,6 +8011,32 @@ Architecture with TSO-like ordering (such as x86) are allowed to > >> expose both KVM_CAP_DIRTY_LOG_RING and KVM_CAP_DIRTY_LOG_RING_ACQ_REL > >> to userspace. > >> +After using the dirty rings, the userspace needs to detect the > >> capability > > > > using? or enabling? What comes after suggest the latter. > > > > s/using/enabling in next revision :) > > >> +of KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP to see whether the ring structures > >> +need to be backed by per-slot bitmaps. With this capability advertised > >> +and supported, it means the architecture can dirty guest pages without > > > > If it is advertised, it is supported, right? > > > > Yes, s/advertised and supported/advertised in next revision. > > >> +vcpu/ring context, so that some of the dirty information will still be > >> +maintained in the bitmap structure. KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP > >> +can't be enabled until the capability of KVM_CAP_DIRTY_LOG_RING_ACQ_REL > >> +has been enabled. > >> + > >> +Note that the bitmap here is only a backup of the ring structure, and > >> +normally should only contain a very small amount of dirty pages, which > > > > I don't think we can claim this. It is whatever amount of memory is > > dirtied outside of a vcpu context, and we shouldn't make any claim > > regarding the number of dirty pages. > > > > It's the pre-requisite to use the backup bitmap. Otherwise, the guest > will experience long down-time during migration, as mentioned by Peter > in another thread. So it's appropriate to mention the limit of dirty > pages here. See my alternative wording for this in the other sub-thread. > > >> +needs to be transferred during VM downtime. Collecting the dirty bitmap > >> +should be the very last thing that the VMM does before transmitting state > >> +to the target VM. VMM needs to ensure that the dirty state is final and > >> +avoid missing dirty pages from another ioctl ordered after the bitmap > >> +collection. > >> + > >> +To collect dirty bits in the backup bitmap, the userspace can use the > >> +same KVM_GET_DIRTY_LOG ioctl. KVM_CLEAR_DIRTY_LOG shouldn't be needed > >> +and its behavior is undefined since collecting the dirty bitmap always > >> +happens in the last phase of VM's migration. > > > > It isn't clear to me why KVM_CLEAR_DIRTY_LOG should be called out. If > > you have multiple devices that dirty the memory, such as multiple > > ITSs, why shouldn't userspace be allowed to snapshot the dirty state > > multiple time? This doesn't seem like a reasonable restriction, and I > > really dislike the idea of undefined behaviour here. > > > > It was actually documenting the expected QEMU's usage. With QEMU > excluded, KVM_CLEAR_DIRTY_LOG can be used as usual. Undefined behavior > seems not precise here. We can improve it like below, to avoid talking > about 'undefined behaviour'. > > To collect dirty bits in the backup bitmap, the userspace can use the > same KVM_GET_DIRTY_LOG ioctl. KVM_CLEAR_DIRTY_LOG shouldn't be needed > since collecting the dirty bitmap always happens in the last phase of > VM's migration. That's better, but the "shouldn't be needed" wording makes things ambiguous, and we shouldn't mention migration at all (this is not the only purpose of this API). I'd suggest this: To collect dirty bits in the backup bitmap, userspace can use the same KVM_GET_DIRTY_LOG ioctl. KVM_CLEAR_DIRTY_LOG isn't needed as long as all the generation of the dirty bits is done in a single pass. Thanks, M. -- Without deviation from the norm, progress is not possible.