On Thu, 9 Jan 2020 11:29:28 -0500
"Michael S. Tsirkin" <mst@xxxxxxxxxx> wrote:

> On Thu, Jan 09, 2020 at 09:57:20AM -0500, Peter Xu wrote:
> > This patch is heavily based on previous work from Lei Cao
> > <lei.cao@xxxxxxxxxxx> and Paolo Bonzini <pbonzini@xxxxxxxxxx>. [1]
> > 
> > KVM currently uses large bitmaps to track dirty memory. These bitmaps
> > are copied to userspace when userspace queries KVM for its dirty page
> > information. The use of bitmaps is mostly sufficient for live
> > migration, as large parts of memory are dirtied from one log-dirty
> > pass to another. However, in a checkpointing system, the number of
> > dirty pages is small and in fact it is often bounded---the VM is
> > paused when it has dirtied a pre-defined number of pages. Traversing a
> > large, sparsely populated bitmap to find set bits is time-consuming,
> > as is copying the bitmap to user-space.
> > 
> > A similar issue will be there for live migration when the guest memory
> > is huge while the page dirty procedure is trivial. In that case for
> > each dirty sync we need to pull the whole dirty bitmap to userspace
> > and analyse every bit even if it's mostly zeros.
> > 
> > The preferred data structure for the above scenarios is a dense list
> > of guest frame numbers (GFN).
> 
> No longer, this uses an array of structs.
> 
> > This patch series stores the dirty list in
> > kernel memory that can be memory mapped into userspace to allow speedy
> > harvesting.
> > 
> > This patch enables the dirty ring for x86 only. However it should be
> > easily extended to other archs as well.
> > 
> > [1] https://patchwork.kernel.org/patch/10471409/
> > 
> > Signed-off-by: Lei Cao <lei.cao@xxxxxxxxxxx>
> > Signed-off-by: Paolo Bonzini <pbonzini@xxxxxxxxxx>
> > Signed-off-by: Peter Xu <peterx@xxxxxxxxxx>
> > ---
> >  Documentation/virt/kvm/api.txt  |  89 ++++++++++++++++++
> >  arch/x86/include/asm/kvm_host.h |   3 +
> >  arch/x86/include/uapi/asm/kvm.h |   1 +
> >  arch/x86/kvm/Makefile           |   3 +-
> >  arch/x86/kvm/mmu/mmu.c          |   6 ++
> >  arch/x86/kvm/vmx/vmx.c          |   7 ++
> >  arch/x86/kvm/x86.c              |   9 ++
> >  include/linux/kvm_dirty_ring.h  |  55 +++++++++++
> >  include/linux/kvm_host.h        |  26 +++++
> >  include/trace/events/kvm.h      |  78 +++++++++++++++
> >  include/uapi/linux/kvm.h        |  33 +++++++
> >  virt/kvm/dirty_ring.c           | 162 ++++++++++++++++++++++++++++++++
> >  virt/kvm/kvm_main.c             | 137 ++++++++++++++++++++++++++-
> >  13 files changed, 606 insertions(+), 3 deletions(-)
> >  create mode 100644 include/linux/kvm_dirty_ring.h
> >  create mode 100644 virt/kvm/dirty_ring.c
> > 
> > diff --git a/Documentation/virt/kvm/api.txt b/Documentation/virt/kvm/api.txt
> > index ebb37b34dcfc..708c3e0f7eae 100644
> > --- a/Documentation/virt/kvm/api.txt
> > +++ b/Documentation/virt/kvm/api.txt
> > @@ -231,6 +231,7 @@ Based on their initialization different VMs may have different capabilities.
> >  It is thus encouraged to use the vm ioctl to query for capabilities (available
> >  with KVM_CAP_CHECK_EXTENSION_VM on the vm fd)
> >  
> > +
> >  4.5 KVM_GET_VCPU_MMAP_SIZE
> >  
> >  Capability: basic
> > @@ -243,6 +244,18 @@ The KVM_RUN ioctl (cf.) communicates with userspace via a shared
> >  memory region. This ioctl returns the size of that region. See the
> >  KVM_RUN documentation for details.
> >  
> > +Besides the size of the KVM_RUN communication region, other areas of
> > +the VCPU file descriptor can be mmap-ed, including:
> > +
> > +- if KVM_CAP_COALESCED_MMIO is available, a page at
> > +  KVM_COALESCED_MMIO_PAGE_OFFSET * PAGE_SIZE; for historical reasons,
> > +  this page is included in the result of KVM_GET_VCPU_MMAP_SIZE.
> > +  KVM_CAP_COALESCED_MMIO is not documented yet.
> > +
> > +- if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at
> > +  KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE. For more information on
> > +  KVM_CAP_DIRTY_LOG_RING, see section 8.3.
> > +
> >  
> >  4.6 KVM_SET_MEMORY_REGION
> >  
> > @@ -5376,6 +5389,7 @@ CPU when the exception is taken. If this virtual SError is taken to EL1 using
> >  AArch64, this value will be reported in the ISS field of ESR_ELx.
> >  
> >  See KVM_CAP_VCPU_EVENTS for more details.
> > +
> >  8.20 KVM_CAP_HYPERV_SEND_IPI
> >  
> >  Architectures: x86
> > @@ -5383,6 +5397,7 @@ Architectures: x86
> >  This capability indicates that KVM supports paravirtualized Hyper-V IPI send
> >  hypercalls:
> >  HvCallSendSyntheticClusterIpi, HvCallSendSyntheticClusterIpiEx.
> > +
> >  8.21 KVM_CAP_HYPERV_DIRECT_TLBFLUSH
> >  
> >  Architecture: x86
> > @@ -5396,3 +5411,77 @@ handling by KVM (as some KVM hypercall may be mistakenly treated as TLB
> >  flush hypercalls by Hyper-V) so userspace should disable KVM identification
> >  in CPUID and only exposes Hyper-V identification. In this case, guest
> >  thinks it's running on Hyper-V and only use Hyper-V hypercalls.
> > +
> > +8.22 KVM_CAP_DIRTY_LOG_RING
> > +
> > +Architectures: x86
> > +Parameters: args[0] - size of the dirty log ring
> > +
> > +KVM is capable of tracking dirty memory using ring buffers that are
> > +mmaped into userspace; there is one dirty ring per vcpu.
> > +
> > +One dirty ring is defined as below internally:
> > +
> > +struct kvm_dirty_ring {
> > +        u32 dirty_index;
> > +        u32 reset_index;
> > +        u32 size;
> > +        u32 soft_limit;
> > +        struct kvm_dirty_gfn *dirty_gfns;
> > +        struct kvm_dirty_ring_indices *indices;
> > +        int index;
> > +};
> > +
> > +Dirty GFNs (Guest Frame Numbers) are stored in the dirty_gfns array.
> > +For each of the dirty entry it's defined as:
> > +
> > +struct kvm_dirty_gfn {
> > +        __u32 pad;
> 
> How about sticking a length here?
> This way huge pages can be dirtied in one go.

Not just huge pages, but any contiguous range of dirty pages could be
reported far more concisely. Thanks,

Alex
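
To make that suggestion concrete, one way to express it is to let each
ring entry describe a run of pages rather than a single GFN. The layout
below is only an illustrative sketch: the slot and offset fields are
assumed from the rest of the posted entry, and nr_pages is the
hypothetical addition under discussion, not part of this series' uAPI.

struct kvm_dirty_gfn {
        __u32 pad;
        __u32 slot;      /* assumed: (address space id << 16) | memslot id */
        __u64 offset;    /* assumed: first dirty GFN within the slot */
        __u64 nr_pages;  /* hypothetical: length of the dirty run, in pages */
};

With a run length, a dirty 2MB huge page could be reported as a single
entry with nr_pages = 512, and KVM could likewise merge neighbouring
dirty GFNs in the same slot instead of logging them one entry at a time.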
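
Separately, on the mmap layout described in the quoted documentation:
the per-vcpu ring is reached by mmap()ing the vcpu file descriptor at
the advertised page offset. A minimal userspace sketch, assuming
KVM_CAP_DIRTY_LOG_RING has already been enabled and that ring_bytes is
the ring size derived from the args[0] value used when enabling it:

#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/kvm.h>  /* assumed to pull in KVM_DIRTY_LOG_PAGE_OFFSET
                           from the series' uapi headers */

/* Map the per-vcpu dirty ring described in section 8.22 above;
   error handling is omitted for brevity. */
static void *map_dirty_ring(int vcpu_fd, size_t ring_bytes)
{
        long page_size = sysconf(_SC_PAGESIZE);

        return mmap(NULL, ring_bytes, PROT_READ | PROT_WRITE, MAP_SHARED,
                    vcpu_fd, (off_t)KVM_DIRTY_LOG_PAGE_OFFSET * page_size);
}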