On Thu, 9 Jan 2020 11:29:28 -0500
"Michael S. Tsirkin" <mst@xxxxxxxxxx> wrote:

> On Thu, Jan 09, 2020 at 09:57:20AM -0500, Peter Xu wrote:
> > This patch is heavily based on previous work from Lei Cao
> > <lei.cao@xxxxxxxxxxx> and Paolo Bonzini <pbonzini@xxxxxxxxxx>. [1]
> > 
> > KVM currently uses large bitmaps to track dirty memory. These bitmaps
> > are copied to userspace when userspace queries KVM for its dirty page
> > information. The use of bitmaps is mostly sufficient for live
> > migration, as large parts of memory are dirtied from one log-dirty
> > pass to another. However, in a checkpointing system, the number of
> > dirty pages is small and in fact it is often bounded---the VM is
> > paused when it has dirtied a pre-defined number of pages. Traversing a
> > large, sparsely populated bitmap to find set bits is time-consuming,
> > as is copying the bitmap to user-space.
> > 
> > A similar issue will be there for live migration when the guest memory
> > is huge while the page dirty procedure is trivial. In that case for
> > each dirty sync we need to pull the whole dirty bitmap to userspace
> > and analyse every bit even if it's mostly zeros.
> > 
> > The preferred data structure for the above scenarios is a dense list
> > of guest frame numbers (GFN).
> 
> No longer, this uses an array of structs.
> 
> > This patch series stores the dirty list in
> > kernel memory that can be memory mapped into userspace to allow speedy
> > harvesting.
> > 
> > This patch enables the dirty ring for x86 only. However it should be
> > easily extended to other archs as well.
> > 
> > [1] https://patchwork.kernel.org/patch/10471409/
> > 
> > Signed-off-by: Lei Cao <lei.cao@xxxxxxxxxxx>
> > Signed-off-by: Paolo Bonzini <pbonzini@xxxxxxxxxx>
> > Signed-off-by: Peter Xu <peterx@xxxxxxxxxx>
> > ---
> >  Documentation/virt/kvm/api.txt  |  89 ++++++++++++++++++
> >  arch/x86/include/asm/kvm_host.h |   3 +
> >  arch/x86/include/uapi/asm/kvm.h |   1 +
> >  arch/x86/kvm/Makefile           |   3 +-
> >  arch/x86/kvm/mmu/mmu.c          |   6 ++
> >  arch/x86/kvm/vmx/vmx.c          |   7 ++
> >  arch/x86/kvm/x86.c              |   9 ++
> >  include/linux/kvm_dirty_ring.h  |  55 +++++++++++
> >  include/linux/kvm_host.h        |  26 +++++
> >  include/trace/events/kvm.h      |  78 +++++++++++++++
> >  include/uapi/linux/kvm.h        |  33 +++++++
> >  virt/kvm/dirty_ring.c           | 162 ++++++++++++++++++++++++++++++++
> >  virt/kvm/kvm_main.c             | 137 ++++++++++++++++++++++++++-
> >  13 files changed, 606 insertions(+), 3 deletions(-)
> >  create mode 100644 include/linux/kvm_dirty_ring.h
> >  create mode 100644 virt/kvm/dirty_ring.c
> > 
> > diff --git a/Documentation/virt/kvm/api.txt b/Documentation/virt/kvm/api.txt
> > index ebb37b34dcfc..708c3e0f7eae 100644
> > --- a/Documentation/virt/kvm/api.txt
> > +++ b/Documentation/virt/kvm/api.txt
> > @@ -231,6 +231,7 @@ Based on their initialization different VMs may have different capabilities.
> >  It is thus encouraged to use the vm ioctl to query for capabilities (available
> >  with KVM_CAP_CHECK_EXTENSION_VM on the vm fd)
> >  
> > +
> >  4.5 KVM_GET_VCPU_MMAP_SIZE
> >  
> >  Capability: basic
> > @@ -243,6 +244,18 @@ The KVM_RUN ioctl (cf.) communicates with userspace via a shared
> >  memory region. This ioctl returns the size of that region. See the
> >  KVM_RUN documentation for details.
> >  
> > +Besides the size of the KVM_RUN communication region, other areas of
> > +the VCPU file descriptor can be mmap-ed, including:
> > +
> > +- if KVM_CAP_COALESCED_MMIO is available, a page at
> > +  KVM_COALESCED_MMIO_PAGE_OFFSET * PAGE_SIZE; for historical reasons,
> > +  this page is included in the result of KVM_GET_VCPU_MMAP_SIZE.
> > +  KVM_CAP_COALESCED_MMIO is not documented yet.
> > +
> > +- if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at
> > +  KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE. For more information on
> > +  KVM_CAP_DIRTY_LOG_RING, see section 8.3.
> > +
> >  
> >  4.6 KVM_SET_MEMORY_REGION
> >  
> > @@ -5376,6 +5389,7 @@ CPU when the exception is taken. If this virtual SError is taken to EL1 using
> >  AArch64, this value will be reported in the ISS field of ESR_ELx.
> >  
> >  See KVM_CAP_VCPU_EVENTS for more details.
> > +
> >  8.20 KVM_CAP_HYPERV_SEND_IPI
> >  
> >  Architectures: x86
> > @@ -5383,6 +5397,7 @@ Architectures: x86
> >  This capability indicates that KVM supports paravirtualized Hyper-V IPI send
> >  hypercalls:
> >  HvCallSendSyntheticClusterIpi, HvCallSendSyntheticClusterIpiEx.
> > +
> >  8.21 KVM_CAP_HYPERV_DIRECT_TLBFLUSH
> >  
> >  Architecture: x86
> > @@ -5396,3 +5411,77 @@ handling by KVM (as some KVM hypercall may be mistakenly treated as TLB
> >  flush hypercalls by Hyper-V) so userspace should disable KVM identification
> >  in CPUID and only exposes Hyper-V identification. In this case, guest
> >  thinks it's running on Hyper-V and only use Hyper-V hypercalls.
> > +
> > +8.22 KVM_CAP_DIRTY_LOG_RING
> > +
> > +Architectures: x86
> > +Parameters: args[0] - size of the dirty log ring
> > +
> > +KVM is capable of tracking dirty memory using ring buffers that are
> > +mmaped into userspace; there is one dirty ring per vcpu.
> > +
> > +One dirty ring is defined as below internally:
> > +
> > +struct kvm_dirty_ring {
> > +        u32 dirty_index;
> > +        u32 reset_index;
> > +        u32 size;
> > +        u32 soft_limit;
> > +        struct kvm_dirty_gfn *dirty_gfns;
> > +        struct kvm_dirty_ring_indices *indices;
> > +        int index;
> > +};
> > +
> > +Dirty GFNs (Guest Frame Numbers) are stored in the dirty_gfns array.
> > +For each of the dirty entry it's defined as:
> > +
> > +struct kvm_dirty_gfn {
> > +        __u32 pad;
> 
> How about sticking a length here?
> This way huge pages can be dirtied in one go.

Not just huge pages, but any contiguous range of dirty pages could be
reported far more concisely. Thanks,

Alex
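
To make that suggestion concrete, one way to express it is to let each
ring entry describe a run of pages rather than a single GFN. The layout
below is only an illustrative sketch: the slot and offset fields are
assumed from the rest of the posted entry, and nr_pages is the
hypothetical addition under discussion, not part of this series' uAPI.

struct kvm_dirty_gfn {
        __u32 pad;
        __u32 slot;      /* assumed: (address space id << 16) | memslot id */
        __u64 offset;    /* assumed: first dirty GFN within the slot */
        __u64 nr_pages;  /* hypothetical: length of the dirty run, in pages */
};

With a run length, a dirty 2MB huge page could be reported as a single
entry with nr_pages = 512, and KVM could likewise merge neighbouring
dirty GFNs in the same slot instead of logging them one entry at a time.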
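
Separately, on the mmap layout described in the quoted documentation:
the per-vcpu ring is reached by mmap()ing the vcpu file descriptor at
the advertised page offset. A minimal userspace sketch, assuming
KVM_CAP_DIRTY_LOG_RING has already been enabled and that ring_bytes is
the ring size derived from the args[0] value used when enabling it:

#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/kvm.h>  /* assumed to pull in KVM_DIRTY_LOG_PAGE_OFFSET
                           from the series' uapi headers */

/* Map the per-vcpu dirty ring described in section 8.22 above;
   error handling is omitted for brevity. */
static void *map_dirty_ring(int vcpu_fd, size_t ring_bytes)
{
        long page_size = sysconf(_SC_PAGESIZE);

        return mmap(NULL, ring_bytes, PROT_READ | PROT_WRITE, MAP_SHARED,
                    vcpu_fd, (off_t)KVM_DIRTY_LOG_PAGE_OFFSET * page_size);
}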