> -----Original Message----- > From: Peter Xu [mailto:peterx@xxxxxxxxxx] > Sent: Friday, February 21, 2020 3:17 AM > To: Zhoujian (jay) <jianjay.zhou@xxxxxxxxxx> > Cc: kvm@xxxxxxxxxxxxxxx; pbonzini@xxxxxxxxxx; wangxin (U) > <wangxinxin.wang@xxxxxxxxxx>; Huangweidong (C) > <weidong.huang@xxxxxxxxxx>; sean.j.christopherson@xxxxxxxxx; Liujinsong > (Paul) <liu.jinsong@xxxxxxxxxx> > Subject: Re: [PATCH v2] KVM: x86: enable dirty log gradually in small chunks > > On Thu, Feb 20, 2020 at 12:28:28PM +0800, Jay Zhou wrote: > > It could take kvm->mmu_lock for an extended period of time when > > enabling dirty log for the first time. The main cost is to clear all > > the D-bits of last level SPTEs. This situation can benefit from manual > > dirty log protect as well, which can reduce the mmu_lock time taken. > > The sequence is like this: > > > > 1. Initialize all the bits of the dirty bitmap to 1 when enabling > > dirty log for the first time > > 2. Only write protect the huge pages > > 3. KVM_GET_DIRTY_LOG returns the dirty bitmap info 4. > > KVM_CLEAR_DIRTY_LOG will clear D-bit for each of the leaf level > > SPTEs gradually in small chunks > > > > Under the Intel(R) Xeon(R) Gold 6152 CPU @ 2.10GHz environment, I did > > some tests with a 128G windows VM and counted the time taken of > > memory_global_dirty_log_start, here is the numbers: > > > > VM Size Before After optimization > > 128G 460ms 10ms > > > > Signed-off-by: Jay Zhou <jianjay.zhou@xxxxxxxxxx> > > --- > > v2: > > * add new bit to KVM_ENABLE_CAP for > KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 [Paolo] > > * support non-PML path [Peter] > > * delete the unnecessary ifdef and make the initialization of bitmap > > more clear [Sean] > > * document the new bits and tweak the testcase > > > > Documentation/virt/kvm/api.rst | 23 > +++++++++++++++++------ > > arch/x86/kvm/mmu/mmu.c | 8 ++++++-- > > arch/x86/kvm/vmx/vmx.c | 3 ++- > > include/linux/kvm_host.h | 7 ++++++- > > tools/testing/selftests/kvm/dirty_log_test.c | 3 ++- > > virt/kvm/kvm_main.c | 25 > ++++++++++++++++++------- > > 6 files changed, 51 insertions(+), 18 deletions(-) > > > > diff --git a/Documentation/virt/kvm/api.rst > > b/Documentation/virt/kvm/api.rst index 97a72a5..1afd310 100644 > > --- a/Documentation/virt/kvm/api.rst > > +++ b/Documentation/virt/kvm/api.rst > > @@ -1267,9 +1267,11 @@ pages in the host. > > The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and > > KVM_MEM_READONLY. The former can be set to instruct KVM to keep > track > > of writes to memory within the slot. See KVM_GET_DIRTY_LOG ioctl to > > know how to -use it. The latter can be set, if KVM_CAP_READONLY_MEM > > capability allows it, -to make a new slot read-only. In this case, > > writes to this memory will be -posted to userspace as KVM_EXIT_MMIO > exits. > > +use it. It will be different if the KVM_DIRTY_LOG_INITIALLY_SET flag > > +of > > +KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 is set. For more information, > see > > +the description of the capability. The latter can be set, if > > +KVM_CAP_READONLY_MEM capability allows it, to make a new slot > > +read-only. In this case, writes to this memory will be posted to userspace > as KVM_EXIT_MMIO exits. > > Not sure about others, but my own preference is that we keep this part > untouched... The changed document could be even more confusing to me... Just want to mention something different happened for the other code logic if KVM_DIRTY_LOG_INITIALLY_SET flag is set, but I have to admit that the "difference" is not described clearly here. It is described at [1] below, so keep this part untouched maybe is a choice. > > > > When the KVM_CAP_SYNC_MMU capability is available, changes in the > > backing of the memory region are automatically reflected into the > > guest. For example, an @@ -5704,10 +5706,19 @@ and injected > exceptions. > > :Architectures: x86, arm, arm64, mips > > :Parameters: args[0] whether feature should be enabled or not > > > > -With this capability enabled, KVM_GET_DIRTY_LOG will not > > automatically > > +Valid flags are:: > > + > > + #define KVM_DIRTY_LOG_MANUAL_PROTECT (1 << 0) #define > > + KVM_DIRTY_LOG_INITIALLY_SET (1 << 1) > > + > > +With KVM_DIRTY_LOG_MANUAL_PROTECT set, KVM_GET_DIRTY_LOG will > not > > +automatically > > clear and write-protect all pages that are returned as dirty. > > Rather, userspace will have to do this operation separately using > > -KVM_CLEAR_DIRTY_LOG. > > +KVM_CLEAR_DIRTY_LOG. With KVM_DIRTY_LOG_INITIALLY_SET set, all > the > > +bits of the dirty bitmap will be initialized to 1 when created, dirty > > +logging will be enabled gradually in small chunks using > > +KVM_CLEAR_DIRTY_LOG ioctl. However, the > KVM_DIRTY_LOG_INITIALLY_SET > > +depends on KVM_DIRTY_LOG_MANUAL_PROTECT, it can not be set > individually and supports x86 only for now. [1] > > Need > s/KVM_DIRTY_LOG_MANUAL_PROTECT/KVM_DIRTY_LOG_MANUAL_PROTEC > T2/? Since the ioctl name is KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2, this naming replacement seems reasonable. If no one is against it, I'll take it in the next version. > > > > > At the cost of a slightly more complicated operation, this provides > > better scalability and responsiveness for two reasons. First, @@ > > -5716,7 +5727,7 @@ than requiring to sync a full memslot; this ensures > > that KVM does not take spinlocks for an extended period of time. > > Second, in some cases a large amount of time can pass between a call > > to KVM_GET_DIRTY_LOG and userspace actually using the data in the > > page. Pages can be modified -during this time, which is inefficint for both > the guest and userspace: > > +during this time, which is inefficient for both the guest and userspace: > > the guest will incur a higher penalty due to write protection faults, > > while userspace can see false reports of dirty pages. Manual > > reprotection helps reducing this time, improving guest performance > > and reducing the diff --git a/arch/x86/kvm/mmu/mmu.c > > b/arch/x86/kvm/mmu/mmu.c index 87e9ba2..f9c120e 100644 > > --- a/arch/x86/kvm/mmu/mmu.c > > +++ b/arch/x86/kvm/mmu/mmu.c > > @@ -5865,8 +5865,12 @@ void > kvm_mmu_slot_remove_write_access(struct kvm *kvm, > > bool flush; > > > > spin_lock(&kvm->mmu_lock); > > - flush = slot_handle_all_level(kvm, memslot, slot_rmap_write_protect, > > - false); > > + if (kvm->manual_dirty_log_protect & KVM_DIRTY_LOG_INITIALLY_SET) > > + flush = slot_handle_large_level(kvm, memslot, > > + slot_rmap_write_protect, false); > > + else > > + flush = slot_handle_all_level(kvm, memslot, > > + slot_rmap_write_protect, false); > > spin_unlock(&kvm->mmu_lock); > > > > /* > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index > > 3be25ec..fcc585a 100644 > > --- a/arch/x86/kvm/vmx/vmx.c > > +++ b/arch/x86/kvm/vmx/vmx.c > > @@ -7201,7 +7201,8 @@ static void vmx_sched_in(struct kvm_vcpu *vcpu, > > int cpu) static void vmx_slot_enable_log_dirty(struct kvm *kvm, > > struct kvm_memory_slot *slot) { > > - kvm_mmu_slot_leaf_clear_dirty(kvm, slot); > > + if (!(kvm->manual_dirty_log_protect & KVM_DIRTY_LOG_INITIALLY_SET)) > > + kvm_mmu_slot_leaf_clear_dirty(kvm, slot); > > kvm_mmu_slot_largepage_remove_write_access(kvm, slot); } > > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index > > e89eb67..a555b52 100644 > > --- a/include/linux/kvm_host.h > > +++ b/include/linux/kvm_host.h > > @@ -39,6 +39,11 @@ > > #define KVM_MAX_VCPU_ID KVM_MAX_VCPUS #endif > > > > +#define KVM_DIRTY_LOG_MANUAL_PROTECT (1 << 0) #define > > +KVM_DIRTY_LOG_INITIALLY_SET (1 << 1) #define > > +KVM_DIRTY_LOG_MANUAL_CAPS (KVM_DIRTY_LOG_MANUAL_PROTECT | > \ > > + KVM_DIRTY_LOG_INITIALLY_SET) > > + > > /* > > * The bit 16 ~ bit 31 of kvm_memory_region::flags are internally used > > * in kvm, other bits are visible for userspace which are defined in > > @@ -493,7 +498,7 @@ struct kvm { #endif > > long tlbs_dirty; > > struct list_head devices; > > - bool manual_dirty_log_protect; > > + u64 manual_dirty_log_protect; > > struct dentry *debugfs_dentry; > > struct kvm_stat_data **debugfs_stat_data; > > struct srcu_struct srcu; > > diff --git a/tools/testing/selftests/kvm/dirty_log_test.c > > b/tools/testing/selftests/kvm/dirty_log_test.c > > index 5614222..2a493c1 100644 > > --- a/tools/testing/selftests/kvm/dirty_log_test.c > > +++ b/tools/testing/selftests/kvm/dirty_log_test.c > > @@ -317,10 +317,11 @@ static void run_test(enum vm_guest_mode mode, > unsigned long iterations, > > host_bmap_track = bitmap_alloc(host_num_pages); > > > > #ifdef USE_CLEAR_DIRTY_LOG > > + int ret = kvm_check_cap(KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2); > > struct kvm_enable_cap cap = {}; > > > > cap.cap = KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2; > > - cap.args[0] = 1; > > + cap.args[0] = ret; > > You enabled the initial-all-set but didn't really check it, so it didn't help much > from the testcase pov... vm_enable_cap is called afterwards, the return value is checked inside it, may I ask this check is enough, or it is needed to get the value through something like vm_get_cap ? > I'd suggest you drop this change, and you can work > on top after this patch can be accepted. OK, some input parameters for cap.args[0] should be tested I think: 0, 1, 3 should be accepted, the other numbers will not. > > (Not to mention the original test actually verified that we don't break, which > seems good..) > > > vm_enable_cap(vm, &cap); > > #endif > > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index > > 70f03ce..f2631d0 100644 > > --- a/virt/kvm/kvm_main.c > > +++ b/virt/kvm/kvm_main.c > > @@ -862,7 +862,7 @@ static int kvm_vm_release(struct inode *inode, > struct file *filp) > > * Allocation size is twice as large as the actual dirty bitmap size. > > * See x86's kvm_vm_ioctl_get_dirty_log() why this is needed. > > */ > > -static int kvm_create_dirty_bitmap(struct kvm_memory_slot *memslot) > > +static int kvm_alloc_dirty_bitmap(struct kvm_memory_slot *memslot) > > This change seems irrelevant.. > > > { > > unsigned long dirty_bytes = 2 * kvm_dirty_bitmap_bytes(memslot); > > > > @@ -1094,8 +1094,11 @@ int __kvm_set_memory_region(struct kvm *kvm, > > > > /* Allocate page dirty bitmap if needed */ > > if ((new.flags & KVM_MEM_LOG_DIRTY_PAGES) && !new.dirty_bitmap) { > > - if (kvm_create_dirty_bitmap(&new) < 0) > > + if (kvm_alloc_dirty_bitmap(&new)) > > Same here. s/kvm_create_dirty_bitmap/kvm_alloc_dirty_bitmap is Sean's suggestion to make it clear that the helper is only responsible for allocation, then set all the bitmap bits to 1 using bitmap_set afterwards, which seems reasonable. Do you still think it's better to keep this name untouched? > > > goto out_free; > > + > > + if (kvm->manual_dirty_log_protect & > KVM_DIRTY_LOG_INITIALLY_SET) > > (Maybe time to introduce a helper to shorten this check. :) Yeah, but could this be done on top of this patch? > > > + bitmap_set(new.dirty_bitmap, 0, new.npages); > > } > > > > slots = kvzalloc(sizeof(struct kvm_memslots), GFP_KERNEL_ACCOUNT); > > @@ -1255,7 +1258,7 @@ int kvm_get_dirty_log_protect(struct kvm *kvm, > > > > n = kvm_dirty_bitmap_bytes(memslot); > > *flush = false; > > - if (kvm->manual_dirty_log_protect) { > > + if (kvm->manual_dirty_log_protect & > KVM_DIRTY_LOG_MANUAL_PROTECT) { > > Can also introduce a helper for this too. > > Side note: logically this line doesn't need a change, because bit 1 depends on bit > 0 so if (kvm->manual_dirty_log_protect) it means bit 0 must be set. Yes, I agree. > > > /* > > * Unlike kvm_get_dirty_log, we always return false in *flush, > > * because no flush is needed until KVM_CLEAR_DIRTY_LOG. > There @@ > > -3310,9 +3313,6 @@ static long > kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg) > > case KVM_CAP_IOEVENTFD_ANY_LENGTH: > > case KVM_CAP_CHECK_EXTENSION_VM: > > case KVM_CAP_ENABLE_CAP_VM: > > -#ifdef CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT > > - case KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2: > > -#endif > > return 1; > > #ifdef CONFIG_KVM_MMIO > > case KVM_CAP_COALESCED_MMIO: > > @@ -3320,6 +3320,10 @@ static long > kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg) > > case KVM_CAP_COALESCED_PIO: > > return 1; > > #endif > > +#ifdef CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT > > + case KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2: > > + return KVM_DIRTY_LOG_MANUAL_CAPS; > > +#endif > > #ifdef CONFIG_HAVE_KVM_IRQ_ROUTING > > case KVM_CAP_IRQ_ROUTING: > > return KVM_MAX_IRQ_ROUTES; > > @@ -3348,7 +3352,14 @@ static int > kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm, > > switch (cap->cap) { > > #ifdef CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT > > case KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2: > > - if (cap->flags || (cap->args[0] & ~1)) > > + if (cap->flags || > > + (cap->args[0] & ~KVM_DIRTY_LOG_MANUAL_CAPS) || > > + /* The capability of KVM_DIRTY_LOG_INITIALLY_SET depends > > + * on KVM_DIRTY_LOG_MANUAL_PROTECT, it should not be > > + * set individually > > + */ > > + ((cap->args[0] & KVM_DIRTY_LOG_MANUAL_CAPS) == > > + KVM_DIRTY_LOG_INITIALLY_SET)) > > How about something easier to read? :) > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index > 70f03ce0e5c1..9dfbab2a9929 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -3348,7 +3348,10 @@ static int > kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm, > switch (cap->cap) { > #ifdef CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT > case KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2: > - if (cap->flags || (cap->args[0] & ~1)) > + if (cap->flags || (cap->args[0] & ~3)) > + return -EINVAL; > + /* Allow 00, 01, and 11. */ > + if (cap->args[0] == KVM_DIRTY_LOG_INITIALLY_SET) > return -EINVAL; > kvm->manual_dirty_log_protect = cap->args[0]; > return 0; Looks simpler than mine. > Otherwise it looks good to me! Thanks! Regards, Jay Zhou > > Thanks, > > -- > Peter Xu