On Mon, Apr 13, 2015 at 09:45:39AM +0800, Xiao Guangrong wrote:
>
>
>On 04/11/2015 02:05 AM, Andres Lagar-Cavilla wrote:
>>On Fri, Apr 3, 2015 at 12:40 AM, Wanpeng Li <wanpeng.li@xxxxxxxxxxxxxxx> wrote:
>>>
>>>There are two scenarios that require collapsing small sptes into
>>>large sptes.
>>>- Dirty logging tracks sptes at 4k granularity, so large sptes are
>>>  split. The large sptes will be reallocated in the destination machine,
>>>  and the guest in the source machine will be destroyed when live
>>>  migration succeeds. However, the guest in the source machine will
>>>  continue to run if live migration fails for some reason, and its sptes
>>>  stay small, which leads to bad performance.
>>>- Our customers write tools that track the dirty speed of guests via the
>>>  EPT D bit/PML in order to determine the most appropriate one to live
>>>  migrate; however, the sptes still stay small after tracking the dirty
>>>  speed.
>>>
>>>This patch introduces lazy collapsing of small sptes into large sptes:
>>>the memory region is scanned in the ioctl context when dirty logging is
>>>stopped, the sptes which can be collapsed into large pages are dropped
>>>during the scan, and it depends on the later #PF to reallocate all the
>>>large sptes.
>>>
>>>Reviewed-by: Xiao Guangrong <guangrong.xiao@xxxxxxxxxxxxxxx>
>>>Signed-off-by: Wanpeng Li <wanpeng.li@xxxxxxxxxxxxxxx>
>>
>>Hi, apologies for the late review (vacation), but I wanted to bring
>>attention to a few matters:
>
>No problem, your comments are really valuable to us. :)
>
>>
>>>
>>>---
>>>v2 -> v3:
>>> * update comments
>>> * fix infinite for loop
>>>v1 -> v2:
>>> * use 'bool' instead of 'int'
>>> * add more comments
>>> * fix being unable to get the next spte after dropping the current spte
>>>
>>> arch/x86/include/asm/kvm_host.h |  2 ++
>>> arch/x86/kvm/mmu.c              | 73 +++++++++++++++++++++++++++++++++++++++++
>>> arch/x86/kvm/x86.c              | 19 +++++++++++
>>> 3 files changed, 94 insertions(+)
>>>
>>>diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>>>index 30b28dc..91b5bdb 100644
>>>--- a/arch/x86/include/asm/kvm_host.h
>>>+++ b/arch/x86/include/asm/kvm_host.h
>>>@@ -854,6 +854,8 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
>>> void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
>>> void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
>>> 				      struct kvm_memory_slot *memslot);
>>>+void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
>>>+				   struct kvm_memory_slot *memslot);
>>> void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
>>> 				   struct kvm_memory_slot *memslot);
>>> void kvm_mmu_slot_largepage_remove_write_access(struct kvm *kvm,
>>>diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>>>index cee7592..ba002a0 100644
>>>--- a/arch/x86/kvm/mmu.c
>>>+++ b/arch/x86/kvm/mmu.c
>>>@@ -4465,6 +4465,79 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
>>> 	kvm_flush_remote_tlbs(kvm);
>>> }
>>>
>>>+static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
>>>+					 unsigned long *rmapp)
>>>+{
>>>+	u64 *sptep;
>>>+	struct rmap_iterator iter;
>>>+	int need_tlb_flush = 0;
>>>+	pfn_t pfn;
>>>+	struct kvm_mmu_page *sp;
>>>+
>>>+	for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
>>>+		BUG_ON(!(*sptep & PT_PRESENT_MASK));
>>>+
>>>+		sp = page_header(__pa(sptep));
>>>+		pfn = spte_to_pfn(*sptep);
>>>+
>>>+		/*
>>>+		 * Let's support EPT only for now; we still need to figure
>>>+		 * out an efficient way to make this code aware of the
>>>+		 * mapping level used in the guest.
>>>+		 */
>>>+		if (sp->role.direct &&
>>>+		    !kvm_is_reserved_pfn(pfn) &&
>>>+		    PageTransCompound(pfn_to_page(pfn))) {
>>
>>Not your fault, but PageTransCompound is rather unfortunate naming, as it
>>also yields true for PageHuge. Suggestion: document that this check covers
>>static hugetlbfs, or switch to a PageCompound() check.
>>
>>A slightly bolder approach would be to refactor and reuse the nearly
>>identical check done in transparent_hugepage_adjust, instead of
>>open-coding it here. In essence this code is asking for the same check,
>>plus the out-of-band check for static hugepages.
>
>I agree.
>
>>
>>
>>>+			drop_spte(kvm, sptep);
>>>+			sptep = rmap_get_first(*rmapp, &iter);
>>>+			need_tlb_flush = 1;
>>>+		} else
>>>+			sptep = rmap_get_next(&iter);
>>>+	}
>>>+
>>>+	return need_tlb_flush;
>>>+}
>>>+
>>>+void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
>>>+				   struct kvm_memory_slot *memslot)
>>>+{
>>>+	bool flush = false;
>>>+	unsigned long *rmapp;
>>>+	unsigned long last_index, index;
>>>+	gfn_t gfn_start, gfn_end;
>>>+
>>>+	spin_lock(&kvm->mmu_lock);
>>>+
>>>+	gfn_start = memslot->base_gfn;
>>>+	gfn_end = memslot->base_gfn + memslot->npages - 1;
>>>+
>>>+	if (gfn_start >= gfn_end)
>>>+		goto out;
>>
>>I don't understand the value of this check here. Are we looking for a
>>broken memslot? Shouldn't this be a BUG_ON? Is this the place to care
>>about these things? npages is capped to KVM_MEM_MAX_NR_PAGES, i.e.
>>2^31. A 64-bit overflow would require a gigantic gfn_start, which
>>would be trouble in many other ways.
>>
>>All this to say: please remove the above 5 lines and make the code simpler.
>
>Yes, this check is unnecessary indeed.
>
>>
>>>+
>>>+	rmapp = memslot->arch.rmap[0];
>>>+	last_index = gfn_to_index(gfn_end, memslot->base_gfn,
>>>+				  PT_PAGE_TABLE_LEVEL);
>>>+
>>>+	for (index = 0; index <= last_index; ++index, ++rmapp) {
>>
>>One could argue that the cleaner iteration should be over the gfn
>>space covered by the memslot, thus leaving the gfn <--> rmap <--> spte
>>interactions hidden under the hood of __gfn_to_rmap. That yields much
>>cleaner (IMHO) code:
>>
>>    for (gfn = memslot->base_gfn; gfn <= memslot->base_gfn +
>>memslot->npages; gfn++) {
>>        flush |= kvm_mmu_zap_collapsible_spte(kvm, __gfn_to_rmap(gfn,
>>1, memslot));
>>        ....
>>
>>Now you can also get rid of index, last_index and rmapp. And, more
>>importantly, the code is more understandable and follows the pattern
>>established in x86/kvm/mmu.
>
>I do not have a strong opinion on it. The current code also uses this
>style; please refer to kvm_mmu_slot_remove_write_access().
>
>>
>>>+		if (*rmapp)
>>>+			flush |= kvm_mmu_zap_collapsible_spte(kvm, rmapp);
>>>+
>>>+		if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
>>>+			if (flush) {
>>>+				kvm_flush_remote_tlbs(kvm);
>>>+				flush = false;
>>>+			}
>>>+			cond_resched_lock(&kvm->mmu_lock);
>>
>>Relinquishing this spinlock is problematic, because
>>commit_memory_region has not gotten around to removing write
>>protection. Are you certain no new write-protected PTEs will be
>>inserted by a racing fault that sneaks in while the spinlock is
>>relinquished?
>>
>
>I do not clearly see the problem: new spte creation is based on the host
>mapping, i.e. a huge mapping in the shadow page table will be synced with
>the huge mapping on the host. Could you please detail the problem?
>
>Wanpeng, could you please post a patch to address Andres's comments?

Yeah, I will post a patch based on this one after the merge window. :)
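For the iteration, here is a rough, untested sketch of the gfn-based walk
Andres suggested that I have in mind for the respin. It reuses
__gfn_to_rmap() and keeps this patch's flush/resched handling; the final
flush and unlock are my guess at the unquoted tail of the current version:

void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
				   struct kvm_memory_slot *memslot)
{
	bool flush = false;
	gfn_t gfn;

	spin_lock(&kvm->mmu_lock);

	/* walk every gfn in the slot; '<' keeps us inside the slot */
	for (gfn = memslot->base_gfn;
	     gfn < memslot->base_gfn + memslot->npages; gfn++) {
		/* only the 4k (level 1) rmap holds sptes we may collapse */
		unsigned long *rmapp = __gfn_to_rmap(gfn, PT_PAGE_TABLE_LEVEL,
						     memslot);

		if (*rmapp)
			flush |= kvm_mmu_zap_collapsible_spte(kvm, rmapp);

		/* flush before dropping mmu_lock so no stale spte survives */
		if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
			if (flush) {
				kvm_flush_remote_tlbs(kvm);
				flush = false;
			}
			cond_resched_lock(&kvm->mmu_lock);
		}
	}

	if (flush)
		kvm_flush_remote_tlbs(kvm);

	spin_unlock(&kvm->mmu_lock);
}

This also drops index, last_index and rmapp, as well as the
gfn_start/gfn_end check, as suggested above.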
Regards,
Wanpeng Li