On Sun, Apr 12, 2015 at 6:45 PM, Xiao Guangrong
<guangrong.xiao@xxxxxxxxxxxxxxx> wrote:
>
>
> On 04/11/2015 02:05 AM, Andres Lagar-Cavilla wrote:
>>
>> On Fri, Apr 3, 2015 at 12:40 AM, Wanpeng Li <wanpeng.li@xxxxxxxxxxxxxxx> wrote:
>>>
>>> There are two scenarios that require collapsing small sptes back into
>>> large sptes.
>>> - Dirty logging tracks sptes at 4k granularity, so large sptes are
>>>   split. The large sptes will be reallocated in the destination machine
>>>   and the guest in the source machine will be destroyed when live
>>>   migration completes successfully. However, the guest in the source
>>>   machine continues to run if live migration fails for some reason, and
>>>   the sptes stay small, which leads to bad performance.
>>> - Our customers write tools that track the dirty speed of guests via the
>>>   EPT D bit/PML in order to determine the most appropriate guest to live
>>>   migrate; the sptes also stay small after the dirty speed tracking.
>>>
>>> This patch introduces lazy collapsing of small sptes into large sptes.
>>> The memory region is scanned in the ioctl context when dirty logging is
>>> stopped; the sptes that can be collapsed into large pages are dropped
>>> during the scan, and it relies on later #PFs to reallocate all the
>>> large sptes.
>>>
>>> Reviewed-by: Xiao Guangrong <guangrong.xiao@xxxxxxxxxxxxxxx>
>>> Signed-off-by: Wanpeng Li <wanpeng.li@xxxxxxxxxxxxxxx>
>>
>> Hi, apologies for late review (vacation), but wanted to bring
>> attention to a few matters:
>
> No problem, your comments are really valuable to us. :)

Glad to know, thanks.

>>> ---
>>> v2 -> v3:
>>>  * update comments
>>>  * fix infinite for loop
>>> v1 -> v2:
>>>  * use 'bool' instead of 'int'
>>>  * add more comments
>>>  * fix failure to get the next spte after dropping the current spte
>>>
>>>  arch/x86/include/asm/kvm_host.h |  2 ++
>>>  arch/x86/kvm/mmu.c              | 73 +++++++++++++++++++++++++++++++++++++++++
>>>  arch/x86/kvm/x86.c              | 19 +++++++++++
>>>  3 files changed, 94 insertions(+)
>>>
>>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>>> index 30b28dc..91b5bdb 100644
>>> --- a/arch/x86/include/asm/kvm_host.h
>>> +++ b/arch/x86/include/asm/kvm_host.h
>>> @@ -854,6 +854,8 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
>>>  void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
>>>  void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
>>>                                        struct kvm_memory_slot *memslot);
>>> +void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
>>> +                                   struct kvm_memory_slot *memslot);
>>>  void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
>>>                                     struct kvm_memory_slot *memslot);
>>>  void kvm_mmu_slot_largepage_remove_write_access(struct kvm *kvm,
>>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>>> index cee7592..ba002a0 100644
>>> --- a/arch/x86/kvm/mmu.c
>>> +++ b/arch/x86/kvm/mmu.c
>>> @@ -4465,6 +4465,79 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
>>>         kvm_flush_remote_tlbs(kvm);
>>>  }
>>>
>>> +static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
>>> +                                         unsigned long *rmapp)
>>> +{
>>> +       u64 *sptep;
>>> +       struct rmap_iterator iter;
>>> +       int need_tlb_flush = 0;
>>> +       pfn_t pfn;
>>> +       struct kvm_mmu_page *sp;
>>> +
>>> +       for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
>>> +               BUG_ON(!(*sptep & PT_PRESENT_MASK));
>>> +
>>> +               sp = page_header(__pa(sptep));
>>> +               pfn = spte_to_pfn(*sptep);
>>> +
>>> +               /*
>>> +                * Let's support EPT only for now; we still need to
>>> +                * figure out an efficient way to make this code aware
>>> +                * of what mapping level the guest uses.
>>> +                */
>>> +               if (sp->role.direct &&
>>> +                       !kvm_is_reserved_pfn(pfn) &&
>>> +                       PageTransCompound(pfn_to_page(pfn))) {
>>
>> Not your fault, but PageTransCompound is very unhappy naming, as it
>> also yields true for PageHuge. Suggestion: document this check covers
>> static hugetlbfs, or switch to PageCompound() check.
>>
>> A slightly bolder approach would be to refactor and reuse the nearly
>> identical check done in transparent_hugepage_adjust, instead of
>> open-coding here. In essence this code is asking for the same check,
>> plus the out-of-band check for static hugepages.
>
> I agree.
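
To make that concrete, something along these lines is what I had in
mind. Completely untested, and the helper name is invented here:

static bool host_pfn_is_huge_mapped(pfn_t pfn)
{
        /*
         * PageCompound() covers both THP and static hugetlbfs pages,
         * i.e. "is this pfn backed by a host large page", which is the
         * question this path really wants answered.
         */
        return !kvm_is_reserved_pfn(pfn) &&
               PageCompound(pfn_to_page(pfn));
}

Then the zap-collapsible loop and the THP adjustment path can share a
single helper, and the hugetlbfs subtlety is documented in exactly one
place.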

>>
>>> +                       drop_spte(kvm, sptep);
>>> +                       sptep = rmap_get_first(*rmapp, &iter);
>>> +                       need_tlb_flush = 1;
>>> +               } else
>>> +                       sptep = rmap_get_next(&iter);
>>> +       }
>>> +
>>> +       return need_tlb_flush;
>>> +}
>>> +
>>> +void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
>>> +                                   struct kvm_memory_slot *memslot)
>>> +{
>>> +       bool flush = false;
>>> +       unsigned long *rmapp;
>>> +       unsigned long last_index, index;
>>> +       gfn_t gfn_start, gfn_end;
>>> +
>>> +       spin_lock(&kvm->mmu_lock);
>>> +
>>> +       gfn_start = memslot->base_gfn;
>>> +       gfn_end = memslot->base_gfn + memslot->npages - 1;
>>> +
>>> +       if (gfn_start >= gfn_end)
>>> +               goto out;
>>
>> I don't understand the value of this check here. Are we looking for a
>> broken memslot? Shouldn't this be a BUG_ON? Is this the place to care
>> about these things? npages is capped to KVM_MEM_MAX_NR_PAGES, i.e.
>> 2^31. A 64 bit overflow would be caused by a gigantic gfn_start which
>> would be trouble in many other ways.
>>
>> All this to say: please remove the above 5 lines and make code simpler.
>
> Yes, this check is unnecessary indeed.
>
>>> +
>>> +       rmapp = memslot->arch.rmap[0];
>>> +       last_index = gfn_to_index(gfn_end, memslot->base_gfn,
>>> +                                 PT_PAGE_TABLE_LEVEL);
>>> +
>>> +       for (index = 0; index <= last_index; ++index, ++rmapp) {
>>
>> One could argue that the cleaner iteration should be over the gfn
>> space covered by the memslot, thus leaving the gfn <--> rmap <--> spte
>> interactions hidden under the hood of __gfn_to_rmap. That yields much
>> cleaner (IMHO) code:
>>
>>         for (gfn = memslot->base_gfn;
>>              gfn <= memslot->base_gfn + memslot->npages; gfn++) {
>>                 flush |= kvm_mmu_zap_collapsible_spte(kvm,
>>                                 __gfn_to_rmap(gfn, 1, memslot));
>>         ....
>>
>> Now you can also get rid of index, last_index and rmapp. And more
>> importantly, the code is more understandable, and follows the pattern
>> established in x86/kvm/mmu.
>
> I do not have a strong opinion on it. The current code also has this
> style; please refer to kvm_mmu_slot_remove_write_access().

Ugh. You're right. Not your fight to pick, I guess.

>>
>>> +               if (*rmapp)
>>> +                       flush |= kvm_mmu_zap_collapsible_spte(kvm, rmapp);
>>> +
>>> +               if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
>>> +                       if (flush) {
>>> +                               kvm_flush_remote_tlbs(kvm);
>>> +                               flush = false;
>>> +                       }
>>> +                       cond_resched_lock(&kvm->mmu_lock);
>>
>> Relinquishing this spinlock is problematic, because
>> commit_memory_region has not gotten around to removing write
>> protection. Are you certain no new write-protected PTEs will be
>> inserted by a racing fault that sneaks in while the spinlock is
>> relinquished?
>>
>
> I do not know the problem clearly. New spte creation is based on the
> host mapping, i.e., a huge mapping in the shadow page table is synced
> with the huge mapping on the host. Could you please detail the problem?
>

It seems that, prior to calling commit_memory_region, the new memslot
with dirty_bitmap = NULL is made effective. This will ensure
force_pt_level is false in tdp_page_fault, and huge mappings are
observed. So all is well and I was fud'ing.
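
For reference, the ordering I am relying on is that
__kvm_set_memory_region() installs the new memslots before it calls
kvm_arch_commit_memory_region(), and the new hook sits in the latter.
The x86.c hunk is not quoted above, so the call site below is
paraphrased from the patch description rather than copied from the
diff:

        /*
         * In kvm_arch_commit_memory_region(), roughly: collapse only
         * when dirty logging was just switched off for a slot that
         * still exists.
         */
        if ((change != KVM_MR_DELETE) &&
            (old->flags & KVM_MEM_LOG_DIRTY_PAGES) &&
            !(new->flags & KVM_MEM_LOG_DIRTY_PAGES))
                kvm_mmu_zap_collapsible_sptes(kvm, new);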

N.B.: software shadow mode?

> Wanpeng, could you please post a patch to address Andres's comments?
>

Awesome, thanks

Andres

-- 
Andres Lagar-Cavilla | Google Kernel Team | andreslc@xxxxxxxxxx
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html