RE: RFC: Split EPT huge pages in advance of dirty logging

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




> -----Original Message-----
> From: Peter Xu [mailto:peterx@xxxxxxxxxx]
> Sent: Friday, February 21, 2020 2:17 AM
> To: Ben Gardon <bgardon@xxxxxxxxxx>
> Cc: Zhoujian (jay) <jianjay.zhou@xxxxxxxxxx>; Junaid Shahid
> <junaids@xxxxxxxxxx>; kvm@xxxxxxxxxxxxxxx; qemu-devel@xxxxxxxxxx;
> pbonzini@xxxxxxxxxx; dgilbert@xxxxxxxxxx; quintela@xxxxxxxxxx;
> Liujinsong (Paul) <liu.jinsong@xxxxxxxxxx>; linfeng (M)
> <linfeng23@xxxxxxxxxx>; wangxin (U) <wangxinxin.wang@xxxxxxxxxx>;
> Huangweidong (C) <weidong.huang@xxxxxxxxxx>
> Subject: Re: RFC: Split EPT huge pages in advance of dirty logging
> 
> On Thu, Feb 20, 2020 at 09:34:52AM -0800, Ben Gardon wrote:
> > On Thu, Feb 20, 2020 at 5:53 AM Zhoujian (jay) <jianjay.zhou@xxxxxxxxxx>
> wrote:
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Peter Xu [mailto:peterx@xxxxxxxxxx]
> > > > Sent: Thursday, February 20, 2020 1:19 AM
> > > > To: Zhoujian (jay) <jianjay.zhou@xxxxxxxxxx>
> > > > Cc: kvm@xxxxxxxxxxxxxxx; qemu-devel@xxxxxxxxxx;
> > > > pbonzini@xxxxxxxxxx; dgilbert@xxxxxxxxxx; quintela@xxxxxxxxxx;
> > > > Liujinsong (Paul) <liu.jinsong@xxxxxxxxxx>; linfeng (M)
> > > > <linfeng23@xxxxxxxxxx>; wangxin (U)
> <wangxinxin.wang@xxxxxxxxxx>;
> > > > Huangweidong (C) <weidong.huang@xxxxxxxxxx>
> > > > Subject: Re: RFC: Split EPT huge pages in advance of dirty logging
> > > >
> > > > On Wed, Feb 19, 2020 at 01:19:08PM +0000, Zhoujian (jay) wrote:
> > > > > Hi Peter,
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Peter Xu [mailto:peterx@xxxxxxxxxx]
> > > > > > Sent: Wednesday, February 19, 2020 1:43 AM
> > > > > > To: Zhoujian (jay) <jianjay.zhou@xxxxxxxxxx>
> > > > > > Cc: kvm@xxxxxxxxxxxxxxx; qemu-devel@xxxxxxxxxx;
> > > > pbonzini@xxxxxxxxxx;
> > > > > > dgilbert@xxxxxxxxxx; quintela@xxxxxxxxxx; Liujinsong (Paul)
> > > > > > <liu.jinsong@xxxxxxxxxx>; linfeng (M) <linfeng23@xxxxxxxxxx>;
> > > > > > wangxin (U) <wangxinxin.wang@xxxxxxxxxx>; Huangweidong (C)
> > > > > > <weidong.huang@xxxxxxxxxx>
> > > > > > Subject: Re: RFC: Split EPT huge pages in advance of dirty
> > > > > > logging
> > > > > >
> > > > > > On Tue, Feb 18, 2020 at 01:13:47PM +0000, Zhoujian (jay) wrote:
> > > > > > > Hi all,
> > > > > > >
> > > > > > > We found that the guest will be soft-lockup occasionally
> > > > > > > when live migrating a 60 vCPU, 512GiB huge page and memory
> > > > > > > sensitive VM. The reason is clear, almost all of the vCPUs
> > > > > > > are waiting for the KVM MMU spin-lock to create 4K SPTEs
> > > > > > > when the huge pages are write protected. This
> > > > > > phenomenon is also described in this patch set:
> > > > > > > https://patchwork.kernel.org/cover/11163459/
> > > > > > > which aims to handle page faults in parallel more efficiently.
> > > > > > >
> > > > > > > Our idea is to use the migration thread to touch all of the
> > > > > > > guest memory in the granularity of 4K before enabling dirty
> > > > > > > logging. To be more specific, we split all the PDPE_LEVEL
> > > > > > > SPTEs into DIRECTORY_LEVEL SPTEs as the first step, and then
> > > > > > > split all the DIRECTORY_LEVEL SPTEs into
> > > > > > PAGE_TABLE_LEVEL SPTEs as the following step.
> > > > > >
> > > > > > IIUC, QEMU will prefer to use huge pages for all the anonymous
> > > > > > ramblocks (please refer to ram_block_add):
> > > > > >
> > > > > >         qemu_madvise(new_block->host, new_block->max_length,
> > > > > > QEMU_MADV_HUGEPAGE);
> > > > >
> > > > > Yes, you're right
> > > > >
> > > > > >
> > > > > > Another alternative I can think of is to add an extra
> > > > > > parameter to QEMU to explicitly disable huge pages (so that
> > > > > > can even be MADV_NOHUGEPAGE instead of MADV_HUGEPAGE).
> > > > > > However that
> > > > should also
> > > > > > drag down the performance for the whole lifecycle of the VM.
> > > > >
> > > > > From the performance point of view, it is better to keep the
> > > > > huge pages when the VM is not in the live migration state.
> > > > >
> > > > > > A 3rd option is to make a QMP
> > > > > > command to dynamically turn huge pages on/off for ramblocks
> globally.
> > > > >
> > > > > We're searching a dynamic method too.
> > > > > We plan to add two new flags for each memory slot, say
> > > > > KVM_MEM_FORCE_PT_DIRECTORY_PAGES and
> > > > > KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES. These flags can be set
> > > > > through KVM_SET_USER_MEMORY_REGION ioctl.
> 
> [1]
> 
> > > > >
> > > > > The mapping_level which is called by tdp_page_fault in the
> > > > > kernel side will return PT_DIRECTORY_LEVEL if the
> > > > KVM_MEM_FORCE_PT_DIRECTORY_PAGES
> > > > > flag of the memory slot is set, and return PT_PAGE_TABLE_LEVEL
> > > > > if the KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES flag is set.
> > > > >
> > > > > The key steps to split the huge pages in advance of enabling
> > > > > dirty log is as follows:
> > > > > 1. The migration thread in user space uses
> > > > KVM_SET_USER_MEMORY_REGION
> > > > > ioctl to set the KVM_MEM_FORCE_PT_DIRECTORY_PAGES flag for each
> > > > memory
> > > > > slot.
> > > > > 2. The migration thread continues to use the
> > > > > KVM_SPLIT_HUGE_PAGES ioctl (which is newly added) to do the
> > > > > splitting of large pages in the kernel side.
> > > > > 3. A new vCPU is created temporally(do some initialization but
> > > > > will not
> > > > > run) to help to do the work, i.e. as the parameter of the tdp_page_fault.
> > > > > 4. Collect the GPA ranges of all the memory slots with the
> > > > > KVM_MEM_FORCE_PT_DIRECTORY_PAGES flag set.
> > > > > 5. Split the 1G huge pages(collected in step 4) into 2M by
> > > > > calling tdp_page_fault, since the mapping_level will return
> > > > > PT_DIRECTORY_LEVEL. Here is the main difference from the usual
> > > > > path which is caused by the Guest side(EPT violation/misconfig
> > > > > etc), we call it directly in the hypervisor side.
> > > > > 6. Do some cleanups, i.e. free the vCPU related resources 7. The
> > > > > KVM_SPLIT_HUGE_PAGES ioctl returned to the user space side.
> > > > > 8. Using KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES instread of
> > > > > KVM_MEM_FORCE_PT_DIRECTORY_PAGES to repeat step 1 ~ step 7, in
> > > > > step
> > > > 5
> > > > > the 2M huge pages will be splitted into 4K pages.
> > > > > 9. Clear the KVM_MEM_FORCE_PT_DIRECTORY_PAGES and
> > > > > KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES flags for each memory
> slot.
> > > > > 10. Then the migration thread calls the log_start ioctl to
> > > > > enable the dirty logging, and the remaining thing is the same.
> > > >
> > > > I'm not sure... I think it would be good if there is a way to have
> > > > finer granularity control on using huge pages for any process,
> > > > then KVM can directly leverage that because KVM page tables should
> > > > always respect the mm configurations on these (so e.g. when huge page
> split, KVM gets notifications via mmu notifiers).
> > > > Have you thought of such a more general way?
> > >
> > > I did have thought of this, if we split the huge pages into 4K of a
> > > process, I'm afraid it will not be workable for the huge pages
> > > sharing scenario, e.g. DPDK, SPDK etc. So, only split the EPT page
> > > table and keep the VM process page table (e.g. qemu) untouched is the
> goal.
> 
> Ah I see your point now.
> 
> > >
> > > >
> > > > (And I just noticed that MADV_NOHUGEPAGE is only a hint to
> > > > khugepaged and probably won't split any huge page at all after
> > > > madvise() returns..) To tell the truth I'm still confused on how
> > > > split of huge pages helped in your case...
> > >
> > > I'm sorry if the meaning is not expressed clearly, and thanks for your
> patience.
> > >
> > > > If I read it right the test reduced some execution time from 9s to
> > > > a few ms after your splittion of huge pages.
> > >
> > > Yes
> > >
> > > > The thing is I don't see how split of huge pages could solve the
> > > > mmu_lock contention with the huge VM, because IMO even if we split
> > > > the huge pages into smaller ones, those pages should still be
> > > > write-protected and need merely the same number of page faults to
> > > > resolve when accessed/written? And I thought that should only be
> > > > fixed with solutions like what Ben has proposed with the MMU
> > > > rework. Could you show me what I've missed?
> > >
> > > Let me try to describe the reason of mmu_lock contention more
> > > clearly and the effort we tried to do...
> > > The huge VM only has EPT >= level 2 sptes, and level 1 sptes don't
> > > exist at the beginning. Write protect all the huge pages will
> > > trigger EPT violation to create level 1 sptes for all the vCPUs
> > > which want to write the content of the memory. Different vCPU write
> > > the different areas of the memory, but they need the same
> > > kvm->mmu_lock to create the level 1 sptes, this situation will be
> > > worse if the number of vCPU and the memory of VM is large(in our
> > > case 60U512G), meanwhile the VM has memory-write-intensive work to
> > > do. In order to reduce the mmu_lock contention, we try to: write
> > > protect VM memory gradually in small chunks, such as 1G or 2M. Using
> > > a vCPU temporary creately by migration thread to split 1G to 2M as
> > > the first step, and to split 2M to 4K as the second step (this is a
> > > little hacking...and I do not know any side effect will be triggered indeed).
> > > Comparing to write protect all VM memory in one go, the write
> > > protected range is limited in this way and only the vCPUs write this
> > > limited range will be involved to take the mmu_lock. The contention
> > > will be reduced since the memory range is small and the number of
> > > vCPU involved is small too.
> > >
> > > Of course, it will take some extra time to split all the huge pages
> > > into 4K page before the real migration started, about 60s for 512G in my
> experiment.
> > >
> > > During the memory iterative copy phase, PML will do the dirty
> > > logging work (not write protected case for 4K), or IIRC using
> > > fast_page_fault to mark page dirty if PML is not supported, which case the
> mmu_lock does not needed.
> 
> Yes I missed both of these.  Thanks for explaining!
> 
> Then it makes sense at least to me with your idea. Though instead of the
> KVM_MEM_FORCE_PT_* naming [1], we can also embed allowed page sizes
> for the memslot into the flags using a few bits, with another new kvm cap.

Thanks for this suggestion.

> > >
> > > Regards,
> > > Jay Zhou
> >
> > (Ah I top-posted I'm sorry. Re-sending at the bottom.)
> >
> > FWIW, we currently do this eager splitting at Google for live
> > migration. When the log-dirty-memory flag is set on a memslot we
> > eagerly split all pages in the slot down to 4k granularity.
> > As Jay said, this does not cause crippling lock contention because the
> > vCPU page faults generated by write protection / splitting can be
> > resolved in the fast page fault path without acquiring the MMU lock.
> > I believe +Junaid Shahid tried to upstream this approach at some point
> > in the past, but the patch set didn't make it in. (This was before my
> > time, so I'm hoping he has a link.) 

I tried to google the "eagerly split " approach using the keywords like
" eager splitting ", " eager splitting live migration " or add the name of
the author, but unfortunately no relevant links were found...
But still thanks for this information.

Regards,
Jay Zhou

>> I haven't done the analysis to
> > know if eager splitting is more or less efficient with parallel
> > slow-path page faults, but it's definitely faster under the MMU lock.
> 
> Yes, totally agreed.  Though comparing to eager splitting (which might still
> need a new capabilility for the changed behavior after all, not sure...), the
> per-memslot hint solution looks slightly nicer to me, imho, because it can offer
> more mechanism than policy.
> 
> Thanks,
> 
> --
> Peter Xu





[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux