On Thu, Feb 20, 2020 at 09:34:52AM -0800, Ben Gardon wrote:
> On Thu, Feb 20, 2020 at 5:53 AM Zhoujian (jay) <jianjay.zhou@xxxxxxxxxx> wrote:
> >
> > > -----Original Message-----
> > > From: Peter Xu [mailto:peterx@xxxxxxxxxx]
> > > Sent: Thursday, February 20, 2020 1:19 AM
> > > To: Zhoujian (jay) <jianjay.zhou@xxxxxxxxxx>
> > > Cc: kvm@xxxxxxxxxxxxxxx; qemu-devel@xxxxxxxxxx; pbonzini@xxxxxxxxxx;
> > > dgilbert@xxxxxxxxxx; quintela@xxxxxxxxxx; Liujinsong (Paul)
> > > <liu.jinsong@xxxxxxxxxx>; linfeng (M) <linfeng23@xxxxxxxxxx>; wangxin (U)
> > > <wangxinxin.wang@xxxxxxxxxx>; Huangweidong (C) <weidong.huang@xxxxxxxxxx>
> > > Subject: Re: RFC: Split EPT huge pages in advance of dirty logging
> > >
> > > On Wed, Feb 19, 2020 at 01:19:08PM +0000, Zhoujian (jay) wrote:
> > > > Hi Peter,
> > > >
> > > > > -----Original Message-----
> > > > > From: Peter Xu [mailto:peterx@xxxxxxxxxx]
> > > > > Sent: Wednesday, February 19, 2020 1:43 AM
> > > > > To: Zhoujian (jay) <jianjay.zhou@xxxxxxxxxx>
> > > > > Cc: kvm@xxxxxxxxxxxxxxx; qemu-devel@xxxxxxxxxx; pbonzini@xxxxxxxxxx;
> > > > > dgilbert@xxxxxxxxxx; quintela@xxxxxxxxxx; Liujinsong (Paul)
> > > > > <liu.jinsong@xxxxxxxxxx>; linfeng (M) <linfeng23@xxxxxxxxxx>;
> > > > > wangxin (U) <wangxinxin.wang@xxxxxxxxxx>; Huangweidong (C)
> > > > > <weidong.huang@xxxxxxxxxx>
> > > > > Subject: Re: RFC: Split EPT huge pages in advance of dirty logging
> > > > >
> > > > > On Tue, Feb 18, 2020 at 01:13:47PM +0000, Zhoujian (jay) wrote:
> > > > > > Hi all,
> > > > > >
> > > > > > We found that the guest will occasionally soft-lockup when live
> > > > > > migrating a 60 vCPU, 512GiB, huge-page-backed and memory-sensitive
> > > > > > VM. The reason is clear: almost all of the vCPUs are waiting for
> > > > > > the KVM MMU spin-lock to create 4K SPTEs when the huge pages are
> > > > > > write protected. This phenomenon is also described in this patch
> > > > > > set:
> > > > > > https://patchwork.kernel.org/cover/11163459/
> > > > > > which aims to handle page faults in parallel more efficiently.
> > > > > >
> > > > > > Our idea is to use the migration thread to touch all of the guest
> > > > > > memory at 4K granularity before enabling dirty logging. To be more
> > > > > > specific, we split all the PDPE_LEVEL SPTEs into DIRECTORY_LEVEL
> > > > > > SPTEs as the first step, and then split all the DIRECTORY_LEVEL
> > > > > > SPTEs into PAGE_TABLE_LEVEL SPTEs as the following step.
> > > > >
> > > > > IIUC, QEMU will prefer to use huge pages for all the anonymous
> > > > > ramblocks (please refer to ram_block_add):
> > > > >
> > > > >     qemu_madvise(new_block->host, new_block->max_length,
> > > > >                  QEMU_MADV_HUGEPAGE);
> > > >
> > > > Yes, you're right.
> > > >
> > > > > Another alternative I can think of is to add an extra parameter to
> > > > > QEMU to explicitly disable huge pages (so that can even be
> > > > > MADV_NOHUGEPAGE instead of MADV_HUGEPAGE). However that should also
> > > > > drag down the performance for the whole lifecycle of the VM.
> > > >
> > > > From the performance point of view, it is better to keep the huge
> > > > pages when the VM is not in the live migration state.
> > > >
> > > > > A 3rd option is to make a QMP command to dynamically turn huge pages
> > > > > on/off for ramblocks globally.
> > > >
> > > > We're searching for a dynamic method too.
> > > > We plan to add two new flags for each memory slot, say
> > > > KVM_MEM_FORCE_PT_DIRECTORY_PAGES and
> > > > KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES. These flags can be set through the
> > > > KVM_SET_USER_MEMORY_REGION ioctl. [1]
> > > >
> > > > The mapping_level function, which is called by tdp_page_fault on the
> > > > kernel side, will return PT_DIRECTORY_LEVEL if the
> > > > KVM_MEM_FORCE_PT_DIRECTORY_PAGES flag of the memory slot is set, and
> > > > PT_PAGE_TABLE_LEVEL if the KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES flag is
> > > > set.
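
(Just to make sure I'm reading the proposal right: I suppose the kernel
side boils down to something like the sketch below. The
KVM_MEM_FORCE_PT_* flags and their bit values are the ones proposed in
this thread, not existing uapi, the helper name is made up, and the real
mapping_level() has more constraints to respect, e.g. the host page size
and the lpage_info disallow counts; this only illustrates the clamping
idea.)

    /* Proposed flag bits; values made up, not in include/uapi/linux/kvm.h. */
    #define KVM_MEM_FORCE_PT_DIRECTORY_PAGES   (1UL << 2)
    #define KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES  (1UL << 3)

    /*
     * Hypothetical helper: clamp the mapping level for a fault according
     * to the per-memslot force flags.
     */
    static int forced_mapping_level(struct kvm_memory_slot *slot, int level)
    {
            if (slot->flags & KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES)
                    return PT_PAGE_TABLE_LEVEL;             /* force 4K */
            if (slot->flags & KVM_MEM_FORCE_PT_DIRECTORY_PAGES)
                    return min(level, PT_DIRECTORY_LEVEL);  /* cap at 2M */
            return level;
    }

The real logic would of course still need to honor whatever the host
backing allows (it can only go smaller, never larger), which is why I
wrote it as a clamp rather than a plain override.
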
> > > > The key steps to split the huge pages in advance of enabling dirty
> > > > logging are as follows:
> > > > 1. The migration thread in user space uses the
> > > > KVM_SET_USER_MEMORY_REGION ioctl to set the
> > > > KVM_MEM_FORCE_PT_DIRECTORY_PAGES flag for each memory slot.
> > > > 2. The migration thread continues with the KVM_SPLIT_HUGE_PAGES ioctl
> > > > (which is newly added) to do the splitting of large pages on the
> > > > kernel side.
> > > > 3. A new vCPU is created temporarily (it does some initialization but
> > > > will not run), to help do the work, i.e. to be used as the parameter
> > > > of tdp_page_fault.
> > > > 4. Collect the GPA ranges of all the memory slots with the
> > > > KVM_MEM_FORCE_PT_DIRECTORY_PAGES flag set.
> > > > 5. Split the 1G huge pages (collected in step 4) into 2M by calling
> > > > tdp_page_fault, since mapping_level will return PT_DIRECTORY_LEVEL.
> > > > This is the main difference from the usual path, which is triggered
> > > > by the guest side (EPT violation/misconfig etc.): here we call it
> > > > directly on the hypervisor side.
> > > > 6. Do some cleanups, i.e. free the vCPU related resources.
> > > > 7. The KVM_SPLIT_HUGE_PAGES ioctl returns to the user space side.
> > > > 8. Use KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES instead of
> > > > KVM_MEM_FORCE_PT_DIRECTORY_PAGES to repeat step 1 ~ step 7; in step 5
> > > > the 2M huge pages will be split into 4K pages.
> > > > 9. Clear the KVM_MEM_FORCE_PT_DIRECTORY_PAGES and
> > > > KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES flags for each memory slot.
> > > > 10. Then the migration thread calls the log_start ioctl to enable
> > > > dirty logging, and the remaining steps are the same as before.
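
(For my own understanding, the userspace side of steps 1, 2 and 7~9
would then look roughly like the sketch below. Again, KVM_MEM_FORCE_PT_*
and KVM_SPLIT_HUGE_PAGES are the flags/ioctl proposed in this thread
rather than existing KVM uapi, the ioctl number is made up, and error
handling is mostly omitted.)

    #include <linux/kvm.h>
    #include <sys/ioctl.h>

    /* Proposed VM ioctl; the request number is made up for this sketch. */
    #define KVM_SPLIT_HUGE_PAGES  _IO(KVMIO, 0xd0)

    /*
     * Tag every slot with a forced page-table level, then ask KVM to do
     * the splitting.  Called once with the "2M" flag and once with the
     * "4K" flag, matching steps 1~8 above.
     */
    static int split_slots_to_level(int vm_fd,
                                    struct kvm_userspace_memory_region *slots,
                                    int nslots, __u32 force_flag)
    {
            int i;

            for (i = 0; i < nslots; i++) {                   /* step 1 (or 8) */
                    slots[i].flags |= force_flag;
                    if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &slots[i]) < 0)
                            return -1;
            }

            return ioctl(vm_fd, KVM_SPLIT_HUGE_PAGES, 0);    /* steps 2~7 */
    }

The temporary vCPU and the GPA collection of steps 3~6 would then stay
entirely inside the KVM_SPLIT_HUGE_PAGES call, and after both passes the
caller clears the force flags (step 9) and enables dirty logging
(step 10).
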
> > > I'm not sure... I think it would be good if there were a way to have
> > > finer granularity control on using huge pages for any process; then
> > > KVM could directly leverage that, because KVM page tables should
> > > always respect the mm configuration (so e.g. when a huge page is
> > > split, KVM gets notified via mmu notifiers). Have you thought of such
> > > a more general way?
> >
> > I did think of this. If we split the huge pages of a process into 4K,
> > I'm afraid it will not be workable for the huge page sharing scenario,
> > e.g. DPDK, SPDK etc. So the goal is to split only the EPT page tables
> > and keep the VM process page tables (e.g. QEMU's) untouched.

Ah I see your point now.

> > > (And I just noticed that MADV_NOHUGEPAGE is only a hint to khugepaged
> > > and probably won't split any huge page at all after madvise()
> > > returns..)
> > > To tell the truth I'm still confused on how the split of huge pages
> > > helped in your case...
> >
> > I'm sorry if the meaning was not expressed clearly, and thanks for your
> > patience.
> >
> > > If I read it right, the test reduced some execution time from 9s to a
> > > few ms after your splitting of huge pages.
> >
> > Yes.
> >
> > > The thing is I don't see how the split of huge pages could solve the
> > > mmu_lock contention with the huge VM, because IMO even if we split the
> > > huge pages into smaller ones, those pages should still be
> > > write-protected and need roughly the same number of page faults to
> > > resolve when accessed/written?  And I thought that should only be
> > > fixed with solutions like what Ben has proposed with the MMU rework.
> > > Could you show me what I've missed?
> >
> > Let me try to describe the reason for the mmu_lock contention more
> > clearly, and the effort we tried to make...
> > The huge VM only has EPT SPTEs at level >= 2, and level 1 SPTEs don't
> > exist at the beginning. Write protecting all the huge pages will
> > trigger EPT violations to create level 1 SPTEs for all the vCPUs which
> > want to write the content of the memory. Different vCPUs write
> > different areas of the memory, but they need the same kvm->mmu_lock to
> > create the level 1 SPTEs. This situation gets worse when the number of
> > vCPUs and the memory of the VM are large (in our case 60U512G, i.e. 60
> > vCPUs and 512G) and the VM has memory-write-intensive work to do at the
> > same time. In order to reduce the mmu_lock contention, we try to write
> > protect the VM memory gradually in small chunks, such as 1G or 2M,
> > using a vCPU temporarily created by the migration thread to split 1G
> > into 2M as the first step, and 2M into 4K as the second step (this is a
> > little hacky... and I do not know whether any side effects will be
> > triggered).
> > Compared to write protecting all the VM memory in one go, the
> > write-protected range is limited this way, and only the vCPUs writing
> > this limited range will be involved in taking the mmu_lock. The
> > contention will be reduced since the memory range is small and the
> > number of vCPUs involved is small too.
> >
> > Of course, it will take some extra time to split all the huge pages
> > into 4K pages before the real migration starts, about 60s for 512G in
> > my experiment.
> >
> > During the memory iterative copy phase, PML will do the dirty logging
> > work (the 4K pages are not write protected in that case), or, IIRC,
> > fast_page_fault is used to mark pages dirty if PML is not supported, in
> > which case the mmu_lock is not needed.

Yes I missed both of these.  Thanks for explaining!  Then it makes sense
at least to me with your idea.  Though instead of the KVM_MEM_FORCE_PT_*
naming [1], we can also embed the allowed page sizes for the memslot into
the flags using a few bits, with another new kvm cap.
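
To be concrete, something like the below is what I have in mind; every
name and value in it is made up and it is only meant to show the shape:
a new capability that userspace probes first, plus a few per-memslot
flag bits describing which page sizes are allowed for that slot.

    #include <linux/kvm.h>
    #include <sys/ioctl.h>

    /* All hypothetical: neither the cap nor the flag bits exist today. */
    #define KVM_CAP_MEMSLOT_PAGE_SIZES        250
    #define KVM_MEM_ALLOW_PAGE_SIZE_4K        (1U << 2)
    #define KVM_MEM_ALLOW_PAGE_SIZE_2M        (1U << 3)
    #define KVM_MEM_ALLOW_PAGE_SIZE_1G        (1U << 4)

    static int restrict_slot_page_sizes(int kvm_fd, int vm_fd,
                                        struct kvm_userspace_memory_region *slot,
                                        unsigned int allowed)
    {
            /* Old kernels don't know the new bits, so probe the cap first. */
            if (ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_MEMSLOT_PAGE_SIZES) <= 0)
                    return -1;

            /* Replace the page-size bits with the requested set, e.g. only
             * ..._4K right before dirty logging starts. */
            slot->flags &= ~(KVM_MEM_ALLOW_PAGE_SIZE_4K |
                             KVM_MEM_ALLOW_PAGE_SIZE_2M |
                             KVM_MEM_ALLOW_PAGE_SIZE_1G);
            slot->flags |= allowed;

            return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, slot);
    }

That way KVM only provides the mechanism (which page sizes a slot may
use), and userspace keeps the policy of when to restrict and when to
relax it, e.g. around dirty logging.
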
> >
> > Regards,
> > Jay Zhou
>
> (Ah I top-posted, I'm sorry.  Re-sending at the bottom.)
>
> FWIW, we currently do this eager splitting at Google for live
> migration. When the log-dirty-memory flag is set on a memslot we
> eagerly split all pages in the slot down to 4k granularity.
> As Jay said, this does not cause crippling lock contention, because the
> vCPU page faults generated by write protection / splitting can be
> resolved in the fast page fault path without acquiring the MMU lock.
> I believe +Junaid Shahid tried to upstream this approach at some point
> in the past, but the patch set didn't make it in. (This was before my
> time, so I'm hoping he has a link.)
> I haven't done the analysis to know if eager splitting is more or less
> efficient with parallel slow-path page faults, but it's definitely
> faster under the MMU lock.

Yes, totally agreed.  Though compared to eager splitting (which might
still need a new capability for the changed behavior after all, not
sure...), the per-memslot hint solution looks slightly nicer to me,
imho, because it can offer more mechanism than policy.

Thanks,

-- 
Peter Xu