Hi Peter,

> -----Original Message-----
> From: Peter Xu [mailto:peterx@xxxxxxxxxx]
> Sent: Wednesday, February 19, 2020 1:43 AM
> To: Zhoujian (jay) <jianjay.zhou@xxxxxxxxxx>
> Cc: kvm@xxxxxxxxxxxxxxx; qemu-devel@xxxxxxxxxx; pbonzini@xxxxxxxxxx;
> dgilbert@xxxxxxxxxx; quintela@xxxxxxxxxx; Liujinsong (Paul)
> <liu.jinsong@xxxxxxxxxx>; linfeng (M) <linfeng23@xxxxxxxxxx>; wangxin (U)
> <wangxinxin.wang@xxxxxxxxxx>; Huangweidong (C)
> <weidong.huang@xxxxxxxxxx>
> Subject: Re: RFC: Split EPT huge pages in advance of dirty logging
>
> On Tue, Feb 18, 2020 at 01:13:47PM +0000, Zhoujian (jay) wrote:
> > Hi all,
> >
> > We found that the guest will hit soft lockups occasionally when live
> > migrating a 60 vCPU, 512GiB, huge-page-backed and memory-sensitive VM.
> > The reason is clear: almost all of the vCPUs are waiting for the KVM MMU
> > spin-lock to create 4K SPTEs when the huge pages are write protected.
> > This phenomenon is also described in this patch set:
> > https://patchwork.kernel.org/cover/11163459/
> > which aims to handle page faults in parallel more efficiently.
> >
> > Our idea is to use the migration thread to touch all of the guest
> > memory in the granularity of 4K before enabling dirty logging. To be
> > more specific, we split all the PDPE_LEVEL SPTEs into DIRECTORY_LEVEL
> > SPTEs as the first step, and then split all the DIRECTORY_LEVEL SPTEs
> > into PAGE_TABLE_LEVEL SPTEs as the following step.
>
> IIUC, QEMU will prefer to use huge pages for all the anonymous ramblocks
> (please refer to ram_block_add):
>
>     qemu_madvise(new_block->host, new_block->max_length,
>                  QEMU_MADV_HUGEPAGE);

Yes, you're right.

> Another alternative I can think of is to add an extra parameter to QEMU
> to explicitly disable huge pages (so that can even be MADV_NOHUGEPAGE
> instead of MADV_HUGEPAGE). However that should also drag down the
> performance for the whole lifecycle of the VM.

From the performance point of view, it is better to keep the huge pages
when the VM is not in the live migration state.

> A 3rd option is to make a QMP command to dynamically turn huge pages
> on/off for ramblocks globally.

We're looking for a dynamic method too.

We plan to add two new flags for each memory slot, say
KVM_MEM_FORCE_PT_DIRECTORY_PAGES and KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES.
These flags can be set through the KVM_SET_USER_MEMORY_REGION ioctl.

mapping_level(), which is called by tdp_page_fault() on the kernel side,
will return PT_DIRECTORY_LEVEL if the KVM_MEM_FORCE_PT_DIRECTORY_PAGES
flag of the memory slot is set, and return PT_PAGE_TABLE_LEVEL if the
KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES flag is set.
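For illustration only, a minimal sketch of how such a per-slot cap could
look on the kernel side; the stub struct, the helper name and the flag bit
values are assumptions, not actual KVM code, and in a real patch this check
would sit in the mapping_level()/tdp_page_fault() path:

    /*
     * Illustrative sketch only -- not actual KVM code.
     */
    #include <stdint.h>

    #define PT_PAGE_TABLE_LEVEL 1   /* 4K */
    #define PT_DIRECTORY_LEVEL  2   /* 2M */
    #define PT_PDPE_LEVEL       3   /* 1G */

    /* Proposed memslot flags; the bit positions here are placeholders. */
    #define KVM_MEM_FORCE_PT_DIRECTORY_PAGES  (1u << 3)
    #define KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES (1u << 4)

    struct memslot_stub {
            uint32_t flags;
            /* other kvm_memory_slot fields elided */
    };

    /* Cap the level that mapping_level() would otherwise return. */
    static int forced_mapping_level(const struct memslot_stub *slot,
                                    int level)
    {
            if (slot->flags & KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES)
                    return PT_PAGE_TABLE_LEVEL;     /* force 4K SPTEs */
            if ((slot->flags & KVM_MEM_FORCE_PT_DIRECTORY_PAGES) &&
                level > PT_DIRECTORY_LEVEL)
                    return PT_DIRECTORY_LEVEL;      /* cap 1G at 2M */
            return level;
    }

With a cap like this in place, re-faulting a GPA through tdp_page_fault()
while the flag is set would install SPTEs no larger than the capped level.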
The key steps to split the huge pages in advance of enabling dirty logging
are as follows:

1. The migration thread in user space uses the KVM_SET_USER_MEMORY_REGION
   ioctl to set the KVM_MEM_FORCE_PT_DIRECTORY_PAGES flag for each memory
   slot.
2. The migration thread continues with the KVM_SPLIT_HUGE_PAGES ioctl
   (which is newly added) to do the splitting of large pages on the kernel
   side.
3. A new vCPU is created temporarily (it does some initialization but will
   not run) to help do the work, i.e. to act as the parameter of
   tdp_page_fault.
4. Collect the GPA ranges of all the memory slots with the
   KVM_MEM_FORCE_PT_DIRECTORY_PAGES flag set.
5. Split the 1G huge pages (collected in step 4) into 2M by calling
   tdp_page_fault, since mapping_level will return PT_DIRECTORY_LEVEL.
   This is the main difference from the usual path, which is triggered by
   the guest side (EPT violation/misconfig etc.): here we call it directly
   on the hypervisor side.
6. Do some cleanups, i.e. free the vCPU-related resources.
7. The KVM_SPLIT_HUGE_PAGES ioctl returns to the user space side.
8. Use KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES instead of
   KVM_MEM_FORCE_PT_DIRECTORY_PAGES and repeat step 1 ~ step 7; in step 5
   the 2M huge pages will be split into 4K pages.
9. Clear the KVM_MEM_FORCE_PT_DIRECTORY_PAGES and
   KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES flags for each memory slot.
10. Then the migration thread calls the log_start ioctl to enable dirty
    logging, and the remaining flow is the same as before.

(A rough userspace sketch of this sequence is appended at the end of this
mail.)

What's your take on this, thanks.

Regards,
Jay Zhou

> Haven't thought deep into any of them, but seems doable.
>
> Thanks,
>
> --
> Peter Xu
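Appendix: a rough userspace-side sketch of steps 1, 2 and 8 above.
KVM_SET_USER_MEMORY_REGION is the existing ioctl; the KVM_MEM_FORCE_PT_*
flag values and the zero-argument form of the proposed KVM_SPLIT_HUGE_PAGES
ioctl are placeholders, since neither exists upstream, and error handling
plus the per-slot bookkeeping are elided:

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Proposed flags/ioctl: placeholder values, not part of the real UAPI. */
    #define KVM_MEM_FORCE_PT_DIRECTORY_PAGES  (1u << 3)
    #define KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES (1u << 4)
    #define KVM_SPLIT_HUGE_PAGES              _IO(KVMIO, 0xff)

    /* One splitting pass: 1G -> 2M in the first call, 2M -> 4K in the second. */
    static int split_pass(int vm_fd,
                          struct kvm_userspace_memory_region *slots,
                          int nslots, uint32_t force_flag)
    {
            int i, ret;

            for (i = 0; i < nslots; i++) {
                    /* Step 1/8: flag the slot so the level gets capped. */
                    slots[i].flags &= ~(KVM_MEM_FORCE_PT_DIRECTORY_PAGES |
                                        KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES);
                    slots[i].flags |= force_flag;
                    ret = ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &slots[i]);
                    if (ret < 0)
                            return ret;
            }

            /* Steps 2-7: the kernel walks the flagged slots and splits them. */
            return ioctl(vm_fd, KVM_SPLIT_HUGE_PAGES, 0);
    }

    /*
     * Migration thread, before the log_start ioctl (steps 9-10 elided):
     *   split_pass(vm_fd, slots, n, KVM_MEM_FORCE_PT_DIRECTORY_PAGES);
     *   split_pass(vm_fd, slots, n, KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES);
     */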