Hi Peter,

> -----Original Message-----
> From: Peter Xu [mailto:peterx@xxxxxxxxxx]
> Sent: Wednesday, February 19, 2020 1:43 AM
> To: Zhoujian (jay) <jianjay.zhou@xxxxxxxxxx>
> Cc: kvm@xxxxxxxxxxxxxxx; qemu-devel@xxxxxxxxxx; pbonzini@xxxxxxxxxx;
> dgilbert@xxxxxxxxxx; quintela@xxxxxxxxxx; Liujinsong (Paul)
> <liu.jinsong@xxxxxxxxxx>; linfeng (M) <linfeng23@xxxxxxxxxx>; wangxin (U)
> <wangxinxin.wang@xxxxxxxxxx>; Huangweidong (C)
> <weidong.huang@xxxxxxxxxx>
> Subject: Re: RFC: Split EPT huge pages in advance of dirty logging
>
> On Tue, Feb 18, 2020 at 01:13:47PM +0000, Zhoujian (jay) wrote:
> > Hi all,
> >
> > We found that the guest will hit soft lockups occasionally when live
> > migrating a 60 vCPU, 512GiB, huge-page-backed and memory-sensitive VM.
> > The reason is clear: almost all of the vCPUs are waiting for the KVM MMU
> > spin-lock to create 4K SPTEs when the huge pages are write protected.
> > This phenomenon is also described in this patch set:
> > https://patchwork.kernel.org/cover/11163459/
> > which aims to handle page faults in parallel more efficiently.
> >
> > Our idea is to use the migration thread to touch all of the guest
> > memory in the granularity of 4K before enabling dirty logging. To be
> > more specific, we split all the PDPE_LEVEL SPTEs into DIRECTORY_LEVEL
> > SPTEs as the first step, and then split all the DIRECTORY_LEVEL SPTEs
> > into PAGE_TABLE_LEVEL SPTEs as the following step.
>
> IIUC, QEMU will prefer to use huge pages for all the anonymous ramblocks
> (please refer to ram_block_add):
>
>     qemu_madvise(new_block->host, new_block->max_length,
>                  QEMU_MADV_HUGEPAGE);

Yes, you're right.

> Another alternative I can think of is to add an extra parameter to QEMU
> to explicitly disable huge pages (so that can even be MADV_NOHUGEPAGE
> instead of MADV_HUGEPAGE). However that should also drag down the
> performance for the whole lifecycle of the VM.

From the performance point of view, it is better to keep the huge pages
when the VM is not in the live migration state.

> A 3rd option is to make a QMP command to dynamically turn huge pages
> on/off for ramblocks globally.

We're looking for a dynamic method too.

We plan to add two new flags for each memory slot, say
KVM_MEM_FORCE_PT_DIRECTORY_PAGES and KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES.
These flags can be set through the KVM_SET_USER_MEMORY_REGION ioctl.

mapping_level(), which is called by tdp_page_fault() on the kernel side,
will return PT_DIRECTORY_LEVEL if the KVM_MEM_FORCE_PT_DIRECTORY_PAGES
flag of the memory slot is set, and return PT_PAGE_TABLE_LEVEL if the
KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES flag is set.
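For illustration only, a minimal sketch of how such a per-slot cap could
look on the kernel side; the stub struct, the helper name and the flag bit
values are assumptions, not actual KVM code, and in a real patch this check
would sit in the mapping_level()/tdp_page_fault() path:

    /*
     * Illustrative sketch only -- not actual KVM code.
     */
    #include <stdint.h>

    #define PT_PAGE_TABLE_LEVEL 1   /* 4K */
    #define PT_DIRECTORY_LEVEL  2   /* 2M */
    #define PT_PDPE_LEVEL       3   /* 1G */

    /* Proposed memslot flags; the bit positions here are placeholders. */
    #define KVM_MEM_FORCE_PT_DIRECTORY_PAGES  (1u << 3)
    #define KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES (1u << 4)

    struct memslot_stub {
            uint32_t flags;
            /* other kvm_memory_slot fields elided */
    };

    /* Cap the level that mapping_level() would otherwise return. */
    static int forced_mapping_level(const struct memslot_stub *slot,
                                    int level)
    {
            if (slot->flags & KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES)
                    return PT_PAGE_TABLE_LEVEL;     /* force 4K SPTEs */
            if ((slot->flags & KVM_MEM_FORCE_PT_DIRECTORY_PAGES) &&
                level > PT_DIRECTORY_LEVEL)
                    return PT_DIRECTORY_LEVEL;      /* cap 1G at 2M */
            return level;
    }

With a cap like this in place, re-faulting a GPA through tdp_page_fault()
while the flag is set would install SPTEs no larger than the capped level.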
The key steps to split the huge pages in advance of enabling dirty logging
are as follows:

1. The migration thread in user space uses the KVM_SET_USER_MEMORY_REGION
   ioctl to set the KVM_MEM_FORCE_PT_DIRECTORY_PAGES flag for each memory
   slot.
2. The migration thread continues with the KVM_SPLIT_HUGE_PAGES ioctl
   (which is newly added) to do the splitting of large pages on the kernel
   side.
3. A new vCPU is created temporarily (it does some initialization but will
   not run) to help do the work, i.e. to act as the parameter of
   tdp_page_fault.
4. Collect the GPA ranges of all the memory slots with the
   KVM_MEM_FORCE_PT_DIRECTORY_PAGES flag set.
5. Split the 1G huge pages (collected in step 4) into 2M by calling
   tdp_page_fault, since mapping_level will return PT_DIRECTORY_LEVEL.
   This is the main difference from the usual path, which is triggered by
   the guest side (EPT violation/misconfig etc.): here we call it directly
   on the hypervisor side.
6. Do some cleanups, i.e. free the vCPU-related resources.
7. The KVM_SPLIT_HUGE_PAGES ioctl returns to the user space side.
8. Use KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES instead of
   KVM_MEM_FORCE_PT_DIRECTORY_PAGES and repeat step 1 ~ step 7; in step 5
   the 2M huge pages will be split into 4K pages.
9. Clear the KVM_MEM_FORCE_PT_DIRECTORY_PAGES and
   KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES flags for each memory slot.
10. Then the migration thread calls the log_start ioctl to enable dirty
    logging, and the remaining flow is the same as before.

(A rough userspace sketch of this sequence is appended at the end of this
mail.)

What's your take on this, thanks.

Regards,
Jay Zhou

> Haven't thought deep into any of them, but seems doable.
>
> Thanks,
>
> --
> Peter Xu
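Appendix: a rough userspace-side sketch of steps 1, 2 and 8 above.
KVM_SET_USER_MEMORY_REGION is the existing ioctl; the KVM_MEM_FORCE_PT_*
flag values and the zero-argument form of the proposed KVM_SPLIT_HUGE_PAGES
ioctl are placeholders, since neither exists upstream, and error handling
plus the per-slot bookkeeping are elided:

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Proposed flags/ioctl: placeholder values, not part of the real UAPI. */
    #define KVM_MEM_FORCE_PT_DIRECTORY_PAGES  (1u << 3)
    #define KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES (1u << 4)
    #define KVM_SPLIT_HUGE_PAGES              _IO(KVMIO, 0xff)

    /* One splitting pass: 1G -> 2M in the first call, 2M -> 4K in the second. */
    static int split_pass(int vm_fd,
                          struct kvm_userspace_memory_region *slots,
                          int nslots, uint32_t force_flag)
    {
            int i, ret;

            for (i = 0; i < nslots; i++) {
                    /* Step 1/8: flag the slot so the level gets capped. */
                    slots[i].flags &= ~(KVM_MEM_FORCE_PT_DIRECTORY_PAGES |
                                        KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES);
                    slots[i].flags |= force_flag;
                    ret = ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &slots[i]);
                    if (ret < 0)
                            return ret;
            }

            /* Steps 2-7: the kernel walks the flagged slots and splits them. */
            return ioctl(vm_fd, KVM_SPLIT_HUGE_PAGES, 0);
    }

    /*
     * Migration thread, before the log_start ioctl (steps 9-10 elided):
     *   split_pass(vm_fd, slots, n, KVM_MEM_FORCE_PT_DIRECTORY_PAGES);
     *   split_pass(vm_fd, slots, n, KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES);
     */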