Re: RFC: KVM: x86/mmu: Eager Page Splitting

On Fri, Nov 05, 2021 at 09:44:14AM +0100, Paolo Bonzini wrote:
> On 11/4/21 23:45, David Matlack wrote:
> > The goal of this RFC is to get feedback on "Eager Page Splitting",
> > an optimization that has been in use in Google Cloud since 2016 to
> > reduce the performance impact of live migration on customer workloads.
> > We wanted to get feedback on the feature before delving too far into
> > porting it to the latest upstream kernel for submission.
> > If there is interest in adding this feature to KVM we plan to follow
> > up in the coming months with patches.
> 
> Hi David!
> 
> I'm definitely interested in eager page splitting upstream, but with a
> twist: in order to limit the proliferation of knobs, I would rather
> enable it only when KVM_DIRTY_LOG_INITIALLY_SET is set, and do the split
> on the first KVM_CLEAR_DIRTY_LOG ioctl.
> 
> Initially-all-set does not require write protection when dirty logging
> is enabled; instead, it delays write protection to the first
> KVM_CLEAR_DIRTY_LOG.  In fact, I believe that eager page splitting can
> be enabled unconditionally for initially-all-set.  You would still have
> the benefit of moving the page splitting out of the vCPU run
> path; and because you can smear the cost of splitting over multiple
> calls, most of the disadvantages go away.

Splitting on the first call to KVM_CLEAR_DIRTY_LOG when
initially-all-set is enabled sounds fine to me. But it does add
complexity compared to unconditionally splitting the entire memslot
when dirty logging is enabled, which (I now realize) is what the
ring buffer method needs anyway. More below...
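
For concreteness, a rough sketch of what I picture on the
KVM_CLEAR_DIRTY_LOG side (kvm_split_huge_pages_in_range() is just a
placeholder name for whatever the splitting helper ends up being, and
the range arguments are whatever the clear path has in hand):

        /*
         * Only initially-all-set defers write protection to
         * KVM_CLEAR_DIRTY_LOG, so only split here in that case. Split
         * the range covered by the bits being cleared before it gets
         * write-protected; later calls over the same range should find
         * nothing left to split.
         */
        if (kvm->manual_dirty_log_protect & KVM_DIRTY_LOG_INITIALLY_SET)
                kvm_split_huge_pages_in_range(kvm, slot, start_gfn, nr_pages);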

> 
> Initially-all-set is already the best-performing method for bitmap-based
> dirty page tracking, so it makes sense to focus on it.  Even if Google
> might not be using initially-all-set internally, adding eager page
> splitting to the upstream code would remove most of the delta related to
> it.  The rest of the delta can be tackled later;

Yeah we are still using the legacy clear-on-get-dirty interface.
Upstreaming eager page splitting for initially-all-set would address
most of the delta and give us extra motivation to switch off of
clear-on-get-dirty :).

> I'm not super
> interested in adding eager page splitting for the older methods (clear
> on KVM_GET_DIRTY_LOG, and manual-clear without initially-all-set), but
> it should be useful for the ring buffer method and that *should* share
> most of the code with the older methods.

Using Eager Page Splitting with the ring buffer method would require
splitting the entire memslot when dirty logging is enabled for that
memslot, right? Are you saying we should do that?

i.e. in kvm_mmu_slot_apply_flags we'd have something like:

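        /* dirty ring in use: no KVM_CLEAR_DIRTY_LOG to hook into */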
        if (kvm->dirty_ring_size)
                kvm_slot_split_large_pages(kvm, slot);

If so, maybe we should just unconditionally do eager page splitting for
the entire memslot, which would save us from having to add eager page
splitting in two places.

> 
> > In order to avoid allocating while holding the MMU lock, vCPUs
> > preallocate everything they need to handle the fault and store it in
> > kvm_mmu_memory_cache structs. Eager Page Splitting does the same thing
> > but since it runs outside of a vCPU thread it needs its own copies of
> > kvm_mmu_memory_cache structs. This requires refactoring the
> > way kvm_mmu_memory_cache structs are passed around in the MMU code
> > and adding kvm_mmu_memory_cache structs to kvm_arch.
> 
> That's okay, we can move more arguments to structs if needed in the same
> way as struct kvm_page_fault; or we can use kvm_get_running_vcpu() if
> it's easier or more appropriate.
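
To illustrate what "adding kvm_mmu_memory_cache structs to kvm_arch"
looks like in practice, it is roughly the following (field names here
are purely illustrative, not a proposal):

        struct kvm_arch {
                ...
                /*
                 * Caches used by eager page splitting to allocate new
                 * page tables outside of a vCPU context.
                 */
                struct kvm_mmu_memory_cache split_page_header_cache;
                struct kvm_mmu_memory_cache split_shadow_page_cache;
        };

These would get topped up (e.g. via kvm_mmu_topup_memory_cache())
before taking the MMU lock, the same way the vCPU fault path
preallocates today.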
> 
> > * Increases the duration of the VM ioctls that enable dirty logging.
> > This does not affect customer performance but may have unintended
> > consequences depending on how userspace invokes the ioctl. For example,
> > eagerly splitting a 1.5TB memslot takes 30 seconds.
> 
> This issue goes away (or becomes easier to manage) if it's done in
> KVM_CLEAR_DIRTY_LOG.
> 
> > "RFC: Split EPT huge pages in advance of dirty logging" [1] was a
> > previous proposal to proactively split large pages off of the vCPU
> > threads. However it required faulting in every page in the migration
> > thread, a vCPU-like thread in QEMU, which requires extra userspace
> > support and is also less efficient since every page has to go
> > through the fault path.
> 
> Yeah, this is best done on the kernel side.
> 
> > The last alternative is to perform dirty tracking at a 2M granularity.
> > This would reduce the amount of splitting work required by 512x,
> > making the current approach of splitting on fault less
> > impactful to customer performance. We are in the early stages of
> > investigating 2M dirty tracking internally but it will be a while before
> > it is proven and ready for production. Furthermore there may be
> > scenarios where dirty tracking at 4K would be preferable to reduce
> > the amount of memory that needs to be demand-faulted during precopy.
> 
> Granularity of dirty tracking is somewhat orthogonal to this anyway,
> since you'd have to split 1G pages down to 2M.  So please let me know if
> you're okay with the above twist, and let's go ahead with the plan!
> 
> Paolo
> 


