On 11/4/21 23:45, David Matlack wrote:
> The goal of this RFC is to get feedback on "Eager Page Splitting",
> an optimization that has been in use in Google Cloud since 2016 to
> reduce the performance impact of live migration on customer
> workloads. We wanted to get feedback on the feature before delving
> too far into porting it to the latest upstream kernel for submission.
> If there is interest in adding this feature to KVM we plan to follow
> up in the coming months with patches.
Hi David!
I'm definitely interested in eager page splitting upstream, but with a
twist: in order to limit the proliferation of knobs, I would rather
enable it only when KVM_DIRTY_LOG_INITIALLY_SET is set, and do the split
on the first KVM_CLEAR_DIRTY_LOG ioctl.
Initially-all-set does not require write protection when dirty logging
is enabled; instead, it delays write protection to the first
KVM_CLEAR_DIRTY_LOG. In fact, I believe that eager page splitting can
be enabled unconditionally for initially-all-set. You would still have
the benefit of moving the page splitting out of the vCPU run
path; and because you can smear the cost of splitting over multiple
KVM_CLEAR_DIRTY_LOG calls, most of the disadvantages go away.
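
To make the idea concrete, here is a minimal userspace sketch of the
flow I have in mind (vm_fd and the function name are placeholders,
error handling is omitted, and the flag defines are guarded in case
older installed headers lack them):

#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Flag values as documented for KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2;
 * guarded in case the installed headers predate them.
 */
#ifndef KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE
#define KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE     (1 << 0)
#endif
#ifndef KVM_DIRTY_LOG_INITIALLY_SET
#define KVM_DIRTY_LOG_INITIALLY_SET             (1 << 1)
#endif

/*
 * With initially-all-set, enabling dirty logging on a memslot write
 * protects nothing up front; write protection (and, with this
 * proposal, huge page splitting) is deferred to the first
 * KVM_CLEAR_DIRTY_LOG that covers each range.
 */
static int enable_initially_all_set(int vm_fd)
{
        struct kvm_enable_cap cap = {
                .cap = KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2,
                .args[0] = KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE |
                           KVM_DIRTY_LOG_INITIALLY_SET,
        };

        return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}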
Initially-all-set is already the best-performing method for bitmap-based
dirty page tracking, so it makes sense to focus on it. Even if Google
might not be using initially-all-set internally, adding eager page
splitting to the upstream code would remove most of the out-of-tree
delta for this feature. The rest of the delta can be tackled later; I'm
not super interested in adding eager page splitting for the older
methods (clear on KVM_GET_DIRTY_LOG, and manual-clear without
initially-all-set), but it should be useful for the ring buffer method,
and that *should* share most of the code with the older methods.
> In order to avoid allocating while holding the MMU lock, vCPUs
> preallocate everything they need to handle the fault and store it in
> kvm_mmu_memory_cache structs. Eager Page Splitting does the same
> thing, but since it runs outside of a vCPU thread, it needs its own
> copies of kvm_mmu_memory_cache structs. This requires refactoring the
> way kvm_mmu_memory_cache structs are passed around in the MMU code
> and adding kvm_mmu_memory_cache structs to kvm_arch.
That's okay, we can move more arguments into structs if needed, in the
same way as struct kvm_page_fault; or we can use kvm_get_running_vcpu()
if it's easier or more appropriate.
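
As a rough, untested sketch (the struct and helper names below are
invented, but kvm_mmu_memory_cache and kvm_mmu_topup_memory_cache are
the same primitives the vCPU fault path already uses), the splitting
path could keep its own caches in kvm_arch and top them up before
taking the MMU lock:

#include <linux/kvm_host.h>

/* Invented name: a VM-wide set of caches for eager page splitting. */
struct eager_split_caches {
        struct kvm_mmu_memory_cache page_header_cache; /* struct kvm_mmu_page */
        struct kvm_mmu_memory_cache shadow_page_cache; /* page table pages */
};

static int topup_eager_split_caches(struct eager_split_caches *caches, int min)
{
        int r;

        /* May sleep, so this must run before the MMU lock is taken. */
        r = kvm_mmu_topup_memory_cache(&caches->page_header_cache, min);
        if (r)
                return r;
        return kvm_mmu_topup_memory_cache(&caches->shadow_page_cache, min);
}

Under the lock the splitting code would then allocate from these with
kvm_mmu_memory_cache_alloc(), exactly like the per-vCPU caches.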
> * Increases the duration of the VM ioctls that enable dirty logging.
>   This does not affect customer performance but may have unintended
>   consequences depending on how userspace invokes the ioctl. For
>   example, eagerly splitting a 1.5TB memslot takes 30 seconds.
This issue goes away (or becomes easier to manage) if it's done in
KVM_CLEAR_DIRTY_LOG.
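
For instance, userspace can bound how much work each ioctl does by
clearing the log in fixed-size chunks. A rough sketch (error handling
omitted, CHUNK_PAGES is arbitrary; note that first_page must be a
multiple of 64, and so must num_pages except possibly for the last
chunk of the slot):

#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

#define CHUNK_PAGES     (256 * 1024)    /* 1GiB of 4KiB pages, arbitrary */

/*
 * Clear (and hence write protect and, with this proposal, split) one
 * chunk of a memslot at a time, spreading the cost over many ioctls
 * instead of a single long one.  For simplicity the sketch clears
 * every page in each chunk; "bitmap" must hold at least CHUNK_PAGES
 * bits.
 */
static void clear_slot_in_chunks(int vm_fd, __u32 slot, __u64 slot_npages,
                                 unsigned long *bitmap)
{
        __u64 first, npages;

        for (first = 0; first < slot_npages; first += CHUNK_PAGES) {
                npages = slot_npages - first;
                if (npages > CHUNK_PAGES)
                        npages = CHUNK_PAGES;

                /* Mark every page in the chunk for clearing. */
                memset(bitmap, 0xff, (npages + 7) / 8);

                struct kvm_clear_dirty_log clear = {
                        .slot = slot,
                        .first_page = first,
                        .num_pages = npages,
                        .dirty_bitmap = bitmap,
                };

                ioctl(vm_fd, KVM_CLEAR_DIRTY_LOG, &clear);
        }
}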
"RFC: Split EPT huge pages in advance of dirty logging" [1] was a
previous proposal to proactively split large pages off of the vCPU
threads. However it required faulting in every page in the migration
thread, a vCPU-like thread in QEMU, which requires extra userspace
support and also is less efficient since it requires faulting.
Yeah, this is best done on the kernel side.
> The last alternative is to perform dirty tracking at a 2M
> granularity. This would reduce the amount of splitting work required
> by 512x, making the current approach of splitting on fault less
> impactful to customer performance. We are in the early stages of
> investigating 2M dirty tracking internally but it will be a while
> before it is proven and ready for production. Furthermore, there may
> be scenarios where dirty tracking at 4K would be preferable to reduce
> the amount of memory that needs to be demand-faulted during precopy.
Granularity of dirty tracking is somewhat orthogonal to this anyway,
since you'd still have to split 1G pages down to 2M. So please let me know if
you're okay with the above twist, and let's go ahead with the plan!
Paolo