The goal of this RFC is to get feedback on "Eager Page Splitting", an
optimization that has been in use in Google Cloud since 2016 to reduce the
performance impact of live migration on customer workloads. We wanted to
gauge interest in the feature before going too far in porting it to the
latest upstream kernel for submission. If there is interest in adding this
feature to KVM we plan to follow up in the coming months with patches.

Background
==========

When KVM is tracking writes for dirty logging it write-protects any 2MiB or
1GiB pages that are mapped into the guest. When a vCPU writes to such a page,
a write-protection fault occurs, which KVM handles by allocating lower level
page tables, mapping in the faulting address with a PT-level (4KiB) SPTE, and
recording that the 4KiB page is dirty. Handling these faults is done under
the MMU lock. In the TDP MMU, where the MMU lock is an rwlock rather than a
spinlock, these faults are handled while holding the lock in read (shared)
mode, and atomic compare-exchanges are used when modifying SPTEs to detect
races and retry.

Motivation
==========

The write-protection faults needed to break down 2MiB and 1GiB mappings into
4KiB mappings are taken on the critical path of guest execution, which
negatively impacts guest performance. The negative impact scales with the
number of vCPUs, since each vCPU contends for the MMU lock.

This overhead can be seen by running dirty_log_perf_test with 1GiB per vCPU
and comparing the time it takes the test to dirty memory after enabling
dirty logging when the backing source is `anonymous` (4KiB pages) versus
`anonymous_hugetlb_1gb` (1GiB pages).

        |          First Pass Dirty Memory Time         |
        |       tdp_mmu=N       |       tdp_mmu=Y       |
 vCPUs  |   4KiB    |   1GiB    |   4KiB    |   1GiB    |
------- | --------- | --------- | --------- | --------- |
      1 |    0.063s |    0.241s |    0.061s |    0.238s |
      2 |    0.066s |    0.280s |    0.066s |    0.278s |
      4 |    0.069s |    0.359s |    0.070s |    0.298s |
      8 |    0.112s |    0.982s |    0.109s |    0.345s |
     16 |    0.197s |    3.153s |    0.183s |    1.020s |
     32 |    0.368s |    9.293s |    0.425s |    2.610s |
     64 |    0.456s |   23.291s |    0.354s |    4.212s |
    128 |    0.334s |   55.030s |    0.419s |    7.169s |
    256 |    0.576s |  141.332s |    0.492s |   13.874s |
    416 |    0.881s |  338.185s |    0.785s |   14.582s |

As expected, the performance overhead is egregious with the legacy MMU
(tdp_mmu=N), since every fault contends for exclusive access to the MMU
lock. However, even with the TDP MMU, where the MMU lock is only held in
read mode, perf recording confirms there is still contention due to the
hammering of atomic operations, and it scales with the number of vCPUs:

  +   28.16%  [k] _raw_read_lock
  +   28.10%  [k] direct_page_fault
  +   21.52%  [k] tdp_mmu_set_spte_atomic
  +    6.90%  [k] __handle_changed_spte
  +    3.93%  [k] __get_current_cr3_fast
  +    3.47%  [k] lockless_pages_from_mm

_raw_read_lock, direct_page_fault, tdp_mmu_set_spte_atomic, and
__handle_changed_spte each spend 99+% of their time on atomic operations.
Note that the _raw_read_lock time is pure reader/reader contention; it is
not spinning waiting for writers.

As of Nov 2021, Google Cloud supports live migrating VMs with up to 416
vCPUs and 28 GiB per vCPU. Customers using these large instances tend to run
applications that are sensitive to abrupt performance degradation. For these
VMs we use Eager Page Splitting in conjunction with the Direct MMU (our
internal predecessor to the TDP MMU).

Design
======

Eager Page Splitting occurs when dirty logging is enabled on a region of
memory. Before KVM write-protects all large page mappings, we attempt the
following two steps (*); a rough sketch of a single split follows the list:

1. Iterate through all 1GiB SPTEs and for each:
   a. Allocate a new shadow page table.
   b. Populate it with 2MiB SPTEs mapping the 1GiB page.
   c. Replace the 1GiB SPTE with a link to the shadow page table.

2. Iterate through all 2MiB SPTEs and for each:
   a. Allocate a new shadow page table.
   b. Populate it with 4KiB SPTEs mapping the 2MiB page.
   c. Replace the 2MiB SPTE with a link to the shadow page table.

(*) We could split 1GiB SPTEs directly down to 4KiB, and then split 2MiB
SPTEs down to 4KiB. But the two-step approach is a bit simpler to understand
and implement.
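To make the flow concrete, here is a minimal sketch of what splitting a
single large SPTE one level down might look like. This is illustrative only,
not the actual patches: alloc_child_sp, make_child_spte, and link_child_sp
are invented helper names.

/*
 * Hypothetical sketch, not the actual patches: alloc_child_sp,
 * make_child_spte, and link_child_sp are placeholder names used to
 * illustrate steps (a)-(c) above for a single large SPTE.
 */
static void split_large_spte(struct kvm *kvm, u64 *sptep, u64 large_spte,
                             int level)
{
        struct kvm_mmu_page *child_sp;
        int i;

        /* (a) Allocate a new page table from the pre-topped-up caches. */
        child_sp = alloc_child_sp(kvm, level - 1);

        /*
         * (b) Populate it with 512 SPTEs one level down. Each child SPTE
         * inherits the permissions of the large SPTE, except that execute
         * permission is restored on 4KiB SPTEs if the large SPTE was
         * forced non-executable by the NX HugePage mitigation.
         */
        for (i = 0; i < 512; i++)
                child_sp->spt[i] = make_child_spte(large_spte, level, i);

        /* (c) Replace the large SPTE with a link to the new page table. */
        link_child_sp(kvm, sptep, child_sp);
}

Note that the new page table is not reachable by any vCPU while it is being
populated, so step (b) does not require atomic operations; only the final
installation in step (c) needs to synchronize with concurrent faults.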
The implementation of splitting a 1GiB SPTE into 512 2MiB SPTEs and a 2MiB
SPTE into 512 4KiB SPTEs is generalized into a new function that splits an
SPTE into SPTEs one level smaller. The lower level SPTEs receive the same
set of permissions as the upper level SPTE, with one exception: execute
permission is granted to the 4KiB SPTEs if the large SPTE was forced
non-executable by the NX HugePage mitigation.

Eager Page Splitting can gracefully fall back to write-protection in
scenarios where we decide splitting is not worth the complexity, since the
existing write-protection logic runs after Eager Page Splitting. This makes
it possible to build Eager Page Splitting iteratively (e.g. just supporting
direct-map-only GFNs, then adding rmap support, etc.). It also means it
would be trivial to make Eager Page Splitting an optional capability if
desired.

Allocations
-----------

In order to avoid allocating while holding the MMU lock, vCPUs preallocate
everything they need to handle a fault and store it in kvm_mmu_memory_cache
structs. Eager Page Splitting does the same thing, but since it runs outside
of a vCPU thread it needs its own kvm_mmu_memory_cache structs. This
requires refactoring the way kvm_mmu_memory_cache structs are passed around
in the MMU code and adding kvm_mmu_memory_cache structs to kvm_arch.

Before splitting a large page, Eager Page Splitting checks whether it has
enough objects in its caches to fully split the page. If it does not, it
flushes TLBs, drops the MMU lock, calls cond_resched(), tops up the caches,
reacquires the MMU lock, and continues where it left off (see the sketch
below).
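A minimal sketch of that outer loop follows, again with invented names
(struct split_iter, for_each_large_spte, split_caches_have_enough,
topup_split_caches, restart_iter) plus the split_large_spte helper sketched
earlier; kvm_flush_remote_tlbs(), cond_resched(), and the mmu_lock rwlock
accessors are the only real kernel APIs used here.

/*
 * Hypothetical sketch, not the actual patches. The iterator and cache
 * helpers are invented names used to illustrate the locking flow
 * described above.
 */
static void eager_split_large_pages(struct kvm *kvm,
                                    const struct kvm_memory_slot *slot)
{
        struct split_iter iter;

        /*
         * Shown with the lock held for write; the TDP MMU could do this
         * under the read lock with atomic SPTE updates instead.
         */
        write_lock(&kvm->mmu_lock);

        for_each_large_spte(iter, slot) {
                if (!split_caches_have_enough(kvm)) {
                        /*
                         * Flush pending TLB invalidations, then drop the
                         * MMU lock so we can reschedule and top up the
                         * caches (which may sleep to allocate).
                         */
                        kvm_flush_remote_tlbs(kvm);
                        write_unlock(&kvm->mmu_lock);

                        cond_resched();
                        topup_split_caches(kvm);

                        write_lock(&kvm->mmu_lock);

                        /* Retry the SPTE we stopped at. */
                        restart_iter(&iter);
                        continue;
                }

                split_large_spte(kvm, iter.sptep, iter.old_spte, iter.level);
        }

        write_unlock(&kvm->mmu_lock);
}

Presumably batching the TLB flush this way, rather than flushing per SPTE,
is part of what makes the splitter more efficient than the fault path, as
noted in the Pros/Cons section below.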
Pros/Cons
---------

The benefits of Eager Page Splitting are:

* It eliminates the overhead of enabling dirty logging when using large
  pages. vCPUs will only take write-protection faults on 4KiB pages, which
  can be handled without acquiring the MMU lock (fast_page_fault),
  regardless of which page sizes were in use before the migration started.

* It is more efficient overall. We do not have to pay the VM-exit and fault
  handling costs for every 4KiB of guest memory, the splitter benefits from
  cache locality, and TLB flushes can be batched. When using the TDP MMU,
  the new child page table can be populated without atomics.

The downsides of Eager Page Splitting are:

* It introduces a new and complex path for modifying KVM's page tables
  outside of the vCPU fault path.

* It increases the duration of the VM ioctls that enable dirty logging. This
  does not affect customer performance but may have unintended consequences
  depending on how userspace invokes the ioctl. For example, eagerly
  splitting a 1.5TB memslot takes 30 seconds.

* It may increase memory usage, since it allocates the worst-case number of
  page tables needed to map all memory at 4KiB.

Alternatives
============

"RFC: Split EPT huge pages in advance of dirty logging" [1] was a previous
proposal to proactively split large pages off of the vCPU threads. However,
it required faulting in every page from the migration thread (a vCPU-like
thread in QEMU), which requires extra userspace support and is also less
efficient since every page still goes through the fault path.

Another alternative is to modify the vCPU fault handler to split the entire
large page when handling a write-protection fault on it, rather than mapping
in only the faulting 4KiB region. This is a middle-ground approach that
would be more efficient than the current fault-by-fault splitting, but it
has yet to be prototyped or proven in a production environment.

The last alternative is to perform dirty tracking at 2MiB granularity. This
would reduce the amount of splitting work required by 512x, making the
current approach of splitting on fault less impactful to customer
performance. We are in the early stages of investigating 2MiB dirty tracking
internally, but it will be a while before it is proven and ready for
production. Furthermore, there may be scenarios where dirty tracking at 4KiB
would be preferable to reduce the amount of memory that needs to be
demand-faulted during precopy.

Credit
======

Eager Page Splitting was originally designed and implemented by Peter Feiner
<pfeiner@xxxxxxxxxx> for Google's internal kernel in 2016.

Appendix
========

In order to collect the performance results I ran the following commands on
an Intel Cascade Lake host that is used to run Google Cloud's
m2-ultramem-416 VMs:

echo Y > /sys/module/kvm/parameters/tdp_mmu
./dirty_log_perf_test -v${vcpus} -s anonymous
./dirty_log_perf_test -v${vcpus} -s anonymous_hugetlb_1gb

echo N > /sys/module/kvm/parameters/tdp_mmu
./dirty_log_perf_test -v${vcpus} -s anonymous
./dirty_log_perf_test -v${vcpus} -s anonymous_hugetlb_1gb

[1] https://lists.gnu.org/archive/html/qemu-devel/2020-02/msg04774.html