On Fri, Apr 15, 2022 at 2:59 PM Oliver Upton <oupton@xxxxxxxxxx> wrote: > > Presently KVM only takes a read lock for stage 2 faults if it believes > the fault can be fixed by relaxing permissions on a PTE (write unprotect > for dirty logging). Otherwise, stage 2 faults grab the write lock, which > predictably can pile up all the vCPUs in a sufficiently large VM. > > The x86 port of KVM has what it calls the TDP MMU. Basically, it is an > MMU protected by the combination of a read-write lock and RCU, allowing > page walkers to traverse in parallel. > > This series is strongly inspired by the mechanics of the TDP MMU, > making use of RCU to protect parallel walks. Note that the TLB > invalidation mechanics are a bit different between x86 and ARM, so we > need to use the 'break-before-make' sequence to split/collapse a > block/table mapping, respectively. An alternative (or perhaps "v2" [1]) is to make x86's TDP MMU arch-neutral and port it to support ARM's stage-2 MMU. This is based on a few observations: - The problems that motivated the development of the TDP MMU are not x86-specific (e.g. parallelizing faults during the post-copy phase of Live Migration). - The synchronization in the TDP MMU (read/write lock, RCU for PT freeing, atomic compare-exchanges for modifying PTEs) is complex, but would be equivalent across architectures. - Eventually RISC-V is going to want similar performance (my understanding is RISC-V MMU is already a copy-paste of the ARM MMU), and it'd be a shame to re-implement TDP MMU synchronization a third time. - The TDP MMU includes support for various performance features that would benefit other architectures, such as eager page splitting, deferred zapping, lockless write-protection resolution, and (coming soon) in-place huge page promotion. - And then there's the obvious wins from less code duplication in KVM (e.g. get rid of the RISC-V MMU copy, increased code test coverage, ...). The side of this I haven't really looked into yet is ARM's stage-2 MMU, and how amenable it would be to being managed by the TDP MMU. But I assume it's a conventional page table structure mapping GPAs to HPAs, which is the most important overlap. That all being said, an arch-neutral TDP MMU would be a larger, more complex code change than something like this series (hence my "v2" caveat above). But I wanted to get this idea out there since the rubber is starting to hit the road on improving ARM MMU scalability. [1] "v2" as in the "next evolution" sense, not the "PATCH v2" sense :) > > Nonetheless, using atomics on the break side allows fault handlers to > acquire exclusive access to a PTE (lets just call it locked). Once the > PTE lock is acquired it is then safe to assume exclusive access. > > Special consideration is required when pruning the page tables in > parallel. Suppose we are collapsing a table into a block. Allowing > parallel faults means that a software walker could be in the middle of > a lower level traversal when the table is unlinked. Table > walkers that prune the paging structures must now 'lock' all descendent > PTEs, effectively asserting exclusive ownership of the substructure > (no other walker can install something to an already locked pte). > > Additionally, for parallel walks we need to punt the freeing of table > pages to the next RCU sync, as there could be multiple observers of the > table until all walkers exit the RCU critical section. For this I > decided to cram an rcu_head into page private data for every table page. > We wind up spending a bit more on table pages now, but lazily allocating > for rcu callbacks probably doesn't make a lot of sense. Not only would > we need a large cache of them (think about installing a level 1 block) > to wire up callbacks on all descendent tables, but we also then need to > spend memory to actually free memory. > > I tried to organize these patches as best I could w/o introducing > intermediate breakage. > > The first 5 patches are meant mostly as prepatory reworks, and, in the > case of RCU a nop. > > Patch 6 is quite large, but I had a hard time deciding how to change the > way we link/unlink tables to use atomics without breaking things along > the way. > > Patch 7 probably should come before patch 6, as it informs the other > read-side fault (perm relax) about when a map is in progress so it'll > back off. > > Patches 8-10 take care of the pruning case, actually locking the child ptes > instead of simply dropping table page references along the way. Note > that we cannot assume a pte points to a table/page at this point, hence > the same helper is called for pre- and leaf-traversal. Guide the > recursion based on what got yanked from the PTE. > > Patches 11-14 wire up everything to schedule rcu callbacks on > to-be-freed table pages. rcu_barrier() is called on the way out from > tearing down a stage 2 page table to guarantee all memory associated > with the VM has actually been cleaned up. > > Patches 15-16 loop in the fault handler to the new table traversal game. > > Lastly, patch 17 is a nasty bit of debugging residue to spot possible > table page leaks. Please don't laugh ;-) > > Smoke tested with KVM selftests + kvm_page_table_test w/ 2M hugetlb to > exercise the table pruning code. Haven't done anything beyond this, > sending as an RFC now to get eyes on the code. > > Applies to commit fb649bda6f56 ("Merge tag 'block-5.18-2022-04-15' of > git://git.kernel.dk/linux-block") > > Oliver Upton (17): > KVM: arm64: Directly read owner id field in stage2_pte_is_counted() > KVM: arm64: Only read the pte once per visit > KVM: arm64: Return the next table from map callbacks > KVM: arm64: Protect page table traversal with RCU > KVM: arm64: Take an argument to indicate parallel walk > KVM: arm64: Implement break-before-make sequence for parallel walks > KVM: arm64: Enlighten perm relax path about parallel walks > KVM: arm64: Spin off helper for initializing table pte > KVM: arm64: Tear down unlinked page tables in parallel walk > KVM: arm64: Assume a table pte is already owned in post-order > traversal > KVM: arm64: Move MMU cache init/destroy into helpers > KVM: arm64: Stuff mmu page cache in sub struct > KVM: arm64: Setup cache for stage2 page headers > KVM: arm64: Punt last page reference to rcu callback for parallel walk > KVM: arm64: Allow parallel calls to kvm_pgtable_stage2_map() > KVM: arm64: Enable parallel stage 2 MMU faults > TESTONLY: KVM: arm64: Add super lazy accounting of stage 2 table pages > > arch/arm64/include/asm/kvm_host.h | 5 +- > arch/arm64/include/asm/kvm_mmu.h | 2 + > arch/arm64/include/asm/kvm_pgtable.h | 14 +- > arch/arm64/kvm/arm.c | 4 +- > arch/arm64/kvm/hyp/nvhe/mem_protect.c | 13 +- > arch/arm64/kvm/hyp/nvhe/setup.c | 13 +- > arch/arm64/kvm/hyp/pgtable.c | 518 +++++++++++++++++++------- > arch/arm64/kvm/mmu.c | 120 ++++-- > 8 files changed, 503 insertions(+), 186 deletions(-) > > -- > 2.36.0.rc0.470.gd361397f0d-goog >