On Fri, Apr 15, 2022 at 2:59 PM Oliver Upton <oupton@xxxxxxxxxx> wrote:
>
> Presently KVM only takes a read lock for stage 2 faults if it believes
> the fault can be fixed by relaxing permissions on a PTE (write unprotect
> for dirty logging). Otherwise, stage 2 faults grab the write lock, which
> predictably can pile up all the vCPUs in a sufficiently large VM.
>
> The x86 port of KVM has what it calls the TDP MMU. Basically, it is an
> MMU protected by the combination of a read-write lock and RCU, allowing
> page walkers to traverse in parallel.
>
> This series is strongly inspired by the mechanics of the TDP MMU,
> making use of RCU to protect parallel walks. Note that the TLB
> invalidation mechanics are a bit different between x86 and ARM, so we
> need to use the 'break-before-make' sequence to split/collapse a
> block/table mapping, respectively.
>
> Nonetheless, using atomics on the break side allows fault handlers to
> acquire exclusive access to a PTE (let's just call it locked). Once the
> PTE lock is acquired, it is then safe to assume exclusive access.
>
> Special consideration is required when pruning the page tables in
> parallel. Suppose we are collapsing a table into a block. Allowing
> parallel faults means that a software walker could be in the middle of
> a lower-level traversal when the table is unlinked. Table walkers that
> prune the paging structures must now 'lock' all descendant PTEs,
> effectively asserting exclusive ownership of the substructure (no other
> walker can install something to an already locked pte).
>
> Additionally, for parallel walks we need to punt the freeing of table
> pages to the next RCU sync, as there could be multiple observers of the
> table until all walkers exit the RCU critical section. For this I
> decided to cram an rcu_head into page private data for every table page.
> We wind up spending a bit more memory on table pages now, but lazily
> allocating for rcu callbacks probably doesn't make a lot of sense. Not
> only would we need a large cache of them (think about installing a
> level 1 block) to wire up callbacks on all descendant tables, but we
> also then need to spend memory to actually free memory.

FWIW, we used a similar approach in early versions of the TDP MMU, but
instead of page->private we used page->lru so that more metadata could
be stored in page->private. Ultimately that ended up being too limiting,
and we decided to switch to just using the associated struct
kvm_mmu_page as the list element. I don't know if ARM has an equivalent
construct though.
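
For anyone else trying to picture the deferred-free scheme, here's a
rough sketch of what I understand "cram an rcu_head into page private
data" to mean. To be clear, every name below (stage2_page_header and
friends) is made up by me for illustration, not taken from the series:

#include <linux/mm.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

/* Hypothetical per-table-page header, stashed in page->private. */
struct stage2_page_header {
        struct rcu_head rcu_head;
        struct page *page;
};

/* Presumably created with kmem_cache_create() at init time. */
static struct kmem_cache *stage2_page_header_cache;

static void stage2_free_table_page_rcu(struct rcu_head *head)
{
        struct stage2_page_header *hdr =
                container_of(head, struct stage2_page_header, rcu_head);

        __free_page(hdr->page);
        kmem_cache_free(stage2_page_header_cache, hdr);
}

/*
 * The last reference on a table page has been dropped, but walkers
 * may still be reading the page under rcu_read_lock(), so the actual
 * free has to wait for a grace period to elapse.
 */
static void stage2_defer_free_table_page(struct page *page)
{
        struct stage2_page_header *hdr =
                (struct stage2_page_header *)page_private(page);

        call_rcu(&hdr->rcu_head, stage2_free_table_page_rcu);
}

The nice property, as you point out, is that the rcu_head is
preallocated when the table page is created, so the teardown path never
has to allocate memory in order to free memory.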
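
Similarly, my mental model of the break side described above (atomics
to take exclusive ownership of a PTE) is roughly the following. Again
just a sketch: STAGE2_PTE_LOCKED is an invented name for some invalid
PTE encoding (with the valid bit clear, the remaining bits are
software-defined), not whatever marker the series actually uses:

#include <linux/atomic.h>
#include <linux/bits.h>

#include <asm/kvm_pgtable.h>    /* kvm_pte_t */

/* Invented marker meaning "a walker owns this entry exclusively". */
#define STAGE2_PTE_LOCKED       BIT(10)

/*
 * The 'break' half of break-before-make: atomically knock out the old
 * entry. A concurrent walker that loses the cmpxchg race (or observes
 * the locked marker) backs off and retries the walk.
 */
static bool stage2_try_break_pte(kvm_pte_t *ptep, kvm_pte_t old)
{
        if (cmpxchg(ptep, old, STAGE2_PTE_LOCKED) != old)
                return false;

        /*
         * TLB invalidation for the old mapping would go here, before
         * the 'make' side installs the replacement entry.
         */
        return true;
}

If I'm reading the pruning description right, the table-collapse case
is then just this applied recursively down the subtree being unlinked.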
>
> I tried to organize these patches as best I could w/o introducing
> intermediate breakage.
>
> The first 5 patches are meant mostly as preparatory reworks and, in
> the case of RCU, a nop.
>
> Patch 6 is quite large, but I had a hard time deciding how to change
> the way we link/unlink tables to use atomics without breaking things
> along the way.
>
> Patch 7 probably should come before patch 6, as it informs the other
> read-side fault (perm relax) about when a map is in progress so it'll
> back off.
>
> Patches 8-10 take care of the pruning case, actually locking the child
> ptes instead of simply dropping table page references along the way.
> Note that we cannot assume a pte points to a table/page at this point,
> hence the same helper is called for pre- and leaf-traversal. Guide the
> recursion based on what got yanked from the PTE.
>
> Patches 11-14 wire up everything to schedule rcu callbacks on
> to-be-freed table pages. rcu_barrier() is called on the way out from
> tearing down a stage 2 page table to guarantee all memory associated
> with the VM has actually been cleaned up.
>
> Patches 15-16 loop in the fault handler to the new table traversal game.
>
> Lastly, patch 17 is a nasty bit of debugging residue to spot possible
> table page leaks. Please don't laugh ;-)
>
> Smoke tested with KVM selftests + kvm_page_table_test w/ 2M hugetlb to
> exercise the table pruning code. Haven't done anything beyond this,
> sending as an RFC now to get eyes on the code.
>
> Applies to commit fb649bda6f56 ("Merge tag 'block-5.18-2022-04-15' of
> git://git.kernel.dk/linux-block")
>
> Oliver Upton (17):
>   KVM: arm64: Directly read owner id field in stage2_pte_is_counted()
>   KVM: arm64: Only read the pte once per visit
>   KVM: arm64: Return the next table from map callbacks
>   KVM: arm64: Protect page table traversal with RCU
>   KVM: arm64: Take an argument to indicate parallel walk
>   KVM: arm64: Implement break-before-make sequence for parallel walks
>   KVM: arm64: Enlighten perm relax path about parallel walks
>   KVM: arm64: Spin off helper for initializing table pte
>   KVM: arm64: Tear down unlinked page tables in parallel walk
>   KVM: arm64: Assume a table pte is already owned in post-order
>     traversal
>   KVM: arm64: Move MMU cache init/destroy into helpers
>   KVM: arm64: Stuff mmu page cache in sub struct
>   KVM: arm64: Setup cache for stage2 page headers
>   KVM: arm64: Punt last page reference to rcu callback for parallel walk
>   KVM: arm64: Allow parallel calls to kvm_pgtable_stage2_map()
>   KVM: arm64: Enable parallel stage 2 MMU faults
>   TESTONLY: KVM: arm64: Add super lazy accounting of stage 2 table pages
>
>  arch/arm64/include/asm/kvm_host.h     |   5 +-
>  arch/arm64/include/asm/kvm_mmu.h      |   2 +
>  arch/arm64/include/asm/kvm_pgtable.h  |  14 +-
>  arch/arm64/kvm/arm.c                  |   4 +-
>  arch/arm64/kvm/hyp/nvhe/mem_protect.c |  13 +-
>  arch/arm64/kvm/hyp/nvhe/setup.c       |  13 +-
>  arch/arm64/kvm/hyp/pgtable.c          | 518 +++++++++++++++++++-------
>  arch/arm64/kvm/mmu.c                  | 120 ++++--
>  8 files changed, 503 insertions(+), 186 deletions(-)
>
> --
> 2.36.0.rc0.470.gd361397f0d-goog
>