On Tue, 30 Aug 2022 20:41:18 +0100, Oliver Upton <oliver.upton@xxxxxxxxx> wrote: > > Presently KVM only takes a read lock for stage 2 faults if it believes > the fault can be fixed by relaxing permissions on a PTE (write unprotect > for dirty logging). Otherwise, stage 2 faults grab the write lock, which > predictably can pile up all the vCPUs in a sufficiently large VM. > > Like the TDP MMU for x86, this series loosens the locking around > manipulations of the stage 2 page tables to allow parallel faults. RCU > and atomics are exploited to safely build/destroy the stage 2 page > tables in light of multiple software observers. > > Patches 1-2 are a cleanup to the way we collapse page tables, with the > added benefit of narrowing the window of time a range of memory is > unmapped. > > Patches 3-7 are minor cleanups and refactorings to the way KVM reads > PTEs and traverses the stage 2 page tables to make it amenable to > concurrent modification. > > Patches 8-9 use RCU to punt page table cleanup out of the vCPU fault > path, which should also improve fault latency a bit. > > Patches 10-14 implement the meat of this series, extending the > 'break-before-make' sequence with atomics to realize locking on PTEs. > Effectively a cmpxchg() is used to 'break' a PTE, thereby serializing > changes to a given PTE. > > Finally, patch 15 flips the switch on all the new code and starts > grabbing the read side of the MMU lock for stage 2 faults. > > Applies to 6.0-rc3. Tested with KVM selftests and benchmarked with > dirty_log_perf_test, scaling from 1 to 48 vCPUs with 4GB of memory per > vCPU backed by THP. > > ./dirty_log_perf_test -s anonymous_thp -m 2 -b 4G -v ${NR_VCPUS} > > Time to dirty memory: > > +-------+---------+------------------+ > | vCPUs | 6.0-rc3 | 6.0-rc3 + series | > +-------+---------+------------------+ > | 1 | 0.89s | 0.92s | > | 2 | 1.13s | 1.18s | > | 4 | 2.42s | 1.25s | > | 8 | 5.03s | 1.36s | > | 16 | 8.84s | 2.09s | > | 32 | 19.60s | 4.47s | > | 48 | 31.39s | 6.22s | > +-------+---------+------------------+ > > It is also worth mentioning that the time to populate memory has > improved: > > +-------+---------+------------------+ > | vCPUs | 6.0-rc3 | 6.0-rc3 + series | > +-------+---------+------------------+ > | 1 | 0.19s | 0.18s | > | 2 | 0.25s | 0.21s | > | 4 | 0.38s | 0.32s | > | 8 | 0.64s | 0.40s | > | 16 | 1.22s | 0.54s | > | 32 | 2.50s | 1.03s | > | 48 | 3.88s | 1.52s | > +-------+---------+------------------+ > > RFC: https://lore.kernel.org/kvmarm/20220415215901.1737897-1-oupton@xxxxxxxxxx/ > > RFC -> v1: > - Factored out page table teardown from kvm_pgtable_stage2_map() > - Use the RCU callback to tear down a subtree, instead of scheduling a > callback for every individual table page. > - Reorganized series to (hopefully) avoid intermediate breakage. > - Dropped the use of page headers, instead stuffing KVM metadata into > page::private directly > > Oliver Upton (14): > KVM: arm64: Add a helper to tear down unlinked stage-2 subtrees > KVM: arm64: Tear down unlinked stage-2 subtree after break-before-make > KVM: arm64: Directly read owner id field in stage2_pte_is_counted() > KVM: arm64: Read the PTE once per visit > KVM: arm64: Split init and set for table PTE > KVM: arm64: Return next table from map callbacks > KVM: arm64: Document behavior of pgtable visitor callback > KVM: arm64: Protect page table traversal with RCU > KVM: arm64: Free removed stage-2 tables in RCU callback > KVM: arm64: Atomically update stage 2 leaf attributes in parallel > walks > KVM: arm64: Make changes block->table to leaf PTEs parallel-aware > KVM: arm64: Make leaf->leaf PTE changes parallel-aware > KVM: arm64: Make table->block changes parallel-aware > KVM: arm64: Handle stage-2 faults in parallel > > arch/arm64/include/asm/kvm_pgtable.h | 59 ++++- > arch/arm64/kvm/hyp/nvhe/mem_protect.c | 7 +- > arch/arm64/kvm/hyp/nvhe/setup.c | 4 +- > arch/arm64/kvm/hyp/pgtable.c | 360 ++++++++++++++++---------- > arch/arm64/kvm/mmu.c | 65 +++-- > 5 files changed, 325 insertions(+), 170 deletions(-) This fails to build on -rc4: MODPOST vmlinux.symvers MODINFO modules.builtin.modinfo GEN modules.builtin CC .vmlinux.export.o LD .tmp_vmlinux.kallsyms1 ld: Unexpected GOT/PLT entries detected! ld: Unexpected run-time procedure linkages detected! ld: ID map text too big or misaligned ld: arch/arm64/kvm/hyp/nvhe/kvm_nvhe.o: in function `__kvm_nvhe_kvm_pgtable_walk': (.hyp.text+0xdc0c): undefined reference to `__kvm_nvhe___rcu_read_lock' ld: (.hyp.text+0xdc1c): undefined reference to `__kvm_nvhe___rcu_read_unlock' ld: arch/arm64/kvm/hyp/nvhe/kvm_nvhe.o: in function `__kvm_nvhe_kvm_pgtable_get_leaf': (.hyp.text+0xdc80): undefined reference to `__kvm_nvhe___rcu_read_lock' ld: (.hyp.text+0xdc90): undefined reference to `__kvm_nvhe___rcu_read_unlock' ld: arch/arm64/kvm/hyp/nvhe/kvm_nvhe.o: in function `__kvm_nvhe_kvm_pgtable_hyp_map': (.hyp.text+0xddb0): undefined reference to `__kvm_nvhe___rcu_read_lock' ld: (.hyp.text+0xddc0): undefined reference to `__kvm_nvhe___rcu_read_unlock' ld: arch/arm64/kvm/hyp/nvhe/kvm_nvhe.o: in function `__kvm_nvhe_kvm_pgtable_hyp_unmap': (.hyp.text+0xde44): undefined reference to `__kvm_nvhe___rcu_read_lock' ld: (.hyp.text+0xde50): undefined reference to `__kvm_nvhe___rcu_read_unlock' ld: arch/arm64/kvm/hyp/nvhe/kvm_nvhe.o: in function `__kvm_nvhe_kvm_pgtable_hyp_destroy': (.hyp.text+0xdf40): undefined reference to `__kvm_nvhe___rcu_read_lock' ld: (.hyp.text+0xdf50): undefined reference to `__kvm_nvhe___rcu_read_unlock' ld: arch/arm64/kvm/hyp/nvhe/kvm_nvhe.o: in function `__kvm_nvhe_kvm_pgtable_stage2_map': (.hyp.text+0xe16c): undefined reference to `__kvm_nvhe___rcu_read_lock' ld: (.hyp.text+0xe17c): undefined reference to `__kvm_nvhe___rcu_read_unlock' ld: arch/arm64/kvm/hyp/nvhe/kvm_nvhe.o: in function `__kvm_nvhe_kvm_pgtable_stage2_set_owner': (.hyp.text+0xe264): undefined reference to `__kvm_nvhe___rcu_read_lock' ld: (.hyp.text+0xe274): undefined reference to `__kvm_nvhe___rcu_read_unlock' ld: arch/arm64/kvm/hyp/nvhe/kvm_nvhe.o: in function `__kvm_nvhe_kvm_pgtable_stage2_unmap': (.hyp.text+0xe2d4): undefined reference to `__kvm_nvhe___rcu_read_lock' ld: (.hyp.text+0xe2e4): undefined reference to `__kvm_nvhe___rcu_read_unlock' ld: arch/arm64/kvm/hyp/nvhe/kvm_nvhe.o: in function `__kvm_nvhe_kvm_pgtable_stage2_flush': (.hyp.text+0xe5b4): undefined reference to `__kvm_nvhe___rcu_read_lock' ld: (.hyp.text+0xe5c4): undefined reference to `__kvm_nvhe___rcu_read_unlock' ld: arch/arm64/kvm/hyp/nvhe/kvm_nvhe.o: in function `__kvm_nvhe_kvm_pgtable_stage2_destroy': (.hyp.text+0xe6f0): undefined reference to `__kvm_nvhe___rcu_read_lock' ld: (.hyp.text+0xe700): undefined reference to `__kvm_nvhe___rcu_read_unlock' make[3]: *** [Makefile:1169: vmlinux] Error 1 make[2]: *** [debian/rules:7: build-arch] Error 2 as this drags the RCU read-lock into EL2, and that's not going to work... The following fixes it, but I wonder how you tested it. Thanks, M. diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h index dc839db86a1a..adf170122daf 100644 --- a/arch/arm64/include/asm/kvm_pgtable.h +++ b/arch/arm64/include/asm/kvm_pgtable.h @@ -580,7 +580,7 @@ enum kvm_pgtable_prot kvm_pgtable_stage2_pte_prot(kvm_pte_t pte); */ enum kvm_pgtable_prot kvm_pgtable_hyp_pte_prot(kvm_pte_t pte); -#if defined(__KVM_NVHE_HYPERVISOR___) +#if defined(__KVM_NVHE_HYPERVISOR__) static inline void kvm_pgtable_walk_begin(void) {} static inline void kvm_pgtable_walk_end(void) {} -- Without deviation from the norm, progress is not possible.