On Wed, Aug 14, 2024 at 01:54:04PM -0700, Sean Christopherson wrote:
> TL;DR: it's probably worth looking at mmu_stress_test (was: max_guest_memory_test)
> on arm64, specifically the mprotect() testcase[1], as performance is significantly
> worse compared to x86,

Sharing what we discussed offline:

Sean was using a machine w/o FEAT_FWB for this test, so the increased
runtime on arm64 is likely explained by the CMOs we're doing when
creating or invalidating a stage-2 PTE.

Using a machine w/ FEAT_FWB would be better for making these sorts of
cross-architecture comparisons. Beyond CMOs, we do have some

> and there might be bugs lurking in the mmu_notifier flows.

Impossible! :)

> Jumping back to mmap_lock, adding a lock, vma_lookup(), and unlock in x86's page
> fault path for valid VMAs does introduce a performance regression, but only ~30%,
> not the ~6x jump from x86 to arm64. So that too makes it unlikely taking mmap_lock
> is the main problem, though it's still good justification for avoiding mmap_lock in
> the page fault path.

I'm curious how much of that 30% in a microbenchmark would translate to
real-world performance, since it isn't *that* egregious. We also have
other uses for getting at the VMA beyond mapping granularity (MTE and
the VFIO Normal-NC hint) that'd require some attention too.

--
Thanks,
Oliver
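
[*] For anyone following along, this is roughly the lock/vma_lookup()/unlock
pattern under discussion -- a minimal sketch built on the generic mm helpers,
with a made-up fault_hva_has_valid_vma() wrapper, not the actual x86 or arm64
fault-path code:

#include <linux/kvm_host.h>
#include <linux/mm.h>

/*
 * Illustrative only: take mmap_lock for read, look up the VMA covering the
 * faulting hva, sample whatever the architecture needs from it, then drop
 * the lock before doing the rest of the fault.
 */
static bool fault_hva_has_valid_vma(struct kvm *kvm, unsigned long hva)
{
	struct vm_area_struct *vma;
	bool valid;

	mmap_read_lock(kvm->mm);
	vma = vma_lookup(kvm->mm, hva);
	/* Placeholder check; the real fault path cares about more than VM_IO. */
	valid = vma && !(vma->vm_flags & VM_IO);
	mmap_read_unlock(kvm->mm);

	return valid;
}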