On Wed, Aug 14, 2024 at 04:28:00PM -0700, Oliver Upton wrote:
> On Wed, Aug 14, 2024 at 01:54:04PM -0700, Sean Christopherson wrote:
> > TL;DR: it's probably worth looking at mmu_stress_test (was: max_guest_memory_test)
> > on arm64, specifically the mprotect() testcase[1], as performance is significantly
> > worse compared to x86,
>
> Sharing what we discussed offline:
>
> Sean was using a machine w/o FEAT_FWB for this test, so the increased
> runtime on arm64 is likely explained by the CMOs we're doing when
> creating or invalidating a stage-2 PTE.
>
> Using a machine w/ FEAT_FWB would be better for making this sort of
> cross-architecture comparison.

Beyond CMOs, we do have some ... heavy barriers (e.g. DSB(ishst)) that we
use to ensure page table updates are visible to the system. So there
could still be some arch-specific quirks that'll show up in the test.

> > and there might be bugs lurking in the mmu_notifier flows.
>
> Impossible! :)
>
> > Jumping back to mmap_lock, adding a lock, vma_lookup(), and unlock in x86's page
> > fault path for valid VMAs does introduce a performance regression, but only ~30%,
> > not the ~6x jump from x86 to arm64. So that too makes it unlikely that taking
> > mmap_lock is the main problem, though it's still good justification for avoiding
> > mmap_lock in the page fault path.
>
> I'm curious how much of that 30% in a microbenchmark would translate to
> real-world performance, since it isn't *that* egregious. We also have
> other uses for getting at the VMA beyond mapping granularity (MTE and
> the VFIO Normal-NC hint) that'd require some attention too.
>
> --
> Thanks,
> Oliver
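
To make the FEAT_FWB point a bit more concrete, here is a rough sketch of
what the arm64 stage-2 map path has to do without FWB. This is only an
illustration of the behavior described above, not the actual arch/arm64
KVM page-table code; dcache_clean_inval_poc() and dsb(ishst) are real
kernel primitives, but the surrounding function and its signature are
invented for the example:

/* Illustrative sketch only -- not the real arch/arm64/kvm pgtable code. */
static void stage2_install_pte_sketch(u64 *ptep, u64 new_pte, void *va,
                                      size_t size, bool has_fwb)
{
        /*
         * Without FEAT_FWB, KVM can't force guest memory accesses to be
         * cacheable, so the page is cleaned+invalidated to the Point of
         * Coherency before being mapped at stage-2.  With FWB this CMO
         * (and its cost) goes away.
         */
        if (!has_fwb)
                dcache_clean_inval_poc((unsigned long)va,
                                       (unsigned long)va + size);

        WRITE_ONCE(*ptep, new_pte);

        /*
         * Make the PTE update visible to the page-table walkers before
         * anything relies on the mapping -- the DSB(ishst) mentioned
         * above, which isn't free either.
         */
        dsb(ishst);
}

With FWB the clean+invalidate disappears entirely, which is why comparing
against a non-FWB machine skews the cross-architecture numbers.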
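
Similarly, the "lock, vma_lookup(), and unlock" experiment Sean describes
amounts to roughly the following shape on every guest page fault. Again a
sketch rather than the actual patch: mmap_read_lock(), vma_lookup(), and
mmap_read_unlock() are the real mm APIs, but the wrapper function here is
made up for illustration:

/* Illustrative sketch of the experiment, not the actual KVM x86 change. */
static bool fault_hva_has_valid_vma(struct kvm_vcpu *vcpu, unsigned long hva)
{
        struct mm_struct *mm = vcpu->kvm->mm;
        struct vm_area_struct *vma;

        mmap_read_lock(mm);
        vma = vma_lookup(mm, hva);      /* NULL if no VMA covers hva */
        mmap_read_unlock(mm);

        return vma != NULL;
}

Per the numbers above, even this read-lock/lookup/unlock on every fault
costs ~30% on the mprotect() testcase -- noticeable, but nowhere near the
~6x x86 vs. arm64 gap.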