On Wed, May 17 2023 at 15:43, Mark Rutland wrote:
> On Wed, May 17, 2023 at 12:31:04PM +0200, Thomas Gleixner wrote:
>> The way how arm/arm64 implement that in software is:
>>
>>     magic_barrier1();
>>     flush_range_with_magic_opcodes();
>>     magic_barrier2();
>
> FWIW, on arm64 that sequence (for leaf entries only) is:
>
>	/*
>	 * Make sure prior writes to the page table entries are visible to all
>	 * CPUs, so that *subsequent* page table walks will see the latest
>	 * values.
>	 *
>	 * This is roughly __smp_wmb().
>	 */
>	dsb(ishst)			// AKA magic_barrier1()
>
>	/*
>	 * The "TLBI *IS, <addr>" instructions send a message to all other
>	 * CPUs, essentially saying "please start invalidating entries for
>	 * <addr>"
>	 *
>	 * The "TLBI *ALL*IS" instructions send a message to all other CPUs,
>	 * essentially saying "please start invalidating all entries".
>	 *
>	 * In theory, this could be for discontiguous ranges.
>	 */
>	flush_range_with_magic_opcodes()
>
>	/*
>	 * Wait for acknowledgement that all prior TLBIs have completed. This
>	 * also ensures that all accesses using those translations have also
>	 * completed.
>	 *
>	 * This waits for all relevant CPUs to acknowledge completion of any
>	 * prior TLBIs sent by this CPU.
>	 */
>	dsb(ish)			// AKA magic_barrier2()
>	isb()
>
> So you can batch a bunch of "TLBI *IS, <addr>" with a single barrier for
> completion, or you can use a single "TLBI *ALL*IS" to invalidate everything.
>
> It can still be worth using the latter, as arm64 has done since commit:
>
>   05ac65305437e8ef ("arm64: fix soft lockup due to large tlb flush range")
>
> ... as for a large range, issuing a bunch of "TLBI *IS, <addr>" can take a
> while, and can require the recipient CPUs to do more work than they might
> have to do for a single "TLBI *ALL*IS".

And looking at the changelog and backtrace:

    PC is at __cpu_flush_kern_tlb_range+0xc/0x40
    LR is at __purge_vmap_area_lazy+0x28c/0x3ac

I'm willing to bet that this is exactly the same scenario of a direct map +
module area flush. That's the only one we found so far which creates
insanely large ranges.

The other effects of coalescing can still result in seriously oversized
flushes for just a couple of pages. The worst I've seen aside from that BPF
muck was a 'flush 2 pages' with a resulting range of ~3.8MB.

> The point at which invalidating everything is better depends on a number of
> factors (e.g. the impact of all CPUs needing to make new page table walks),
> and currently we have an arbitrary boundary where we choose to invalidate
> everything (which has been tweaked a bit over time); there isn't really a
> one-size-fits-all best answer.

I'm well aware of that :)

Thanks,

        tglx
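
P.S. For the archives, a rough C sketch of how "batch per-address TLBIs with
one completion barrier" versus "invalidate everything past some cutoff" plays
out. This is not the arm64 implementation; pte_writes_visible(), tlbi_va_is(),
tlbi_all_is(), tlbi_sync() and the threshold value are made-up stand-ins for
the barriers and TLBI opcodes described above.

	/* Made-up helpers standing in for the real barriers/opcodes above. */
	void pte_writes_visible(void);		/* dsb(ishst), AKA magic_barrier1() */
	void tlbi_va_is(unsigned long addr);	/* "TLBI *IS, <addr>" */
	void tlbi_all_is(void);			/* "TLBI *ALL*IS" */
	void tlbi_sync(void);			/* dsb(ish); isb(), AKA magic_barrier2() */

	/* Arbitrary cutoff; the real boundary is tuned and has moved over time. */
	#define FLUSH_ALL_THRESHOLD	512UL

	static void sketch_flush_tlb_range(unsigned long start, unsigned long end,
					   unsigned long page_size)
	{
		unsigned long nr_pages = (end - start) / page_size;
		unsigned long addr;

		/* Make the page table updates visible before any invalidation. */
		pte_writes_visible();

		if (nr_pages > FLUSH_ALL_THRESHOLD) {
			/* One message: "please start invalidating all entries". */
			tlbi_all_is();
		} else {
			/* Batch one invalidation message per page... */
			for (addr = start; addr < end; addr += page_size)
				tlbi_va_is(addr);
		}

		/* ...and pay for a single completion barrier at the end. */
		tlbi_sync();
	}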