Excessive TLB flush ranges

Thomas Gleixner <tglx@xxxxxxxxxxxxx> · Mon, 15 May 2023 18:43:40 +0200

Folks!

We're observing massive latencies and slowdowns on ARM32 machines due to
excessive TLB flush ranges.

Those can be observed when tearing down a process that has a seccomp
BPF filter installed. ARM32 uses the vmalloc area for module space.

bpf_prog_free_deferred()
  vfree()
    _vm_unmap_aliases()
       collect_per_cpu_vmap_blocks: start:0x95c8d000 end:0x95c8e000 size:0x1000 
       __purge_vmap_area_lazy(start:0x95c8d000, end:0x95c8e000)

         va_start:0xf08a1000 va_end:0xf08a5000 size:0x00004000 gap:0x5ac13000 (371731 pages)
         va_start:0xf08a5000 va_end:0xf08a9000 size:0x00004000 gap:0x00000000 (     0 pages)
         va_start:0xf08a9000 va_end:0xf08ad000 size:0x00004000 gap:0x00000000 (     0 pages)
         va_start:0xf08ad000 va_end:0xf08b1000 size:0x00004000 gap:0x00000000 (     0 pages)
         va_start:0xf08b3000 va_end:0xf08b7000 size:0x00004000 gap:0x00002000 (     2 pages)
         va_start:0xf08b7000 va_end:0xf08bb000 size:0x00004000 gap:0x00000000 (     0 pages)
         va_start:0xf08bb000 va_end:0xf08bf000 size:0x00004000 gap:0x00000000 (     0 pages)
         va_start:0xf0a15000 va_end:0xf0a17000 size:0x00002000 gap:0x00156000 (   342 pages)

      flush_tlb_kernel_range(start:0x95c8d000, end:0xf0a17000)

         Does 372106 flush operations where only 31 are useful
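
The two numbers above can be reproduced from the trace with a few lines
of userspace C. This is only a sketch of the arithmetic, not kernel code:
it assumes 4 KiB pages (consistent with the size:0x1000 entry), and the
names va_range, pages_in_range() and pages_in_list() are made up for
illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12u /* assumption: 4 KiB pages */

/* Hypothetical helper type: one [start, end) virtual address range. */
struct va_range { uint32_t start, end; };

/* Pages touched by a single flush over the merged range [start, end). */
static uint32_t pages_in_range(uint32_t start, uint32_t end)
{
	return (end - start) >> PAGE_SHIFT;
}

/* Pages that actually need flushing: the sum over the individual ranges. */
static uint32_t pages_in_list(const struct va_range *r, size_t n)
{
	uint32_t total = 0;

	for (size_t i = 0; i < n; i++)
		total += pages_in_range(r[i].start, r[i].end);
	return total;
}

/* The per-CPU vmap block range plus the eight va entries from the trace. */
static const struct va_range trace[] = {
	{ 0x95c8d000, 0x95c8e000 },
	{ 0xf08a1000, 0xf08a5000 },
	{ 0xf08a5000, 0xf08a9000 },
	{ 0xf08a9000, 0xf08ad000 },
	{ 0xf08ad000, 0xf08b1000 },
	{ 0xf08b3000, 0xf08b7000 },
	{ 0xf08b7000, 0xf08bb000 },
	{ 0xf08bb000, 0xf08bf000 },
	{ 0xf0a15000, 0xf0a17000 },
};
```

pages_in_range(0x95c8d000, 0xf0a17000) gives the 372106 pages covered by
the merged flush, while pages_in_list() over the trace entries gives the
31 pages that are actually mapped.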

So for all architectures which lack a mechanism to do a full TLB flush
in flush_tlb_kernel_range() this takes ages (4-8ms) and slows down
realtime processes on the other CPUs by a factor of two or more.

So while ARM32, CSKY, NIOS, PPC (some variants), _should_ arguably have
a fallback to flush_tlb_all() when the range is too large, there is
another issue. I've seen a couple of instances where _vm_unmap_aliases()
collects one page and the actual va list has only 2 pages, so it might
eventually be worth flushing those one by one.

I'm not sure whether that's worth it, as checking for those gaps might be
too expensive for the case where a large number of va entries needs to
be flushed.

We'll experiment with a flush_tlb_all() fallback on that ARM32 system in
the next few days and see how that works out.
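
To make the trade-off concrete, here is a rough userspace sketch of how
such a decision could look. It is not the kernel implementation and not
what we are going to test: the cutoff FULL_FLUSH_THRESHOLD, the enum and
choose_flush_strategy() are all hypothetical, 4 KiB pages are assumed,
and the walk over the list is exactly the extra cost I'm worried about
for long va lists.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12u            /* assumption: 4 KiB pages */
#define FULL_FLUSH_THRESHOLD 64u  /* hypothetical cutoff in pages, needs measurement */

/* Hypothetical helper type: one [start, end) virtual address range. */
struct va_range { uint32_t start, end; };

enum flush_strategy {
	FLUSH_PER_ENTRY,    /* flush each va entry on its own */
	FLUSH_MERGED_RANGE, /* one ranged flush from lowest start to highest end */
	FLUSH_ALL,          /* give up on ranges and flush the whole TLB */
};

static enum flush_strategy
choose_flush_strategy(const struct va_range *r, size_t n)
{
	uint32_t lo = UINT32_MAX, hi = 0, mapped = 0, span;

	if (n == 0)
		return FLUSH_MERGED_RANGE; /* degenerate: nothing to flush */

	/* One pass: track the merged span and count the mapped pages. */
	for (size_t i = 0; i < n; i++) {
		if (r[i].start < lo)
			lo = r[i].start;
		if (r[i].end > hi)
			hi = r[i].end;
		mapped += (r[i].end - r[i].start) >> PAGE_SHIFT;
	}
	span = (hi - lo) >> PAGE_SHIFT;

	if (mapped > FULL_FLUSH_THRESHOLD)
		return FLUSH_ALL;          /* even the useful work is too much */
	if (span <= FULL_FLUSH_THRESHOLD)
		return FLUSH_MERGED_RANGE; /* gaps are small, one call is fine */
	return FLUSH_PER_ENTRY;            /* few pages, widely scattered */
}
```

With the ranges from the trace above (31 mapped pages spread over a span
of 372106 pages) this picks FLUSH_PER_ENTRY. Whether the list walk pays
off for long va lists is the open question.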

Thanks,

        tglx