Re: Excessive TLB flush ranges

Baoquan He <bhe@xxxxxxxxxx> · Tue, 16 May 2023 10:26:57 +0800

On 05/15/23 at 08:17pm, Uladzislau Rezki wrote:
> On Mon, May 15, 2023 at 06:43:40PM +0200, Thomas Gleixner wrote:
> > Folks!
> > 
> > We're observing massive latencies and slowdowns on ARM32 machines due to
> > excessive TLB flush ranges.
> > 
> > Those can be observed when tearing down a process, which has a seccomp
> > BPF filter installed. ARM32 uses the vmalloc area for module space.
> > 
> > bpf_prog_free_deferred()
> >   vfree()
> >     _vm_unmap_aliases()
> >        collect_per_cpu_vmap_blocks: start:0x95c8d000 end:0x95c8e000 size:0x1000 
> >        __purge_vmap_area_lazy(start:0x95c8d000, end:0x95c8e000)
> > 
> >          va_start:0xf08a1000 va_end:0xf08a5000 size:0x00004000 gap:0x5ac13000 (371731 pages)
> >          va_start:0xf08a5000 va_end:0xf08a9000 size:0x00004000 gap:0x00000000 (     0 pages)
> >          va_start:0xf08a9000 va_end:0xf08ad000 size:0x00004000 gap:0x00000000 (     0 pages)
> >          va_start:0xf08ad000 va_end:0xf08b1000 size:0x00004000 gap:0x00000000 (     0 pages)
> >          va_start:0xf08b3000 va_end:0xf08b7000 size:0x00004000 gap:0x00002000 (     2 pages)
> >          va_start:0xf08b7000 va_end:0xf08bb000 size:0x00004000 gap:0x00000000 (     0 pages)
> >          va_start:0xf08bb000 va_end:0xf08bf000 size:0x00004000 gap:0x00000000 (     0 pages)
> >          va_start:0xf0a15000 va_end:0xf0a17000 size:0x00002000 gap:0x00156000 (   342 pages)
> > 
> >       flush_tlb_kernel_range(start:0x95c8d000, end:0xf0a17000)
> > 
> >          Does 372106 flush operations where only 31 are useful
> > 
> > So for all architectures which lack a mechanism to do a full TLB flush
> > in flush_tlb_kernel_range() this takes ages (4-8ms) and slows down
> > realtime processes on the other CPUs by a factor of two and larger.
> > 
> > So while ARM32, CSKY, NIOS, PPC (some variants), _should_ arguably have
> > a fallback to tlb_flush_all() when the range is too large, there is
> > another issue. I've seen a couple of instances where _vm_unmap_aliases()
> > collects one page and the actual va list has only 2 pages, which might
> > eventually be worth flushing one by one.
> > 
> > I'm not sure whether that's worth it as checking for those gaps might be
> > too expensive for the case where a large number of va entries needs to
> > be flushed.
> > 
> > We'll experiment with a tlb_flush_all() fallback on that ARM32 system in
> > the next days and see how that works out.
> >
> For systems which lack a full TLB flush and where flushing a long range
> is a problem (it takes time), we can probably flush the VAs one by one.
> Currently we calculate a flush range [min:max], and that range includes
> space that might not be mapped at all. Like below:

It's fine if we only calculate a flush range of [min:max] from the VAs.
But in vm_reset_perms(), the flush range is calculated from the affected
direct mapping range and then merged with the VA range. That looks really
strange and surprising. If the vm->pages[] come from a low part of
physical memory, the final merged flush will span a tremendous range. I
wonder why we need to merge the direct map range with the VA range before
flushing. Not sure if I misunderstand it.
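
To put numbers on that, here is a rough user-space sketch, not kernel
code. It reuses the addresses from Thomas's trace and assumes that the
first range (0x95c8d000-0x95c8e000) is the direct-map range handed down
from vm_reset_perms(); that reading is my assumption. The struct and
helper names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 0x1000UL

struct range {
	unsigned long start;	/* inclusive */
	unsigned long end;	/* exclusive */
};

/* Assumed direct-map range: the start/end seen by _vm_unmap_aliases(). */
static const struct range trace_seed = { 0x95c8d000UL, 0x95c8e000UL };

/* The lazily freed vmap areas listed in the trace. */
static const struct range trace_vas[] = {
	{ 0xf08a1000UL, 0xf08a5000UL },
	{ 0xf08a5000UL, 0xf08a9000UL },
	{ 0xf08a9000UL, 0xf08ad000UL },
	{ 0xf08ad000UL, 0xf08b1000UL },
	{ 0xf08b3000UL, 0xf08b7000UL },
	{ 0xf08b7000UL, 0xf08bb000UL },
	{ 0xf08bb000UL, 0xf08bf000UL },
	{ 0xf0a15000UL, 0xf0a17000UL },
};

/* Grow *acc so that it also covers *r, the way a [min:max] merge does. */
static void range_merge(struct range *acc, const struct range *r)
{
	assert(r->start <= r->end);
	if (r->start < acc->start)
		acc->start = r->start;
	if (r->end > acc->end)
		acc->end = r->end;
}

static unsigned long range_pages(const struct range *r)
{
	return (r->end - r->start) / PAGE_SIZE;
}

/* Pages covered by a single flush over the merged [min:max] span. */
static unsigned long merged_flush_pages(const struct range *seed,
					const struct range *vas, size_t n)
{
	struct range acc = *seed;
	size_t i;

	for (i = 0; i < n; i++)
		range_merge(&acc, &vas[i]);
	return range_pages(&acc);
}

/* Pages that actually need flushing: the seed plus every listed area. */
static unsigned long mapped_pages(const struct range *seed,
				  const struct range *vas, size_t n)
{
	unsigned long pages = range_pages(seed);
	size_t i;

	for (i = 0; i < n; i++)
		pages += range_pages(&vas[i]);
	return pages;
}
```

Fed with trace_seed and the eight trace_vas entries, merged_flush_pages()
gives the 372106 pages from the trace and mapped_pages() gives the 31
useful ones. Almost all of the difference is the 371731-page gap between
the first range and the first vmap area.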

> 
> 
>   VA_1                               VA_2      
>  |....|-------------------------|............|
> 10   12                         60           68
> 
> . mapped;
> - not mapped.
> 
> so we flush from 10 until 68. Instead, we can probably flush the VA_1
> range and the VA_2 range separately. On modern systems with many CPUs,
> it could be a big slowdown.
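
The diagram can be modelled with a small user-space sketch; it is only an
illustration, not a proposed patch. It walks a sorted list of ranges and
flushes neighbours together when the hole between them is at most
max_gap. record_flush() is a hypothetical stand-in that only counts what
a real range flush such as flush_tlb_kernel_range() would be asked to
cover, and the max_gap knob is likewise my assumption.

```c
#include <assert.h>
#include <stddef.h>

struct va_range {
	unsigned long start;	/* inclusive */
	unsigned long end;	/* exclusive */
};

struct flush_stats {
	unsigned long calls;	/* number of range flushes issued */
	unsigned long units;	/* total address units covered */
};

/* Hypothetical stand-in for a real range flush; it only keeps counters. */
static void record_flush(struct flush_stats *st,
			 unsigned long start, unsigned long end)
{
	st->calls++;
	st->units += end - start;
}

/*
 * Flush a list of ranges sorted by start address. Neighbours whose hole
 * is at most max_gap are flushed as one range, hole included; larger
 * holes split the flush so the unmapped space is skipped.
 */
static struct flush_stats flush_ranges(const struct va_range *va, size_t n,
				       unsigned long max_gap)
{
	struct flush_stats st = { 0, 0 };
	unsigned long start, end;
	size_t i;

	if (n == 0)
		return st;

	start = va[0].start;
	end = va[0].end;
	for (i = 1; i < n; i++) {
		assert(va[i].start >= end);	/* sorted, non-overlapping */
		if (va[i].start - end <= max_gap) {
			end = va[i].end;	/* extend the pending flush */
			continue;
		}
		record_flush(&st, start, end);
		start = va[i].start;
		end = va[i].end;
	}
	record_flush(&st, start, end);
	return st;
}
```

For the diagram, VA_1 = [10, 12) and VA_2 = [60, 68). With max_gap = 0
this issues two flushes covering 10 units in total. With max_gap = 48,
the size of the hole, it degenerates into the current behaviour: one
flush covering 58 units.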