> On May 17, 2023, at 3:31 AM, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>
>>> The point is that the generic vmalloc code is making assumptions which
>>> are x86 centric and not even necessarily true on x86.
>>>
>>> Whether or not this is beneficial on x86 is a completely separate
>>> debate.
>>
>> I fully understand that if you reduce multiple TLB shootdowns (IPI-wise)
>> into 1, it is (pretty much) all benefit and there is no tradeoff. I was
>> focusing on the question of whether it is beneficial also to do precise
>> TLB flushing, and the tradeoff there is less clear (especially since the
>> kernel uses 2MB pages).
>
> For the vmalloc() area mappings? Not really. The main penalty of doing a
> global flush is the innocent bystander TLB translations. These are
> likely the regular mappings, not the vmalloc ones.
>
>> My experience with non-IPI based TLB invalidations is more limited.
>> IIUC the usage model is that the TLB shootdowns should be invoked ASAP
>> (perhaps each range can be batched, but there is no sense in batching
>> multiple ranges), and then later you would issue some barrier to ensure
>> that prior TLB shootdown invocations have completed.
>>
>> If that is the (use) case, I am not sure the abstraction you used in
>> your prototype is the best one.
>
> The way arm/arm64 implement that in software is:
>
>      magic_barrier1();
>      flush_range_with_magic_opcodes();
>      magic_barrier2();
>
> And for that use case having the list with individual ranges is not
> really wrong.
>
> Maybe ARM[64] could do this smarter, but that would require rewriting a
> lot of code, I assume.

What you say makes sense - and I actually see that flush_tlb_page_nosync()
needs a memory barrier.

I just encountered recent patches that do the flushing on ARM in an async
manner, as I described. That is the reason I assumed it is more efficient:

https://lore.kernel.org/linux-mm/20230410134352.4519-3-yangyicong@xxxxxxxxxx/

>
>>> There is also a debate required on whether a wholesale "flush on _ALL_
>>> CPUs" is justified when some of those CPUs are completely isolated and
>>> have absolutely no chance of being affected by it. This process-bound
>>> seccomp/BPF muck clearly does not justify kicking isolated CPUs out of
>>> their computation in user space just because…
>>
>> I hope you would excuse my ignorance (I am sure you won't), but aren't
>> the seccomp/BPF VMAP ranges mapped in all processes (considering PTI,
>> of course)? Are you suggesting you want a per-process kernel address
>> space? (which can make sense, I guess)
>
> Right. The BPF muck is mapped in the global kernel space, but e.g. the
> seccomp filters are individual per process. At least that's how I
> understand it, but I might be completely wrong.

After re-reading the seccomp man page: the filters are not entirely
"private" to each process, as they are preserved across fork/exec. Yet,
one can imagine creating non-global kernel mappings, established per
process, that would hold the seccomp filters. That would remove the need
for a system-wide flush when the process dies.
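
To make sure we are talking about the same two flushing models above, here
is a rough sketch. It is not the actual arm64 code: struct flush_range,
dsb_ishst(), dsb_ish() and tlbi_vaddr_is() are made-up stand-ins for the
real dsb(ishst)/dsb(ish) barriers and the per-page broadcast TLBI, and the
loops ignore stride, level and ASID details:

        /*
         * Sketch only - hypothetical helpers, not the real arm64 API.
         *
         * Synchronous model (roughly the magic_barrier pattern quoted
         * above): one barrier before, per-page broadcast TLBIs, one
         * barrier after, for every range.
         */
        struct flush_range {
                unsigned long start;
                unsigned long end;
        };

        static void flush_ranges_sync(const struct flush_range *r, int nr)
        {
                unsigned long addr;
                int i;

                for (i = 0; i < nr; i++) {
                        dsb_ishst();    /* order prior PTE updates vs. TLBI */
                        for (addr = r[i].start; addr < r[i].end; addr += PAGE_SIZE)
                                tlbi_vaddr_is(addr);    /* broadcast invalidate */
                        dsb_ish();      /* wait for the TLBIs to complete */
                }
        }

        /*
         * Async model I described: issue the nosync invalidations as each
         * range becomes known, and pay for a single completion barrier at
         * the very end, once, for all of them.
         */
        static void flush_range_nosync(const struct flush_range *r)
        {
                unsigned long addr;

                dsb_ishst();
                for (addr = r->start; addr < r->end; addr += PAGE_SIZE)
                        tlbi_vaddr_is(addr);    /* no completion barrier here */
        }

        static void flush_wait(void)
        {
                dsb_ish();      /* one barrier covers every TLBI issued so far */
        }

The point of the second variant is that the completion barrier is paid
once per batch rather than once per range, which, as far as I can tell,
is what the patches linked above do for the reclaim path.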
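
And for the last point, a purely illustrative sketch of what I mean by
per-process, non-global mappings for the filters. vmap_private() does not
exist, and I am hand-waving about where such an mm-private region would
live in the kernel address space:

        /*
         * Purely illustrative - vmap_private() is made up.  The idea: map
         * the filter into an mm-private part of the kernel address space
         * with a non-global PTE (no _PAGE_GLOBAL on x86, nG set on arm64),
         * so its TLB entries are tagged with the process's PCID/ASID.
         */
        static void *map_seccomp_filter(struct mm_struct *mm,
                                        struct page **pages, int nr)
        {
                /* PAGE_KERNEL normally carries _PAGE_GLOBAL; drop it */
                pgprot_t prot = __pgprot(pgprot_val(PAGE_KERNEL) & ~_PAGE_GLOBAL);

                /* hypothetical: like vmap(), but into an mm-private region */
                return vmap_private(mm, pages, nr, prot);
        }

When the process dies the mapping dies with the mm, so the normal
PCID/ASID invalidation on mm teardown replaces the system-wide flush.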