> On May 16, 2023, at 7:38 AM, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>
> There is a world outside of x86, but even on x86 it's borderline silly
> to take the whole TLB out when you can flush 3 TLB entries one by one
> with exactly the same number of IPIs, i.e. _one_. No?

I just want to re-raise points that were made in the past, including in
the discussion I sent before, which match my experience. Feel free to
reject them, but I think you should not ignore them.

In a nutshell, there is a non-trivial tradeoff. Tracking the exact
ranges that need to be flushed can require, especially with IPI-based
TLB invalidation, additional logic and more cache lines that have to
travel between the cores' caches.

The latter - the cache lines that hold the ranges to be flushed - are
the main issue. They can induce overhead that negates the benefit if it
turns out that in most cases many pages are flushed. Data structures
such as linked lists are therefore not suitable for holding the ranges
to be flushed, as they are not cache-friendly. The data transferred
between the cores to indicate which ranges should be flushed would
ideally be cache-line aligned and fit into a single cache line (see the
sketch at the end of this mail).

It is possible that for kernel ranges, where the stride is always the
base-page size (4KB on x86), you can come up with a more condensed way
of communicating TLB flush ranges than for userspace pages. Perhaps the
workload characteristics are different. But note that major parts of
the rationale behind the changes you suggest could also apply to TLB
invalidations of userspace mappings, as done by the tlb_gather and UBC
mechanisms. In those cases the rationale, at least on x86, was that
since the CPU performs TLB refills very efficiently, the extra
complexity and overhead are likely not worth the trouble.

I hope my feedback is useful. Here is again a link to a discussion from
2015 on this subject:

https://lore.kernel.org/all/CA+55aFwVUkdaf0_rBk7uJHQjWXu+OcLTHc6FKuCn0Cb2Kvg9NA@xxxxxxxxxxxxxx/

There are several patches that showed the benefit of reducing cache
contention during TLB shootdown. Here is one, for example:

https://lore.kernel.org/all/20190423065706.15430-1-namit@xxxxxxxxxx/
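
To make the single-cache-line point concrete, here is a minimal
user-space sketch. It is not code from any of the patches above; the
struct layout, the FLUSH_MAX_RANGES constant and the 64-byte line size
are assumptions for illustration only. The idea is that the flush
request that travels between cores is one fixed-size, cache-line
aligned blob, with a full-flush fallback when the ranges do not fit:

/* Sketch only: illustrates keeping the flush request within a single
 * cache line so the remote core pays one cache-line transfer to learn
 * what to flush. All names and sizes here are assumptions. */
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define CACHE_LINE_SIZE		64	/* assumed x86 line size */
#define FLUSH_MAX_RANGES	3	/* what still fits in one line */

struct flush_request {
	unsigned char nr;		/* valid entries in range[]   */
	bool flush_all;			/* fallback: flush everything */
	struct {
		unsigned long start;	/* inclusive */
		unsigned long end;	/* exclusive */
	} range[FLUSH_MAX_RANGES];
} __attribute__((aligned(CACHE_LINE_SIZE)));

/* The whole request must not spill into a second cache line. */
static_assert(sizeof(struct flush_request) <= CACHE_LINE_SIZE,
	      "flush request spills into a second cache line");

int main(void)
{
	struct flush_request req = {
		.nr = 2,
		.range = {
			{ 0xffffc90000000000UL, 0xffffc90000001000UL },
			{ 0xffffc90000003000UL, 0xffffc90000004000UL },
		},
	};

	/* Stand-in for what the IPI handler would do with the request. */
	if (req.flush_all || req.nr > FLUSH_MAX_RANGES) {
		puts("fallback: flush the whole TLB");
	} else {
		for (unsigned int i = 0; i < req.nr; i++)
			printf("flush [%#lx, %#lx)\n",
			       req.range[i].start, req.range[i].end);
	}
	return 0;
}

Whether three ranges per line is the right number obviously depends on
how the stride is encoded; the point is only that the remote core reads
exactly one cache line, instead of chasing a linked list across many.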