Re: Excessive TLB flush ranges

Nadav!

On Tue, May 16 2023 at 18:23, Nadav Amit wrote:
>> On May 16, 2023, at 5:23 PM, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>>> I'm not ignoring them and I'm well aware of these issues. No need to
>>> repeat them over and over. I'm old but not senile yet.
>
> Thomas, no disrespect was intended. I initially just sent the link and I
> had a sense (based on my past experience) that nobody clicked on it.

All good.

>> It makes a whole lot of a difference whether you do 5 IPIs in a row
>> which all need to get a cache line updated or if you have _one_ which
>> needs a couple of cache lines updated.
>
> Obviously, if the question is 5 IPIs or 1 IPI with more flushing data,
> the 1 IPI wins. The question I was focusing on is whether to do 1 IPI
> with a potentially global flush or with a detailed list of ranges to
> flush.

Correct and there is obviously a tradeoff too, which has yet to be
determined.
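
To make the tradeoff concrete, here is a minimal sketch (made-up names,
not a proposal) of what "one IPI carrying a batched list of ranges"
could look like. flush_range_locally() is just a stand-in for whatever
does the actual invalidation on the target CPU:

    struct flush_range {
        unsigned long start;
        unsigned long end;
    };

    struct flush_request {
        unsigned int       nr;
        struct flush_range ranges[8];  /* all ranges travel in one IPI */
    };

    /* Stand-in for the per-CPU invalidation of one range */
    static void flush_range_locally(unsigned long start, unsigned long end);

    /* Runs on every target CPU in IPI context */
    static void flush_ranges_ipi(void *info)
    {
        struct flush_request *req = info;
        unsigned int i;

        for (i = 0; i < req->nr; i++)
            flush_range_locally(req->ranges[i].start, req->ranges[i].end);
    }

One on_each_cpu(flush_ranges_ipi, &req, 1) then replaces req.nr
individual shootdowns.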

>> INVLPG is not serializing so the CPU can pull in the next required cache
>> line(s) on the VA list during that.
>
> Indeed, but ChatGPT says (yes, I see you making fun of me already):
> “however, this doesn't mean INVLPG has no impact on the pipeline. INVLPG
> can cause a pipeline stall because the TLB entry invalidation must be
> completed before subsequent instructions that might rely on the TLB can
> be executed correctly.”
>
> So I am not sure that your claim is exactly correct.

The key is a subsequent instruction which might depend on the to-be-flushed
TLB entry. That's obvious, but I'm having a hard time constructing such a
dependent instruction in this case.
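
Just to illustrate what I mean, a sketch (not the real implementation)
of walking such a VA list with INVLPG; nothing in the loop consumes the
translations being zapped, so the loads fetching the next list entries
can be issued while the invalidation is still in flight:

    struct flush_page_list {
        unsigned long           addr;  /* page aligned VA */
        struct flush_page_list *next;
    };

    static inline void invlpg_one(unsigned long addr)
    {
        asm volatile("invlpg (%0)" :: "r" (addr) : "memory");
    }

    static void flush_va_list(struct flush_page_list *list)
    {
        for (; list; list = list->next)
            invlpg_one(list->addr);  /* the ->next load can overlap */
    }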

>> These cache lines are _not_
>> contended at that point because _all_ of these data structures are no
>> longer globally accessible (mis-speculation aside) and therefore not
>> exclusive (misalignment aside, but you have to prove that this is an
>> issue).
>
> This is not entirely true. Indeed whether you have 1 remote core or N
> remote cores is not the whole issue (putting aside NUMA). But you will
> get first a snoop to the initiator cache by the responding core, and
> then, after the TLB invalidation is completed, an RFO by the initiator
> once it writes to the cache again. If the invalidation data is on the
> stack (as you did), this is even more likely to happen shortly after.

That's correct and there might be smarter ways to handle that list muck.
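
One obvious candidate, purely as an illustrative sketch and nothing I
have measured: keep the flush data in a dedicated cacheline-aligned
slot instead of on the initiator's stack, so the responders' snoops and
the initiator's subsequent stack writes do not bounce the same lines:

    #define NR_CPUS 64  /* illustration only */

    struct flush_data {
        unsigned long start;
        unsigned long end;
        unsigned int  nr_pages;
    } __attribute__((aligned(64)));  /* keep it on its own cache line */

    /* One slot per initiating CPU, never mixed with stack traffic */
    static struct flush_data flush_slot[NR_CPUS];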

>> So just dismissing this based on 10 year old experience is not really
>> helpful, though I'm happy to confirm your points once I've had the time
>> and opportunity to actually run real testing over it, unless you beat
>> me to it.
>
> I really don’t know what “dismissing” you are talking about.

Sorry, I was overreacting due to increased grumpiness.

> I do have relatively recent experience with the overhead of caching
> effects on TLB shootdown time. It can become very apparent. You can
> find some numbers in, for instance, the patch of mine I quoted in my
> previous email.
>
> There are additional opportunities to reduce the caching effect for
> x86, such as combining the SMP-code metadata with the TLB-invalidation
> metadata (which is out of the scope) that I saw having performance
> benefit. That’s all to say that caching effect is not something to
> be considered obsolete.

I never claimed that it does not matter. That's surely part of the
decision making and something to investigate.

>> The point is that the generic vmalloc code is making assumptions which
>> are x86-centric and not even necessarily true on x86.
>> 
>> Whether or not this is beneficial on x86 is a completely separate
>> debate.
>
> I fully understand that if you reduce multiple TLB shootdowns (IPI-wise)
> into 1, it is (pretty much) all benefit and there is no tradeoff. I was
> focusing on the question of whether it is beneficial also to do precise
> TLB flushing, and the tradeoff there is less clear (especially since the
> kernel uses 2MB pages).

For the vmalloc() area mappings? Not really.
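
The usual way to handle that tradeoff is a ceiling heuristic. Rough
sketch with made-up names and an arbitrary threshold (x86 has a similar
tunable for user mappings); full_flush_locally() stays abstract because
global kernel mappings need more than a plain CR3 reload:

    #define FLUSH_CEILING 33       /* illustrative, not a kernel constant */
    #define PAGE_SIZE     4096UL

    /* Stand-ins for the actual per-CPU invalidation primitives */
    static void full_flush_locally(void);
    static void flush_one_page_locally(unsigned long addr);

    static void flush_kernel_range(unsigned long start, unsigned long end)
    {
        unsigned long nr_pages = (end - start) / PAGE_SIZE;
        unsigned long addr;

        if (nr_pages > FLUSH_CEILING) {
            /* Refilling the TLB is cheaper than walking the range */
            full_flush_locally();
            return;
        }

        for (addr = start; addr < end; addr += PAGE_SIZE)
            flush_one_page_locally(addr);
    }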

> My experience with non-IPI based TLB invalidations is more limited. IIUC
> the usage model is that the TLB shootdowns should be invoked ASAP
> (perhaps each range can be batched, but there is no sense of batching
> multiple ranges), and then later you would issue some barrier to ensure
> prior TLB shootdown invocations have been completed.
>
> If that is the (use) case, I am not sure the abstraction you used in
> your prototype is the best one.

The way arm/arm64 implement that in software is:

    magic_barrier1();
    flush_range_with_magic_opcodes();
    magic_barrier2();

And for that use case having the list of individual ranges is not
really wrong.

Maybe ARM[64] could do this smarter, but that would require rewriting a
lot of code, I assume.
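
Spelled out a bit more, and only as a rough sketch of what those
barriers and opcodes map to on arm64 (the real code in
arch/arm64/include/asm/tlbflush.h additionally deals with ASIDs, levels
and range-TLBI):

    #define PAGE_SIZE 4096UL  /* sketch assumes 4K pages */

    static void flush_kernel_range_arm64(unsigned long start, unsigned long end)
    {
        unsigned long addr;

        asm volatile("dsb ishst" ::: "memory");      /* magic_barrier1() */

        for (addr = start; addr < end; addr += PAGE_SIZE)
            /* TLBI takes VA[55:12] in the register operand */
            asm volatile("tlbi vaale1is, %0" :: "r" (addr >> 12) : "memory");

        asm volatile("dsb ish" ::: "memory");        /* magic_barrier2() */
        asm volatile("isb" ::: "memory");
    }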

>> There is also a debate required whether a wholesale "flush on _ALL_
>> CPUs" is justified when some of those CPUs are completely isolated and
>> have absolutely no chance to be affected by that. This process bound
>> seccomp/BPF muck clearly does not justify kicking isolated CPUs out of
>> their computation in user space just because…
>
> I hope you would excuse my ignorance (I am sure you won’t), but aren't
> the seccomp/BPF VMAP ranges mapped in all processes (considering
> PTI of course)? Are you suggesting you want a per-process kernel
> address space? (which can make sense, I guess)

Right. The BPF muck is mapped in the global kernel space, but e.g. the
seccomp filters are individual per process. At least that's how I
understand it, but I might be completely wrong.
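
Coming back to the isolated CPU point above, the idea is simply to
narrow the IPI target mask before sending anything. Sketch only;
cpu_needs_kernel_flush() is a made-up predicate standing in for
whatever policy (housekeeping mask, per-CPU tracking, ...) would
actually decide this:

    #include <linux/cpumask.h>

    /* Hypothetical policy hook */
    static bool cpu_needs_kernel_flush(unsigned int cpu);

    static void build_flush_mask(struct cpumask *dst)
    {
        unsigned int cpu;

        cpumask_clear(dst);
        for_each_online_cpu(cpu) {
            /* Skip CPUs which cannot be affected, e.g. isolated
               CPUs crunching away in user space */
            if (cpu_needs_kernel_flush(cpu))
                cpumask_set_cpu(cpu, dst);
        }
    }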

Thanks,

        tglx




