> On 9 Jan 2025, at 23:18, Dave Hansen <dave.hansen@xxxxxxxxx> wrote:
>
> But actually I think INVLPGB is *WAY* better than INVLPG here. INVLPG
> doesn't have ranged invalidation. It will only architecturally
> invalidate multiple 4K entries when the hardware fractured them in the
> first place. I think we should probably take advantage of what INVLPGB
> can do instead of following the INVLPG approach.
>
> INVLPGB will invalidate a range no matter where the underlying entries
> came from. Its "increment the virtual address at the 2M boundary" mode
> will invalidate entries of any size. That's my reading of the docs at
> least. Is that everyone else's reading too?

This is not my reading. I think this reading assumes that, besides the
broadcast, some new “range flush” capability was added to the TLB. My
guess is that this is not the case, since presumably it would require a
different TLB structure (and who does 2 changes at once ;-) ).

My understanding is therefore that it’s all in microcode. There are a
“stride” and a “number” which the microcode uses for iterating, and on
every iteration it issues a TLB invalidation. This invalidation is
similar to INVLPG, just as it was always done (putting aside the
variants that do not invalidate the PWC). IOW, the page size is not
given as part of INVLPG, nor as part of INVLPGB (regardless of the
stride); whatever entries translate a given address are the ones
invalidated.

I think my understanding is backed by the wording “regardless of the
page size”, which appears for INVLPG as well in AMD’s manual.

My guess is that invalidating more entries will take longer, maybe not
on the sender, but at least on the receiver. I also guess that in
certain setups - big NUMA machines - INVLPGB might perform worse. I
vaguely remember the ARM guys writing something about such behavior.
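
To make my mental model concrete, here is roughly the loop I imagine
the microcode running. This is just a sketch of my reading, not what
AMD actually ships: invlpgb_model() and flush_one_entry() are made-up
names, and the broadcast part is ignored entirely.

/*
 * One INVLPG-style invalidation: drops whatever TLB entry translates
 * 'va', regardless of that entry's page size.
 */
static inline void flush_one_entry(unsigned long va)
{
	asm volatile("invlpg (%0)" ::"r" (va) : "memory");
}

/*
 * Hypothetical model of INVLPGB's stride/number iteration (broadcast
 * omitted). 'nr' is the number of iterations, 'stride' the per-step
 * address increment.
 */
static void invlpgb_model(unsigned long va, unsigned long stride,
			  unsigned int nr)
{
	unsigned int i;

	for (i = 0; i < nr; i++) {
		flush_one_entry(va);
		va += stride;	/* e.g. 4KB or 2MB, per the stride flag */
	}
}

Nothing in this loop knows about page sizes; in this model the stride
only determines which addresses the loop touches.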