> On 9 Jan 2025, at 23:18, Dave Hansen <dave.hansen@xxxxxxxxx> wrote:
>
> But actually I think INVLPGB is *WAY* better than INVLPG here. INVLPG
> doesn't have ranged invalidation. It will only architecturally
> invalidate multiple 4K entries when the hardware fractured them in the
> first place. I think we should probably take advantage of what INVLPGB
> can do instead of following the INVLPG approach.
>
> INVLPGB will invalidate a range no matter where the underlying entries
> came from. Its "increment the virtual address at the 2M boundary" mode
> will invalidate entries of any size. That's my reading of the docs at
> least. Is that everyone else's reading too?

This is not my reading. I think this reading assumes that, besides the
broadcast, some new “range flush” capability was added to the TLB. My
guess is that this is not the case, since presumably it would require a
different TLB structure (and who does 2 changes at once ;-) ).

My understanding is therefore that it’s all in microcode. There are a
“stride” and a “number” which the microcode uses for iterating, and on
every iteration it issues a TLB invalidation. This invalidation is
similar to INVLPG, just as it was always done (putting aside the
variants that do not invalidate the PWC). IOW, the page size is not
given as part of INVLPG, nor as part of INVLPGB (regardless of the
stride); whatever entries translate a given address are the ones
invalidated.

I think my understanding is backed by the wording “regardless of the
page size”, which appears for INVLPG as well in AMD’s manual.

My guess is that invalidating more entries will take longer, maybe not
on the sender, but at least on the receiver. I also guess that in
certain setups - big NUMA machines - INVLPGB might perform worse. I
vaguely remember the ARM guys writing something about such behavior.
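
To make my mental model concrete, here is roughly the loop I imagine
the microcode running. This is just a sketch of my reading, not what
AMD actually ships: invlpgb_model() and flush_one_entry() are made-up
names, and the broadcast part is ignored entirely.

/*
 * One INVLPG-style invalidation: drops whatever TLB entry translates
 * 'va', regardless of that entry's page size.
 */
static inline void flush_one_entry(unsigned long va)
{
	asm volatile("invlpg (%0)" ::"r" (va) : "memory");
}

/*
 * Hypothetical model of INVLPGB's stride/number iteration (broadcast
 * omitted). 'nr' is the number of iterations, 'stride' the per-step
 * address increment.
 */
static void invlpgb_model(unsigned long va, unsigned long stride,
			  unsigned int nr)
{
	unsigned int i;

	for (i = 0; i < nr; i++) {
		flush_one_entry(va);
		va += stride;	/* e.g. 4KB or 2MB, per the stride flag */
	}
}

Nothing in this loop knows about page sizes; in this model the stride
only determines which addresses the loop touches.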