On 1/10/25 08:08, Rik van Riel wrote:
> On Fri, 2025-01-10 at 07:14 -0800, Dave Hansen wrote:
...
>> I think the key thing we need to decide is whether to treat a single
>> INVLPGB(stride=8) more like a single INVLPGB or eight INVLPGBs.
>> Measuring a bunch of invalidation loops should tell us that.
>
> Would I be wrong to assume that the CPUs have
> some optimizations built in to efficiently
> execute an invalidation for "everything in a
> PCID"?

There's only a few bits in the actual TLBs to store the PCID (or VPID),
roughly 3 on Intel. Then there's another structure to map between the
architectural PCID and the 3 bits of actual hardware alias. That's what
I know.

The rest is pure speculation: All you have to do in theory is zap the
one entry in the PCID=>HW mapping structure to invalidate a whole PCID.
You don't need to run through the TLB itself to invalidate it. You need
to do something else to make sure that the now-unused 3-bit hardware
identifier gets reused at _some_ point, but there may be other tricks
for that.

> The "global invalidate" we send does not
> zap everything in the TLB, but only the
> translations for a single PCID.
>
> I suppose we should measure these things
> at some point (after I do the other
> cleanups?), because the CPUs may well have
> made a bunch of optimizations that we
> don't know about.

IIRC, the "big" invalidation modes are pretty cheap to execute. Most of
the cost comes from the TLB refill, not the flush itself.

But there's no substitute for actually measuring it. There's some wonky
stuff out there. The last time Andy L. went and looked at it, there were
oddities like INVPCID's "Individual-address invalidation" and INVLPG
having surprisingly different performance.
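
To make the PCID=>HW mapping speculation above a bit more concrete, here
is a userspace toy model in C. It is not a description of any real CPU:
every name in it (toy_tlb, toy_flush_pcid() and friends), the table
sizes and the "scrub stale entries when a tag gets reused" policy are
assumptions made up for illustration. The only point is that a
whole-PCID flush can be a single write to the mapping structure, with
the per-entry cleanup deferred to tag reuse.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HW_TAGS  8   /* 3-bit hardware alias, per the "roughly 3" above */
#define TLB_SIZE 16  /* arbitrary toy size */

struct tag_slot  { bool valid; uint16_t pcid; };
struct tlb_entry { bool valid; uint8_t tag; uint64_t vpn, pfn; };

struct toy_tlb {
	struct tag_slot  map[HW_TAGS];  /* architectural PCID => hardware tag */
	struct tlb_entry ent[TLB_SIZE];
	unsigned int next_tag;          /* round-robin victim for tag reuse */
	unsigned int next_ent;          /* round-robin victim for TLB fills */
};

/* Return the hardware tag currently bound to @pcid, or -1 if none. */
int toy_find_tag(const struct toy_tlb *t, uint16_t pcid)
{
	for (int i = 0; i < HW_TAGS; i++)
		if (t->map[i].valid && t->map[i].pcid == pcid)
			return i;
	return -1;
}

/* Bind a tag to @pcid. Stale entries are scrubbed here, at reuse time. */
int toy_alloc_tag(struct toy_tlb *t, uint16_t pcid)
{
	int tag = -1;

	for (int i = 0; i < HW_TAGS && tag < 0; i++)
		if (!t->map[i].valid)
			tag = i;
	if (tag < 0)			/* no free tag: evict one */
		tag = t->next_tag++ % HW_TAGS;

	for (int i = 0; i < TLB_SIZE; i++)
		if (t->ent[i].valid && t->ent[i].tag == tag)
			t->ent[i].valid = false;

	t->map[tag].valid = true;
	t->map[tag].pcid = pcid;
	return tag;
}

/* Install a translation for (@pcid, @vpn), allocating a tag if needed. */
void toy_fill(struct toy_tlb *t, uint16_t pcid, uint64_t vpn, uint64_t pfn)
{
	struct tlb_entry *e = &t->ent[t->next_ent++ % TLB_SIZE];
	int tag = toy_find_tag(t, pcid);

	if (tag < 0)
		tag = toy_alloc_tag(t, pcid);
	e->valid = true;
	e->tag = (uint8_t)tag;
	e->vpn = vpn;
	e->pfn = pfn;
}

/* Hit only if @pcid still owns a tag and an entry carries that tag. */
bool toy_lookup(const struct toy_tlb *t, uint16_t pcid, uint64_t vpn,
		uint64_t *pfn)
{
	int tag = toy_find_tag(t, pcid);

	if (tag < 0)
		return false;
	for (int i = 0; i < TLB_SIZE; i++) {
		const struct tlb_entry *e = &t->ent[i];

		if (e->valid && e->tag == tag && e->vpn == vpn) {
			*pfn = e->pfn;
			return true;
		}
	}
	return false;
}

/* "Flush everything in a PCID": drop the mapping, never walk ent[]. */
void toy_flush_pcid(struct toy_tlb *t, uint16_t pcid)
{
	int tag = toy_find_tag(t, pcid);

	if (tag >= 0)
		t->map[tag].valid = false;
}

/* Small self-check of the model; returns 0 if the asserts hold. */
int toy_demo(void)
{
	struct toy_tlb t = { 0 };
	uint64_t pfn = 0;

	toy_fill(&t, 5, 0x1000, 0xaaaa);
	toy_fill(&t, 9, 0x1000, 0xbbbb);
	assert(toy_lookup(&t, 5, 0x1000, &pfn) && pfn == 0xaaaa);

	toy_flush_pcid(&t, 5);
	assert(!toy_lookup(&t, 5, 0x1000, &pfn));	/* PCID 5 is gone */
	assert(toy_lookup(&t, 9, 0x1000, &pfn) && pfn == 0xbbbb);

	toy_fill(&t, 7, 0x2000, 0xcccc);	/* reuses PCID 5's old tag */
	assert(!toy_lookup(&t, 7, 0x1000, &pfn));	/* stale entry scrubbed */
	return 0;
}
```

In this model the flush cost is independent of TLB_SIZE and the deferred
scrub lands in toy_alloc_tag(). Whether real hardware behaves anything
like this is exactly what would have to be measured.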