On Sun, Jan 9, 2022, at 11:10 AM, Linus Torvalds wrote:
> On Sun, Jan 9, 2022 at 12:49 AM Nadav Amit <nadav.amit@xxxxxxxxx> wrote:
>>
>> I do not know whether it is a pure win, but there is a tradeoff.
>
> Hmm. I guess only some serious testing would tell.
>
> On x86, I'd be a bit worried about removing lazy TLB simply because
> even with ASID support there (called PCIDs by Intel for NIH reasons),
> the actual ASID space on x86 was at least originally very very
> limited.
>
> Architecturally, x86 may expose 12 bits of ASID space, but iirc at
> least the first few implementations actually only internally had one
> or two bits, and hashed the 12 bits down to that internal very limited
> hardware TLB ID space.
>
> We only use a handful of ASIDs per CPU on x86 partly for this reason
> (but also since there's no remote hardware TLB shootdown, there's no
> reason to have a bigger global ASID space, so ASIDs aren't _that_
> common).
>
> And I don't know how many non-PCID x86 systems (perhaps virtualized?)
> there might be out there.
>
> But it would be very interesting to test some "disable lazy tlb"
> patch. The main problem workloads tend to be IO, and I'm not sure how
> many of the automated performance tests would catch issues. I guess
> some threaded pipe ping-pong test (with each thread pinned to
> different cores) would show it.

My original PCID series actually did remove lazy TLB on x86. I don't
remember why, but people objected. The issue isn't the limited PCID
space -- IIRC it's just that MOV CR3 is slooooow. If we get rid of lazy
TLB on x86, then we are writing CR3 twice on even a very short idle.
That adds maybe 1k cycles, which isn't great.

>
> And I guess there is some load that triggered the original powerpc
> patch by Nick&co, and that Andy has been using..

I don't own a big enough machine. The workloads I'm aware of with the
problem have massively multithreaded programs using many CPUs, and
transitions into and out of lazy mode ping-pong the cacheline.

>
> Anybody willing to cook up a patch and run some benchmarks? Perhaps
> one that basically just replaces "set ->mm to NULL" with "set ->mm to
> &init_mm" - so that the lazy TLB code is still *there*, but it never
> triggers.. It would
>
> I think it's mainly 'copy_thread()' in kernel/fork.c and the 'init_mm'
> initializer in mm/init-mm.c, but there's probably other things too
> that have that knowledge of the special "tsk->mm = NULL" situation.

I think, for a little test, we would leave all the mm == NULL code
alone and just change the enter-lazy logic. On top of all the cleanups
in this series, that would be trivial.

>
> Linus
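
For reference, the "threaded pipe ping-pong test (with each thread pinned to different cores)" mentioned above could look roughly like the userspace sketch below. It is only an illustration and not part of any patch in this series: the names pipe_pingpong() and echo_thread() are made up, and the CPU numbers are whatever the caller picks. The assumption is that each blocking read() lets that thread's CPU drop to idle between messages, so the measured round-trip time should include whatever the idle entry/exit path costs on the kernel under test.

```c
#define _GNU_SOURCE             /* CPU_SET(), pthread_*affinity_np() */
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper types and names, for illustration only. */
struct echo_args {
    int rfd, wfd;               /* read a byte on rfd, echo it on wfd */
    long iters;
};

/* Echo side: block in read(), then send the byte straight back. */
static void *echo_thread(void *p)
{
    struct echo_args *a = p;
    char c;

    for (long i = 0; i < a->iters; i++)
        if (read(a->rfd, &c, 1) != 1 || write(a->wfd, &c, 1) != 1)
            break;
    return NULL;
}

/*
 * Bounce one byte between the calling thread (pinned to cpu_a) and a new
 * thread (pinned to cpu_b). A negative CPU number means "do not pin".
 * Returns the number of completed round trips, or -1 on setup failure,
 * and stores the average nanoseconds per round trip in *ns_per_rt.
 */
long pipe_pingpong(int cpu_a, int cpu_b, long iters, double *ns_per_rt)
{
    int a2b[2], b2a[2];
    long done = -1;
    cpu_set_t old, set;
    pthread_attr_t attr;
    pthread_t tid;
    struct echo_args args;
    struct timespec t0, t1;
    char c = 'x';

    assert(iters > 0 && ns_per_rt);
    if (pipe(a2b))
        return -1;
    if (pipe(b2a)) {
        close(a2b[0]);
        close(a2b[1]);
        return -1;
    }
    args = (struct echo_args){ a2b[0], b2a[1], iters };

    /* Remember the caller's affinity so it can be restored afterwards. */
    pthread_getaffinity_np(pthread_self(), sizeof(old), &old);
    pthread_attr_init(&attr);
    if (cpu_b >= 0) {
        CPU_ZERO(&set);
        CPU_SET(cpu_b, &set);
        pthread_attr_setaffinity_np(&attr, sizeof(set), &set);
    }
    if (cpu_a >= 0) {
        CPU_ZERO(&set);
        CPU_SET(cpu_a, &set);
        if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set))
            goto out;
    }
    /* Fails if cpu_b is not a CPU this process may run on. */
    if (pthread_create(&tid, &attr, echo_thread, &args))
        goto out;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (done = 0; done < iters; done++)
        if (write(a2b[1], &c, 1) != 1 || read(b2a[0], &c, 1) != 1)
            break;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    /* Closing our write end gives the echo thread EOF if we stopped early. */
    close(a2b[1]);
    a2b[1] = -1;
    pthread_join(tid, NULL);
    if (done > 0)
        *ns_per_rt = ((t1.tv_sec - t0.tv_sec) * 1e9 +
                      (t1.tv_nsec - t0.tv_nsec)) / done;
out:
    pthread_setaffinity_np(pthread_self(), sizeof(old), &old);
    pthread_attr_destroy(&attr);
    close(a2b[0]);
    if (a2b[1] >= 0)
        close(a2b[1]);
    close(b2a[0]);
    close(b2a[1]);
    return done;
}
```

Something like pipe_pingpong(0, 1, 1000000, &ns) run on otherwise idle CPUs, once on a stock kernel and once with lazy TLB disabled, would give two numbers to compare. It only measures end-to-end round-trip latency, so it cannot attribute a difference to CR3 writes specifically, and with just two threads it does not exercise the many-CPU cacheline contention case.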