On Sun, Jan 9, 2022, at 1:34 PM, Nadav Amit wrote:
>> On Jan 9, 2022, at 11:52 AM, Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>>
>> On Sun, Jan 9, 2022, at 11:10 AM, Linus Torvalds wrote:
>>> On Sun, Jan 9, 2022 at 12:49 AM Nadav Amit <nadav.amit@xxxxxxxxx> wrote:
>>>>
>>>> I do not know whether it is a pure win, but there is a tradeoff.
>>>
>>> Hmm. I guess only some serious testing would tell.
>>>
>>> On x86, I'd be a bit worried about removing lazy TLB simply because
>>> even with ASID support there (called PCIDs by Intel for NIH reasons),
>>> the actual ASID space on x86 was at least originally very very
>>> limited.
>>>
>>> Architecturally, x86 may expose 12 bits of ASID space, but iirc at
>>> least the first few implementations actually only internally had one
>>> or two bits, and hashed the 12 bits down to that internal very limited
>>> hardware TLB ID space.
>>>
>>> We only use a handful of ASIDs per CPU on x86 partly for this reason
>>> (but also since there's no remote hardware TLB shootdown, there's no
>>> reason to have a bigger global ASID space, so ASIDs aren't _that_
>>> common).
>>>
>>> And I don't know how many non-PCID x86 systems (perhaps virtualized?)
>>> there might be out there.
>>>
>>> But it would be very interesting to test some "disable lazy tlb"
>>> patch. The main problem workloads tend to be IO, and I'm not sure how
>>> many of the automated performance tests would catch issues. I guess
>>> some threaded pipe ping-pong test (with each thread pinned to
>>> different cores) would show it.
>>
>> My original PCID series actually did remove lazy TLB on x86. I don't remember why, but people objected. The issue isn't the limited PCID space -- IIRC it's just that MOV CR3 is slooooow. If we get rid of lazy TLB on x86, then we are writing CR3 twice on even a very short idle. That adds maybe 1k cycles, which isn't great.
>
> Just for the record: I just ran a short test when CPUs are on max freq
> on my skylake. MOV-CR3 without flush is 250-300 cycles. One can argue
> that you mostly only care for one of the switches for the idle thread
> (once you wake up). And waking up by itself has its overheads.
>
> But you are the master of micro optimizations, and as Rik said, I
> mostly think of TLB shootdowns and might miss the big picture. Just
> trying to make your life easier by less coding and my life simpler
> in understanding your super-smart code. ;-)

As Rik pointed out, the mm_cpumask manipulation is also expensive if we get rid of lazy. Let me ponder how to do this nicely.
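
For reference, the "threaded pipe ping-pong test (with each thread pinned to different cores)" suggested in the quoted text could look roughly like the sketch below. This is only an illustration under stated assumptions, not code from this thread or from any existing benchmark suite: the names pingpong_ns, echo_thread, and pin_to_cpu are hypothetical, pinning is best effort (a failed affinity call is ignored), and the number it reports is wall-clock time per round trip, which includes pipe and wakeup overhead and not just the cost of the mm switch.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper types and names, for illustration only. */
struct pingpong_arg {
	int cpu;	/* CPU the echo thread tries to pin itself to */
	int rfd;	/* pipe end the echo thread reads from */
	int wfd;	/* pipe end the echo thread writes to */
	long iters;	/* number of round trips */
};

/* Best effort: if the CPU does not exist, the thread stays unpinned. */
static void pin_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

/* Echo each byte back; between bytes this thread blocks and its CPU can idle. */
static void *echo_thread(void *p)
{
	struct pingpong_arg *a = p;
	char c;

	pin_to_cpu(a->cpu);
	for (long i = 0; i < a->iters; i++) {
		if (read(a->rfd, &c, 1) != 1)
			break;
		if (write(a->wfd, &c, 1) != 1)
			break;
	}
	return NULL;
}

/*
 * Bounce one byte between the calling thread (pinned to cpu_a) and a new
 * thread (pinned to cpu_b). Returns average nanoseconds per round trip,
 * or -1.0 on error. Note that the caller's affinity is left changed.
 */
double pingpong_ns(int cpu_a, int cpu_b, long iters)
{
	int ab[2], ba[2];
	struct timespec s, e;
	pthread_t t;
	char c = 'x';
	long done;

	if (iters <= 0)
		return -1.0;
	if (pipe(ab))
		return -1.0;
	if (pipe(ba)) {
		close(ab[0]);
		close(ab[1]);
		return -1.0;
	}

	struct pingpong_arg arg = { cpu_b, ab[0], ba[1], iters };

	if (pthread_create(&t, NULL, echo_thread, &arg)) {
		close(ab[0]); close(ab[1]);
		close(ba[0]); close(ba[1]);
		return -1.0;
	}

	pin_to_cpu(cpu_a);
	clock_gettime(CLOCK_MONOTONIC, &s);
	for (done = 0; done < iters; done++) {
		if (write(ab[1], &c, 1) != 1)
			break;
		if (read(ba[0], &c, 1) != 1)
			break;
	}
	clock_gettime(CLOCK_MONOTONIC, &e);

	close(ab[1]);	/* EOF unblocks the echo thread if we stopped early */
	pthread_join(t, NULL);
	close(ab[0]);
	close(ba[0]);
	close(ba[1]);

	if (done != iters)
		return -1.0;
	return ((e.tv_sec - s.tv_sec) * 1e9 + (e.tv_nsec - s.tv_nsec)) / iters;
}
```

Calling pingpong_ns(0, 1, 100000) on a kernel with and without a "disable lazy tlb" patch would give two numbers to compare. Because both threads share one mm and each CPU goes idle while its thread waits on the pipe, every round trip exercises the switch to and from the idle thread that the CR3 discussion above is about.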