Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms

> On Jan 9, 2022, at 11:52 AM, Andy Lutomirski <luto@xxxxxxxxxx> wrote:
> 
> On Sun, Jan 9, 2022, at 11:10 AM, Linus Torvalds wrote:
>> On Sun, Jan 9, 2022 at 12:49 AM Nadav Amit <nadav.amit@xxxxxxxxx> wrote:
>>> 
>>> I do not know whether it is a pure win, but there is a tradeoff.
>> 
>> Hmm. I guess only some serious testing would tell.
>> 
>> On x86, I'd be a bit worried about removing lazy TLB simply because
>> even with ASID support there (called PCIDs by Intel for NIH reasons),
>> the actual ASID space on x86 was at least originally very very
>> limited.
>> 
>> Architecturally, x86 may expose 12 bits of ASID space, but iirc at
>> least the first few implementations actually only internally had one
>> or two bits, and hashed the 12 bits down to that internal very limited
>> hardware TLB ID space.
>> 
>> We only use a handful of ASIDs per CPU on x86 partly for this reason
>> (but also since there's no remote hardware TLB shootdown, there's no
>> reason to have a bigger global ASID space, so ASIDs aren't _that_
>> common).
>> 
>> And I don't know how many non-PCID x86 systems (perhaps virtualized?)
>> there might be out there.
>> 
>> But it would be very interesting to test some "disable lazy tlb"
>> patch. The main problem workloads tend to be IO, and I'm not sure how
>> many of the automated performance tests would catch issues. I guess
>> some threaded pipe ping-pong test (with each thread pinned to
>> different cores) would show it.
> 
> My original PCID series actually did remove lazy TLB on x86.  I don't remember why, but people objected.  The issue isn't the limited PCID space -- IIRC it's just that MOV CR3 is slooooow.  If we get rid of lazy TLB on x86, then we are writing CR3 twice on even a very short idle.  That adds maybe 1k cycles, which isn't great.

Just for the record: I just ran a short test with the CPUs at max
frequency on my Skylake. MOV-CR3 without a flush is 250-300 cycles.
One can argue that you mostly care about only one of the two switches
around the idle thread (the one on wakeup), and waking up has its own
overheads anyway.
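
For reference, the timing loop is roughly the sketch below (a minimal
illustration only, not the exact test I ran; the module name, loop
count and PCID check are made up for the example). It times a
no-flush CR3 reload (X86_CR3_PCID_NOFLUSH, bit 63) by rewriting the
current CR3 value with interrupts off:

/*
 * cr3_bench: hypothetical sketch, not the actual test.  Times a
 * no-flush CR3 reload by rewriting the current CR3 value in a tight
 * loop with interrupts disabled.  The no-flush bit is only valid when
 * CR4.PCIDE is set, so bail out otherwise.
 */
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/irqflags.h>
#include <asm/processor-flags.h>	/* X86_CR3_PCID_NOFLUSH, X86_CR4_PCIDE */

#define ITERS 1000

static inline u64 tsc_serialized(void)
{
	u32 lo, hi;

	/* LFENCE keeps RDTSC from being reordered around the loop. */
	asm volatile("lfence; rdtsc" : "=a"(lo), "=d"(hi));
	return ((u64)hi << 32) | lo;
}

static int __init cr3_bench_init(void)
{
	unsigned long flags, cr3, cr4;
	u64 start, end;
	int i;

	asm volatile("mov %%cr4, %0" : "=r"(cr4));
	if (!(cr4 & X86_CR4_PCIDE)) {
		pr_info("cr3_bench: PCID not enabled, skipping\n");
		return -ENODEV;
	}

	local_irq_save(flags);
	asm volatile("mov %%cr3, %0" : "=r"(cr3));
	start = tsc_serialized();
	for (i = 0; i < ITERS; i++)
		asm volatile("mov %0, %%cr3"
			     : : "r"(cr3 | X86_CR3_PCID_NOFLUSH) : "memory");
	end = tsc_serialized();
	local_irq_restore(flags);

	pr_info("cr3_bench: ~%llu cycles per no-flush CR3 write\n",
		(end - start) / ITERS);
	return -ENODEV;	/* measurement only, do not stay loaded */
}
module_init(cr3_bench_init);
MODULE_LICENSE("GPL");

Returning -ENODEV keeps the module from staying resident; loading it
once prints the average cycles per write to the kernel log.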

But you are the master of micro-optimizations, and, as Rik said, I
mostly think about TLB shootdowns and might be missing the big picture.
I am just trying to make your life easier with less coding, and my
life simpler when it comes to understanding your super-smart code. ;-)
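
P.S. For whoever wants to try the pipe ping-pong test Linus suggested
above, here is a rough user-space sketch (not an existing benchmark;
the pinned CPU numbers and the iteration count are arbitrary). Two
threads pinned to different cores bounce one byte through a pair of
pipes, so both CPUs keep entering and leaving idle, which is where the
lazy-TLB switch shows up:

/* pingpong.c: two threads pinned to different CPUs, one byte bounced
 * through a pair of pipes.  Build with: gcc -O2 -pthread pingpong.c */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define ITERS 1000000

static int ping[2], pong[2];

static void pin_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *echo_thread(void *arg)
{
	char c;
	int i;

	(void)arg;
	pin_to_cpu(1);				/* arbitrary second core */
	for (i = 0; i < ITERS; i++) {
		if (read(ping[0], &c, 1) != 1 || write(pong[1], &c, 1) != 1)
			break;
	}
	return NULL;
}

int main(void)
{
	struct timespec t0, t1;
	pthread_t tid;
	char c = 'x';
	double ns;
	int i;

	if (pipe(ping) || pipe(pong)) {
		perror("pipe");
		return 1;
	}
	pin_to_cpu(0);				/* arbitrary first core */
	pthread_create(&tid, NULL, echo_thread, NULL);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < ITERS; i++) {
		/* Each round trip blocks both sides, so the CPUs go idle. */
		write(ping[1], &c, 1);
		read(pong[0], &c, 1);
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	pthread_join(tid, NULL);

	ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
	printf("%.0f ns per round trip\n", ns / ITERS);
	return 0;
}

With one-byte payloads the round-trip time is dominated by the wakeup
and context-switch path rather than by copying, so a change to the
idle/lazy-TLB switch should be reflected in the reported number.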
