Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms

"Andy Lutomirski" <luto@xxxxxxxxxx> · Sun, 09 Jan 2022 13:48:05 -0700

On Sun, Jan 9, 2022, at 1:34 PM, Nadav Amit wrote:
>> On Jan 9, 2022, at 11:52 AM, Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>> 
>> On Sun, Jan 9, 2022, at 11:10 AM, Linus Torvalds wrote:
>>> On Sun, Jan 9, 2022 at 12:49 AM Nadav Amit <nadav.amit@xxxxxxxxx> wrote:
>>>> 
>>>> I do not know whether it is a pure win, but there is a tradeoff.
>>> 
>>> Hmm. I guess only some serious testing would tell.
>>> 
>>> On x86, I'd be a bit worried about removing lazy TLB simply because
>>> even with ASID support there (called PCIDs by Intel for NIH reasons),
>>> the actual ASID space on x86 was at least originally very very
>>> limited.
>>> 
>>> Architecturally, x86 may expose 12 bits of ASID space, but iirc at
>>> least the first few implementations actually only internally had one
>>> or two bits, and hashed the 12 bits down to that internal very limited
>>> hardware TLB ID space.
>>> 
>>> We only use a handful of ASIDs per CPU on x86 partly for this reason
>>> (but also since there's no remote hardware TLB shootdown, there's no
>>> reason to have a bigger global ASID space, so ASIDs aren't _that_
>>> common).
>>> 
>>> And I don't know how many non-PCID x86 systems (perhaps virtualized?)
>>> there might be out there.
>>> 
>>> But it would be very interesting to test some "disable lazy tlb"
>>> patch. The main problem workloads tend to be IO, and I'm not sure how
>>> many of the automated performance tests would catch issues. I guess
>>> some threaded pipe ping-pong test (with each thread pinned to
>>> different cores) would show it.
>> 
>> My original PCID series actually did remove lazy TLB on x86.  I don't remember why, but people objected.  The issue isn't the limited PCID space -- IIRC it's just that MOV CR3 is slooooow.  If we get rid of lazy TLB on x86, then we are writing CR3 twice on even a very short idle.  That adds maybe 1k cycles, which isn't great.
>
> Just for the record: I just ran a short test when CPUs are on max freq
> on my skylake. MOV-CR3 without flush is 250-300 cycles. One can argue
> that you mostly only care for one of the switches for the idle thread
> (once you wake up). And waking up by itself has its overheads.
>
> But you are the master of micro optimizations, and as Rik said, I
> mostly think of TLB shootdowns and might miss the big picture. Just
> trying to make your life easier by less coding and my life simpler
> in understanding your super-smart code. ;-)

As Rik pointed out, the mm_cpumask manipulation is also expensive if we get rid of lazy TLB. Let me ponder how to do this nicely.
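
For anyone who wants to try the threaded pipe ping-pong test Linus describes above, here is a rough userspace sketch. It is an illustration only, not part of the patch series: the names pingpong_ns, echo_thread, pin_to_cpu and struct pp_args are made up for this example. Two threads of one process bounce a byte over a pair of pipes, so each core should go idle while its thread blocks in read(), which is roughly the short-idle case where lazy TLB matters. Pinning is best effort, and passing -1 skips it.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper types and names, for illustration only. */
struct pp_args {
	int rfd, wfd;	/* pipe ends used by the echo thread */
	int cpu;	/* CPU to pin to, or -1 for no pinning */
	long iters;
};

/* Best-effort pinning of the calling thread; failures are ignored. */
static void pin_to_cpu(int cpu)
{
	cpu_set_t set;

	if (cpu < 0 || cpu >= CPU_SETSIZE)
		return;
	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

/* Reads one byte and echoes it back, iters times or until EOF/error. */
static void *echo_thread(void *p)
{
	struct pp_args *a = p;
	char c;

	pin_to_cpu(a->cpu);
	for (long i = 0; i < a->iters; i++) {
		if (read(a->rfd, &c, 1) != 1 || write(a->wfd, &c, 1) != 1)
			break;
	}
	return NULL;
}

/*
 * Returns the average wall-clock nanoseconds per round trip, or -1.0 on
 * error. Note that the calling thread stays pinned to cpu_a afterwards.
 */
double pingpong_ns(long iters, int cpu_a, int cpu_b)
{
	int ab[2], ba[2];
	struct timespec t0, t1;
	struct pp_args args;
	pthread_t th;
	char c = 'x';
	long done;

	if (iters <= 0 || pipe(ab))
		return -1.0;
	if (pipe(ba)) {
		close(ab[0]);
		close(ab[1]);
		return -1.0;
	}

	args = (struct pp_args){ .rfd = ab[0], .wfd = ba[1],
				 .cpu = cpu_b, .iters = iters };
	if (pthread_create(&th, NULL, echo_thread, &args)) {
		close(ab[0]); close(ab[1]);
		close(ba[0]); close(ba[1]);
		return -1.0;
	}

	pin_to_cpu(cpu_a);
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (done = 0; done < iters; done++) {
		if (write(ab[1], &c, 1) != 1 || read(ba[0], &c, 1) != 1)
			break;
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	/* Closing our write end unblocks the echo thread if we bailed early. */
	close(ab[1]);
	pthread_join(th, NULL);
	close(ab[0]);
	close(ba[0]);
	close(ba[1]);

	if (done != iters)
		return -1.0;
	return ((t1.tv_sec - t0.tv_sec) * 1e9 +
		(t1.tv_nsec - t0.tv_nsec)) / iters;
}
```

Something like pingpong_ns(1000000, 0, 1), built with -pthread and run on kernels with and without a "disable lazy tlb" patch, would give a first-order comparison. The number includes wakeup and scheduler overhead as well, so it only bounds the lazy TLB effect rather than isolating it.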