Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms

"Andy Lutomirski" <luto@xxxxxxxxxx> · Sun, 09 Jan 2022 11:52:52 -0800

On Sun, Jan 9, 2022, at 11:10 AM, Linus Torvalds wrote:
> On Sun, Jan 9, 2022 at 12:49 AM Nadav Amit <nadav.amit@xxxxxxxxx> wrote:
>>
>> I do not know whether it is a pure win, but there is a tradeoff.
>
> Hmm. I guess only some serious testing would tell.
>
> On x86, I'd be a bit worried about removing lazy TLB simply because
> even with ASID support there (called PCIDs by Intel for NIH reasons),
> the actual ASID space on x86 was at least originally very very
> limited.
>
> Architecturally, x86 may expose 12 bits of ASID space, but iirc at
> least the first few implementations actually only internally had one
> or two bits, and hashed the 12 bits down to that internal very limited
> hardware TLB ID space.
>
> We only use a handful of ASIDs per CPU on x86 partly for this reason
> (but also since there's no remote hardware TLB shootdown, there's no
> reason to have a bigger global ASID space, so ASIDs aren't _that_
> common).
>
> And I don't know how many non-PCID x86 systems (perhaps virtualized?)
> there might be out there.
>
> But it would be very interesting to test some "disable lazy tlb"
> patch. The main problem workloads tend to be IO, and I'm not sure how
> many of the automated performance tests would catch issues. I guess
> some threaded pipe ping-pong test (with each thread pinned to
> different cores) would show it.

My original PCID series actually did remove lazy TLB on x86.  I don't remember why, but people objected.  The issue isn't the limited PCID space -- IIRC it's just that MOV CR3 is slooooow.  If we get rid of lazy TLB on x86, then we are writing CR3 twice on even a very short idle.  That adds maybe 1k cycles, which isn't great.

>
> And I guess there is some load that triggered the original powerpc
> patch by Nick&co, and that Andy has been using..

I don't own a big enough machine.  The workloads I'm aware of with the problem have massively multithreaded programs using many CPUs, and transitions into and out of lazy mode ping-pong the cacheline.

>
> Anybody willing to cook up a patch and run some benchmarks? Perhaps
> one that basically just replaces "set ->mm to NULL" with "set ->mm to
> &init_mm" - so that the lazy TLB code is still *there*, but it never
> triggers..

It would 

>
> I think it's mainly 'copy_thread()' in kernel/fork.c and the 'init_mm'
> initializer in mm/init-mm.c, but there's probably other things too
> that have that knowledge of the special "tsk->mm = NULL" situation.

I think, for a little test, we would leave all the mm == NULL code alone and just change the enter-lazy logic.  On top of all the cleanups in this series, that would be trivial.

>
>                   Linus