Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms

"Andy Lutomirski" <luto@xxxxxxxxxx> · Mon, 10 Jan 2022 12:52:49 -0800

On Sun, Jan 9, 2022, at 8:56 PM, Nicholas Piggin wrote:
> Excerpts from Linus Torvalds's message of January 10, 2022 7:51 am:
>> [ Ugh, I actually went back and looked at Nick's patches again, to
>> just verify my memory, and they weren't as pretty as I thought they
>> were ]
>> 
>> On Sun, Jan 9, 2022 at 12:48 PM Linus Torvalds
>> <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>>>
>>> I'd much rather have a *much* smaller patch that says "on x86 and
>>> powerpc, we don't need this overhead at all".
>> 
>> For some reason I thought Nick's patch worked at "last mmput" time and
>> the TLB flush IPIs that happen at that point anyway would then make
>> sure any lazy TLB is cleaned up.
>> 
>> But that's not actually what it does. It ties the
>> MMU_LAZY_TLB_REFCOUNT to an explicit TLB shootdown triggered by the
>> last mmdrop() instead. Because it really tied the whole logic to the
>> mm_count logic (and made lazy tlb to not do mm_count) rather than the
>> mm_users thing I mis-remembered it doing.
>
> It does this because on powerpc with hash MMU, we can't use IPIs for
> TLB shootdowns.

I know nothing about powerpc’s mmu. If you can’t do IPI shootdowns, it sounds like the hazard pointer scheme might actually be pretty good.

>
>> So at least some of my arguments were based on me just mis-remembering
>> what Nick's patch actually did (mainly because I mentally recreated
>> the patch from "Nick did something like this" and what I thought would
>> be the way to do it on x86).
>
> With powerpc with the radix MMU using IPI based shootdowns, we can 
> actually do the switch-away-from-lazy on the final TLB flush and the
> final broadcast shootdown thing becomes a no-op. I didn't post that
> additional patch because it's powerpc-specific and I didn't want to
> post more code so widely.
>
>> So I guess I have to recant my arguments.
>> 
>> I still think my "get rid of lazy at last mmput" model should work,
>> and would be a perfect match for x86, but I can't really point to Nick
>> having done that.
>> 
>> So I was full of BS.
>> 
>> Hmm. I'd love to try to actually create a patch that does that "Nick
>> thing", but on last mmput() (ie when __mmput triggers). Because I
>> think this is interesting. But then I look at my schedule for the
>> upcoming week, and I go "I don't have a leg to stand on in this
>> discussion, and I'm just all hot air".
>
> I agree Andy's approach is very complicated and adds more overhead than
> necessary for powerpc, which is why I don't want to use it. I'm still
> not entirely sure what the big problem would be to convert x86 to use
> it, I admit I haven't kept up with the exact details of its lazy tlb
> mm handling recently though.

The big problem is the entire remainder of this series!  If x86 is going to do shootdowns without mm_count, I want the result to work and be maintainable. A few of the issues that needed solving:

- x86 tracks usage of the lazy mm on CPUs that have it loaded into the MMU, not CPUs that have it in active_mm.  Getting this in sync needed core changes.

- mmgrab and mmdrop are barriers, and core code relies on that. If we get rid of a bunch of calls (conditionally), we need to stop depending on the barriers. I fixed this.

- There were too many mmgrab and mmdrop calls, and the call sites had different semantics and different refcounting rules (thanks, kthread).  I cleaned this up.

- If we do a shootdown instead of a refcount, then, when exit() tears down its mm, we are lazily using *that* mm when we do the shootdowns. If active_mm continues to point to the being-freed mm and an NMI tries to dereference it, we’re toast. I fixed those issues.

- If we do a UEFI runtime service call while lazy or a text_poke while lazy and the mm goes away while this is happening, we would blow up. Refcounting prevents this but, in current kernels, a shootdown IPI on x86 would not prevent this.  I fixed these issues (and removed duplicate code).

My point here is that the current lazy mm code is a huge mess. 90% of the complexity in this series is cleaning up core messiness and x86 messiness. I would still like to get rid of ->active_mm entirely (it appears to serve no good purpose on any architecture),  it that can be saved for later, I think.