Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms

Nicholas Piggin <npiggin@xxxxxxxxx> · Tue, 11 Jan 2022 13:10:09 +1000

Excerpts from Andy Lutomirski's message of January 11, 2022 6:52 am:
> 
> 
> On Sun, Jan 9, 2022, at 8:56 PM, Nicholas Piggin wrote:
>> Excerpts from Linus Torvalds's message of January 10, 2022 7:51 am:
>>> [ Ugh, I actually went back and looked at Nick's patches again, to
>>> just verify my memory, and they weren't as pretty as I thought they
>>> were ]
>>> 
>>> On Sun, Jan 9, 2022 at 12:48 PM Linus Torvalds
>>> <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>>>>
>>>> I'd much rather have a *much* smaller patch that says "on x86 and
>>>> powerpc, we don't need this overhead at all".
>>> 
>>> For some reason I thought Nick's patch worked at "last mmput" time and
>>> the TLB flush IPIs that happen at that point anyway would then make
>>> sure any lazy TLB is cleaned up.
>>> 
>>> But that's not actually what it does. It ties the
>>> MMU_LAZY_TLB_REFCOUNT to an explicit TLB shootdown triggered by the
>>> last mmdrop() instead. Because it really tied the whole logic to the
>>> mm_count logic (and made lazy tlb to not do mm_count) rather than the
>>> mm_users thing I mis-remembered it doing.
>>
>> It does this because on powerpc with hash MMU, we can't use IPIs for
>> TLB shootdowns.
> 
> I know nothing about powerpc’s mmu. If you can’t do IPI shootdowns,

The paravirtualised hash MMU environment doesn't because it has a single 
level translation and the guest uses hypercalls to insert and remove 
translations and the hypervisor flushes TLBs. The HV could flush TLBs
with IPIs but obviously the guest can't use those to execute the lazy
switch. In radix guests (and all bare metal) the OS flushes its own
TLBs.

We are moving over to radix, but powerpc also has a hardware broadcast 
flush instruction which can be a bit faster than IPIs and is usable by 
bare metal and radix guests so those can also avoid the IPIs if they 
want. Part of the powerpc patch I just sent to combine the lazy switch 
with the final TLB flush is to force it to always take the IPI path and 
not use TLBIE instruction on the final exit.

So hazard points could avoid some IPIs there too.

> it sounds like the hazard pointer scheme might actually be pretty good.

Some IPIs in the exit path just aren't that big a concern. I measured,
got numbers, tried to irritate it, just wasn't really a problem. Some
archs use IPIs for all threaded TLB shootdowns and exits not that
frequent. Very fast short lived processes that do a lot of exits just
don't tend to spread across a lot of CPUs leaving lazy tlb mms to shoot,
and long lived and multi threaded ones that do don't exit at high rates.

So from what I can see it's premature optimization. Actually maybe not
even optimization because IIRC it adds complexity and even a barrier on
powerpc in the context switch path which is a lot more critical than
exit() for us we don't want slowdowns there.

It's a pretty high complexity boutique kind of synchronization. Which
don't get me wrong is the kind of thing I like, it is clever and may be
perfectly bug free but it needs to prove itself over the simple dumb
shoot lazies approach.

>>> So at least some of my arguments were based on me just mis-remembering
>>> what Nick's patch actually did (mainly because I mentally recreated
>>> the patch from "Nick did something like this" and what I thought would
>>> be the way to do it on x86).
>>
>> With powerpc with the radix MMU using IPI based shootdowns, we can 
>> actually do the switch-away-from-lazy on the final TLB flush and the
>> final broadcast shootdown thing becomes a no-op. I didn't post that
>> additional patch because it's powerpc-specific and I didn't want to
>> post more code so widely.
>>
>>> So I guess I have to recant my arguments.
>>> 
>>> I still think my "get rid of lazy at last mmput" model should work,
>>> and would be a perfect match for x86, but I can't really point to Nick
>>> having done that.
>>> 
>>> So I was full of BS.
>>> 
>>> Hmm. I'd love to try to actually create a patch that does that "Nick
>>> thing", but on last mmput() (ie when __mmput triggers). Because I
>>> think this is interesting. But then I look at my schedule for the
>>> upcoming week, and I go "I don't have a leg to stand on in this
>>> discussion, and I'm just all hot air".
>>
>> I agree Andy's approach is very complicated and adds more overhead than
>> necessary for powerpc, which is why I don't want to use it. I'm still
>> not entirely sure what the big problem would be to convert x86 to use
>> it, I admit I haven't kept up with the exact details of its lazy tlb
>> mm handling recently though.
> 
> The big problem is the entire remainder of this series!  If x86 is going to do shootdowns without mm_count, I want the result to work and be maintainable. A few of the issues that needed solving:
> 
> - x86 tracks usage of the lazy mm on CPUs that have it loaded into the MMU, not CPUs that have it in active_mm.  Getting this in sync needed core changes.

Definitely should have been done at the time x86 deviated, but better 
late than never.

> 
> - mmgrab and mmdrop are barriers, and core code relies on that. If we get rid of a bunch of calls (conditionally), we need to stop depending on the barriers. I fixed this.

membarrier relied on a call that mmdrop was providing. Adding a smp_mb()
instead if mmdrop is a no-op is fine. Patches changing membarrier's 
ordering requirements can be concurrent and are not fundmentally tied
to lazy tlb mm switching, it just reuses an existing ordering point.

> - There were too many mmgrab and mmdrop calls, and the call sites had different semantics and different refcounting rules (thanks, kthread).  I cleaned this up.

Seems like a decent cleanup. Again lazy tlb specific, just general switch
code should be factored and better contained in kernel/sched/ which is
fine, but concurrent to lazy tlb improvements.

> - If we do a shootdown instead of a refcount, then, when exit() tears down its mm, we are lazily using *that* mm when we do the shootdowns. If active_mm continues to point to the being-freed mm and an NMI tries to dereference it, we’re toast. I fixed those issues.

My shoot lazies patch has no such issues with that AFAIKS. What exact 
issue is it and where did you fix it?

> 
> - If we do a UEFI runtime service call while lazy or a text_poke while lazy and the mm goes away while this is happening, we would blow up. Refcounting prevents this but, in current kernels, a shootdown IPI on x86 would not prevent this.  I fixed these issues (and removed duplicate code).
> 
> My point here is that the current lazy mm code is a huge mess. 90% of the complexity in this series is cleaning up core messiness and x86 messiness. I would still like to get rid of ->active_mm entirely (it appears to serve no good purpose on any architecture),  it that can be saved for later, I think.

I disagree, the lazy mm code is very clean and simple. And I can't see 
how you would propose to remove active_mm from core code I'm skeptical
but would be very interested to see, but that's nothing to do with my
shoot lazies patch and can also be concurrent except for mechanical
merge issues.

Thanks,
Nick