Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms

Nicholas Piggin <npiggin@xxxxxxxxx> · Wed, 12 Jan 2022 08:48:22 +1000

Excerpts from Andy Lutomirski's message of January 12, 2022 1:39 am:
> 
> 
> On Mon, Jan 10, 2022, at 7:10 PM, Nicholas Piggin wrote:
>> Excerpts from Andy Lutomirski's message of January 11, 2022 6:52 am:
>>> 
>>> 
>>> On Sun, Jan 9, 2022, at 8:56 PM, Nicholas Piggin wrote:
>>>> Excerpts from Linus Torvalds's message of January 10, 2022 7:51 am:
>>>>> [ Ugh, I actually went back and looked at Nick's patches again, to
>>>>> just verify my memory, and they weren't as pretty as I thought they
>>>>> were ]
>>>>> 
>>>>> On Sun, Jan 9, 2022 at 12:48 PM Linus Torvalds
>>>>> <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>>>>>>
>>>>>> I'd much rather have a *much* smaller patch that says "on x86 and
>>>>>> powerpc, we don't need this overhead at all".
>>>>> 
>>>>> For some reason I thought Nick's patch worked at "last mmput" time and
>>>>> the TLB flush IPIs that happen at that point anyway would then make
>>>>> sure any lazy TLB is cleaned up.
>>>>> 
>>>>> But that's not actually what it does. It ties the
>>>>> MMU_LAZY_TLB_REFCOUNT to an explicit TLB shootdown triggered by the
>>>>> last mmdrop() instead. Because it really tied the whole logic to the
>>>>> mm_count logic (and made lazy tlb to not do mm_count) rather than the
>>>>> mm_users thing I mis-remembered it doing.
>>>>
>>>> It does this because on powerpc with hash MMU, we can't use IPIs for
>>>> TLB shootdowns.
>>> 
>>> I know nothing about powerpc’s mmu. If you can’t do IPI shootdowns,
>>
>> The paravirtualised hash MMU environment doesn't because it has a single 
>> level translation and the guest uses hypercalls to insert and remove 
>> translations and the hypervisor flushes TLBs. The HV could flush TLBs
>> with IPIs but obviously the guest can't use those to execute the lazy
>> switch. In radix guests (and all bare metal) the OS flushes its own
>> TLBs.
>>
>> We are moving over to radix, but powerpc also has a hardware broadcast 
>> flush instruction which can be a bit faster than IPIs and is usable by 
>> bare metal and radix guests so those can also avoid the IPIs if they 
>> want. Part of the powerpc patch I just sent to combine the lazy switch 
>> with the final TLB flush is to force it to always take the IPI path and 
>> not use TLBIE instruction on the final exit.
>>
>> So hazard points could avoid some IPIs there too.
>>
>>> it sounds like the hazard pointer scheme might actually be pretty good.
>>
>> Some IPIs in the exit path just aren't that big a concern. I measured,
>> got numbers, tried to irritate it, just wasn't really a problem. Some
>> archs use IPIs for all threaded TLB shootdowns and exits not that
>> frequent. Very fast short lived processes that do a lot of exits just
>> don't tend to spread across a lot of CPUs leaving lazy tlb mms to shoot,
>> and long lived and multi threaded ones that do don't exit at high rates.
>>
>> So from what I can see it's premature optimization. Actually maybe not
>> even optimization because IIRC it adds complexity and even a barrier on
>> powerpc in the context switch path which is a lot more critical than
>> exit() for us we don't want slowdowns there.
>>
>> It's a pretty high complexity boutique kind of synchronization. Which
>> don't get me wrong is the kind of thing I like, it is clever and may be
>> perfectly bug free but it needs to prove itself over the simple dumb
>> shoot lazies approach.
>>
>>>>> So at least some of my arguments were based on me just mis-remembering
>>>>> what Nick's patch actually did (mainly because I mentally recreated
>>>>> the patch from "Nick did something like this" and what I thought would
>>>>> be the way to do it on x86).
>>>>
>>>> With powerpc with the radix MMU using IPI based shootdowns, we can 
>>>> actually do the switch-away-from-lazy on the final TLB flush and the
>>>> final broadcast shootdown thing becomes a no-op. I didn't post that
>>>> additional patch because it's powerpc-specific and I didn't want to
>>>> post more code so widely.
>>>>
>>>>> So I guess I have to recant my arguments.
>>>>> 
>>>>> I still think my "get rid of lazy at last mmput" model should work,
>>>>> and would be a perfect match for x86, but I can't really point to Nick
>>>>> having done that.
>>>>> 
>>>>> So I was full of BS.
>>>>> 
>>>>> Hmm. I'd love to try to actually create a patch that does that "Nick
>>>>> thing", but on last mmput() (ie when __mmput triggers). Because I
>>>>> think this is interesting. But then I look at my schedule for the
>>>>> upcoming week, and I go "I don't have a leg to stand on in this
>>>>> discussion, and I'm just all hot air".
>>>>
>>>> I agree Andy's approach is very complicated and adds more overhead than
>>>> necessary for powerpc, which is why I don't want to use it. I'm still
>>>> not entirely sure what the big problem would be to convert x86 to use
>>>> it, I admit I haven't kept up with the exact details of its lazy tlb
>>>> mm handling recently though.
>>> 
>>> The big problem is the entire remainder of this series!  If x86 is going to do shootdowns without mm_count, I want the result to work and be maintainable. A few of the issues that needed solving:
>>> 
>>> - x86 tracks usage of the lazy mm on CPUs that have it loaded into the MMU, not CPUs that have it in active_mm.  Getting this in sync needed core changes.
>>
>> Definitely should have been done at the time x86 deviated, but better 
>> late than never.
>>
> 
> I suspect that this code may predate there being anything for x86 to deviate from.

Interesting, active_mm came in 2.3.11 and x86's cpu_tlbstate[].active_mm 
2.3.43. Longer than I thought.

>>> 
>>> - mmgrab and mmdrop are barriers, and core code relies on that. If we get rid of a bunch of calls (conditionally), we need to stop depending on the barriers. I fixed this.
>>
>> membarrier relied on a call that mmdrop was providing. Adding a smp_mb()
>> instead if mmdrop is a no-op is fine. Patches changing membarrier's 
>> ordering requirements can be concurrent and are not fundmentally tied
>> to lazy tlb mm switching, it just reuses an existing ordering point.
> 
> smp_mb() is rather expensive on x86 at least, and I think on powerpc to.  Let's not.

You misunderstand me. If mmdrop is not there to provide the required 
ordering that membarrier needs, then the membarrier code does its own
smp_mb(). It's not _more_ expensive than before because the full barrier
from the mmdrop is gone.

> Also, IMO my cleanups here generally make sense and make the code better, so I think we should just go with them.

Sure if you can make the membarrier code better and avoid this ordering
requirement that's nice. This is orthogonal to the lazy tlb code though
and they can go in parallel (again aside from mechanical merge issues).

I'm not sure what you don't understand about this, that ordering is a
membarrier requirement, and it happens to be using an existing barrier
that the lazy mm code had anyway, which is perfectly fine and something
that is done all over the kernel in performance critical code.

>>> - There were too many mmgrab and mmdrop calls, and the call sites had different semantics and different refcounting rules (thanks, kthread).  I cleaned this up.
>>
>> Seems like a decent cleanup. Again lazy tlb specific, just general switch
>> code should be factored and better contained in kernel/sched/ which is
>> fine, but concurrent to lazy tlb improvements.
> 
> I personally rather dislike the model of:
> 
> ...
> mmgrab_lazy();
> ...
> mmdrop_lazy();
> ...
> mmgrab_lazy();
> ...
> mmdrop_lazy();
> ...
> 
> where the different calls have incompatible logic at the call sites and a magic config option NOPs them all away and adds barriers.  I'm personally a big fan of cleaning up code before making it even more complicated.

Not sure what model that is though. Call sites don't have to know 
anything at all about the options or barriers. The rule is simply
that the lazy tlb mm reference is manipulated with the _lazy variants.

It was just mostly duplicated code. Again, can go in parallel and no
dependencies other than mechanical merge.

> 
>>
>>> - If we do a shootdown instead of a refcount, then, when exit() tears down its mm, we are lazily using *that* mm when we do the shootdowns. If active_mm continues to point to the being-freed mm and an NMI tries to dereference it, we’re toast. I fixed those issues.
>>
>> My shoot lazies patch has no such issues with that AFAIKS. What exact 
>> issue is it and where did you fix it?
> 
> Without my unlazy_mm_irqs_off() (or something similar), x86's very sensible (and very old!) code to unlazy a lazy CPU when flushing leaves active_mm pointing to the old lazy mm.  That's not a problem at all on current kernels, but in a shootdown-lazy world, those active_mm pointers will stick around.  Even with that fixed, without some care, an NMI during the shootdown CPU could dereference ->active_mm at a time when it's not being actively kept alive.
> 
> Fixed by unlazy_mm_irqs_off(), the patches that use it, and the patches that make x86 stop inappropriately using ->active_mm.

Oh you're talking about some x86 specific issue where you would have
problems if you didn't do the port properly? Don't tell me this is why
you've been nacking my patches for 15 months.

>>> - If we do a UEFI runtime service call while lazy or a text_poke while lazy and the mm goes away while this is happening, we would blow up. Refcounting prevents this but, in current kernels, a shootdown IPI on x86 would not prevent this.  I fixed these issues (and removed duplicate code).
>>> 
>>> My point here is that the current lazy mm code is a huge mess. 90% of the complexity in this series is cleaning up core messiness and x86 messiness. I would still like to get rid of ->active_mm entirely (it appears to serve no good purpose on any architecture),  it that can be saved for later, I think.
>>
>> I disagree, the lazy mm code is very clean and simple. And I can't see 
>> how you would propose to remove active_mm from core code I'm skeptical
>> but would be very interested to see, but that's nothing to do with my
>> shoot lazies patch and can also be concurrent except for mechanical
>> merge issues.
> 
> I think we may just have to agree to disagree.  The more I looked at the lazy code, the more problems I found.  So I fixed them.  That work is done now (as far as I'm aware) except for rebasing and review.

I don't see any problems with the lazy tlb mm code outside arch/x86 or 
anything your series fixed with it aside from a bit of code duplication.

Anyway I will try to take a look over and review bits I can before the
5.18 merge window. For 5.17 my series has been ready to go for a year or
so and very small so let's merge that first since Linus wants to try go
with that approach rather than the refcount one.

Thanks,
Nick