On Mon, Jan 10, 2022, at 7:10 PM, Nicholas Piggin wrote:
> Excerpts from Andy Lutomirski's message of January 11, 2022 6:52 am:
>>
>> On Sun, Jan 9, 2022, at 8:56 PM, Nicholas Piggin wrote:
>>> Excerpts from Linus Torvalds's message of January 10, 2022 7:51 am:
>>>> [ Ugh, I actually went back and looked at Nick's patches again, to
>>>> just verify my memory, and they weren't as pretty as I thought they
>>>> were ]
>>>>
>>>> On Sun, Jan 9, 2022 at 12:48 PM Linus Torvalds
>>>> <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>>>>>
>>>>> I'd much rather have a *much* smaller patch that says "on x86 and
>>>>> powerpc, we don't need this overhead at all".
>>>>
>>>> For some reason I thought Nick's patch worked at "last mmput" time and
>>>> the TLB flush IPIs that happen at that point anyway would then make
>>>> sure any lazy TLB is cleaned up.
>>>>
>>>> But that's not actually what it does. It ties the
>>>> MMU_LAZY_TLB_REFCOUNT to an explicit TLB shootdown triggered by the
>>>> last mmdrop() instead. Because it really tied the whole logic to the
>>>> mm_count logic (and made lazy tlb not do mm_count) rather than the
>>>> mm_users thing I mis-remembered it doing.
>>>
>>> It does this because on powerpc with the hash MMU, we can't use IPIs
>>> for TLB shootdowns.
>>
>> I know nothing about powerpc's mmu. If you can't do IPI shootdowns,
>
> The paravirtualised hash MMU environment doesn't because it has
> single-level translation and the guest uses hypercalls to insert and
> remove translations while the hypervisor flushes TLBs. The HV could
> flush TLBs with IPIs, but obviously the guest can't use those to
> execute the lazy switch. In radix guests (and all bare metal) the OS
> flushes its own TLBs.
>
> We are moving over to radix, but powerpc also has a hardware broadcast
> flush instruction, which can be a bit faster than IPIs and is usable
> by bare metal and radix guests, so those can also avoid the IPIs if
> they want. Part of the powerpc patch I just sent to combine the lazy
> switch with the final TLB flush is to force it to always take the IPI
> path and not use the TLBIE instruction on the final exit.
>
> So hazard pointers could avoid some IPIs there too.
>
>> it sounds like the hazard pointer scheme might actually be pretty good.
>
> Some IPIs in the exit path just aren't that big a concern. I measured,
> got numbers, tried to irritate it, and it just wasn't really a problem.
> Some archs use IPIs for all threaded TLB shootdowns, and exits are not
> that frequent. Very fast, short-lived processes that do a lot of exits
> just don't tend to spread across a lot of CPUs leaving lazy tlb mms to
> shoot, and long-lived, multi-threaded ones that do spread don't exit
> at high rates.
>
> So from what I can see it's premature optimization. Actually maybe not
> even optimization, because IIRC it adds complexity and even a barrier
> on powerpc in the context switch path, which is a lot more critical
> than exit() for us; we don't want slowdowns there.
>
> It's a pretty high complexity, boutique kind of synchronization. Which,
> don't get me wrong, is the kind of thing I like; it is clever and may
> be perfectly bug free, but it needs to prove itself over the simple,
> dumb shoot-lazies approach.
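To make sure we're all arguing about the same scheme: my mental model
of shoot-lazies is roughly the sketch below. This is my paraphrase, not
Nick's actual patch, and the helper names are illustrative.

static void do_shoot_lazy_tlb(void *arg)
{
        struct mm_struct *mm = arg;

        /*
         * If this CPU is still lazily borrowing the dying mm, switch
         * it over to init_mm so that no lazy reference remains.
         */
        if (current->active_mm == mm) {
                WARN_ON_ONCE(current->mm);
                current->active_mm = &init_mm;
                switch_mm(mm, &init_mm, current);
        }
}

static void shoot_lazy_tlbs(struct mm_struct *mm)
{
        /*
         * IPI every CPU that may still hold this mm as its lazy
         * active_mm, and wait for them all to switch away.  After
         * this, the mm can be torn down without any lazy refcounting.
         */
        on_each_cpu_mask(mm_cpumask(mm), do_shoot_lazy_tlb, (void *)mm, 1);
}

The tradeoff is an IPI burst at the final teardown instead of an atomic
on every lazy context switch, which is exactly what the numbers above
are arguing about.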
>
>>>> So at least some of my arguments were based on me just mis-remembering
>>>> what Nick's patch actually did (mainly because I mentally recreated
>>>> the patch from "Nick did something like this" and what I thought would
>>>> be the way to do it on x86).
>>>
>>> With powerpc using the radix MMU with IPI-based shootdowns, we can
>>> actually do the switch-away-from-lazy on the final TLB flush, and the
>>> final broadcast shootdown thing becomes a no-op. I didn't post that
>>> additional patch because it's powerpc-specific and I didn't want to
>>> post more code so widely.
>>>
>>>> So I guess I have to recant my arguments.
>>>>
>>>> I still think my "get rid of lazy at last mmput" model should work,
>>>> and would be a perfect match for x86, but I can't really point to Nick
>>>> having done that.
>>>>
>>>> So I was full of BS.
>>>>
>>>> Hmm. I'd love to try to actually create a patch that does that "Nick
>>>> thing", but on last mmput() (ie when __mmput triggers). Because I
>>>> think this is interesting. But then I look at my schedule for the
>>>> upcoming week, and I go "I don't have a leg to stand on in this
>>>> discussion, and I'm just all hot air".
>>>
>>> I agree Andy's approach is very complicated and adds more overhead than
>>> necessary for powerpc, which is why I don't want to use it. I'm still
>>> not entirely sure what the big problem would be in converting x86 to
>>> use it; I admit I haven't kept up with the exact details of its lazy
>>> tlb mm handling recently, though.
>>
>> The big problem is the entire remainder of this series! If x86 is going
>> to do shootdowns without mm_count, I want the result to work and be
>> maintainable. A few of the issues that needed solving:
>>
>> - x86 tracks usage of the lazy mm on CPUs that have it loaded into the
>>   MMU, not CPUs that have it in active_mm. Getting this in sync needed
>>   core changes.
>
> Definitely should have been done at the time x86 deviated, but better
> late than never.

I suspect that this code may predate there being anything for x86 to
deviate from.

>> - mmgrab and mmdrop are barriers, and core code relies on that. If we
>>   get rid of a bunch of calls (conditionally), we need to stop
>>   depending on the barriers. I fixed this.
>
> membarrier relied on a barrier that mmdrop was providing. Adding an
> smp_mb() instead, when mmdrop is a no-op, is fine. Patches changing
> membarrier's ordering requirements can be concurrent and are not
> fundamentally tied to lazy tlb mm switching; it just reuses an existing
> ordering point.

smp_mb() is rather expensive on x86 at least, and I think on powerpc
too. Let's not. Also, IMO my cleanups here generally make sense and
make the code better, so I think we should just go with them.

>> - There were too many mmgrab and mmdrop calls, and the call sites had
>>   different semantics and different refcounting rules (thanks,
>>   kthread). I cleaned this up.
>
> Seems like a decent cleanup. Again, not lazy tlb specific; general
> switch code should be factored and better contained in kernel/sched/,
> which is fine, but that's concurrent with lazy tlb improvements.

I personally rather dislike the model of:

        ...
        mmgrab_lazy();
        ...
        mmdrop_lazy();
        ...
        mmgrab_lazy();
        ...
        mmdrop_lazy();
        ...

where the different calls have incompatible logic at the call sites and
a magic config option NOPs them all away and adds barriers. I'm
personally a big fan of cleaning up code before making it even more
complicated.
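Spelled out, the shape I'm objecting to is roughly this (my sketch of
the pattern, not the exact code in the series):

static inline void mmgrab_lazy_tlb(struct mm_struct *mm)
{
        /* Take a lazy reference only when lazy refcounting is on. */
        if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
                mmgrab(mm);
}

static inline void mmdrop_lazy_tlb(struct mm_struct *mm)
{
        if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
                mmdrop(mm);
        else
                smp_mb();       /* some callers rely on mmdrop's barrier */
}

Each call site has to know whether it's holding a real reference or a
maybe-compiled-out lazy one, and whether it's entitled to the implied
barrier. That's the kind of thing I'd like sorted out before we pile a
shootdown scheme on top.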
>
>> - If we do a shootdown instead of a refcount, then, when exit() tears
>>   down its mm, we are lazily using *that* mm when we do the
>>   shootdowns. If active_mm continues to point to the being-freed mm
>>   and an NMI tries to dereference it, we're toast. I fixed those
>>   issues.
>
> My shoot lazies patch has no such issue AFAIKS. What exact issue is
> it, and where did you fix it?

Without my unlazy_mm_irqs_off() (or something similar), x86's very
sensible (and very old!) code to unlazy a lazy CPU when flushing leaves
active_mm pointing to the old lazy mm. That's not a problem at all on
current kernels, but in a shootdown-lazy world, those active_mm
pointers will stick around. Even with that fixed, without some care, an
NMI during the shootdown could dereference ->active_mm at a time when
it's not being actively kept alive. Fixed by unlazy_mm_irqs_off(), the
patches that use it, and the patches that make x86 stop inappropriately
using ->active_mm.

>
>> - If we do a UEFI runtime service call while lazy, or a text_poke
>>   while lazy, and the mm goes away while this is happening, we would
>>   blow up. Refcounting prevents this but, in current kernels, a
>>   shootdown IPI on x86 would not. I fixed these issues (and removed
>>   duplicate code).
>>
>> My point here is that the current lazy mm code is a huge mess. 90% of
>> the complexity in this series is cleaning up core messiness and x86
>> messiness. I would still like to get rid of ->active_mm entirely (it
>> appears to serve no good purpose on any architecture), but that can be
>> saved for later, I think.
>
> I disagree; the lazy mm code is very clean and simple. And I can't see
> how you would propose to remove active_mm from core code. I'm
> skeptical but would be very interested to see it. But that's nothing
> to do with my shoot lazies patch and can also be concurrent, except
> for mechanical merge issues.

I think we may just have to agree to disagree. The more I looked at the
lazy code, the more problems I found. So I fixed them. That work is
done now (as far as I'm aware) except for rebasing and review.

--Andy