On Sat, Jan 8, 2022 at 7:59 PM Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>
> > Hmm. The x86 maintainers are on this thread, but they aren't even the
> > problem. Adding Catalin and Will to this, I think they should know
> > if/how this would fit with the arm64 ASID allocator.
> >
>
> Well, I am an x86 mm maintainer, and there is definitely a performance
> problem on large x86 systems right now. :)

Well, my point was that on x86, the complexities of the patch you
posted are completely pointless.

So on x86, you can just remove the mmgrab/mmdrop reference counts from
the lazy mm use entirely, and voila, that performance problem is gone.
We don't _need_ reference counting on x86 at all, if we just say that
the rule is that a lazy mm is always associated with an
honest-to-goodness live mm.

So on x86 - and any platform with the IPI model - there is no need for
hundreds of lines of complexity at all.

THAT is my point. Your patch adds complexity that buys you ABSOLUTELY
NOTHING.

You then saying that the mmgrab/mmdrop is a performance problem is
just trying to muddy the water. You can just remove it entirely.

Now, I do agree that that depends on the whole "TLB IPI will get rid
of any lazy mm users on other cpus". So I agree that if you have
hardware TLB invalidation that then doesn't have that software
component to it, you need something else.

But my argument _then_ was that hardware TLB invalidation then needs
the hardware ASID thing to be useful, and the ASID management code
already effectively keeps track of "this ASID is used on other CPUs".
And that's exactly the same kind of information that your patch
basically added a separate percpu array for.

So I think that even for that hardware TLB shootdown case, your patch
only adds overhead.

And it potentially adds a *LOT* of overhead, if you replace an atomic
refcount with a "for_each_possible_cpu()" loop that has to do cmpxchg
things too.

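To make the cost comparison above concrete, here is a minimal userspace
sketch in C11. It is not kernel code and not the patch under discussion:
"struct mm", "lazy_slot", "lazy_grab", "lazy_drop", "clear_lazy_users"
and the NR_CPUS value are all hypothetical names invented for
illustration. It only contrasts the shape of the two schemes: one atomic
operation on a shared counter versus a scan over every possible CPU with
a compare-and-exchange per slot.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS 8 /* hypothetical number of possible CPUs */

/* Hypothetical stand-in for an mm with a shared reference count. */
struct mm {
    atomic_int count;
};

/* Scheme A: a lazy user takes and drops one shared atomic refcount. */
static void lazy_grab(struct mm *mm)
{
    atomic_fetch_add(&mm->count, 1);
}

/* Returns true when the caller dropped the last reference. */
static bool lazy_drop(struct mm *mm)
{
    return atomic_fetch_sub(&mm->count, 1) == 1;
}

/* Scheme B: each CPU has a slot naming the mm it is lazily using. */
static _Atomic(struct mm *) lazy_slot[NR_CPUS];

/*
 * Teardown under scheme B: visit every possible CPU and try to swap
 * its slot from mm to NULL. The cost grows with the number of CPUs,
 * whether or not they ever used this mm. Returns the number of slots
 * that were cleared.
 */
static int clear_lazy_users(struct mm *mm)
{
    int cleared = 0;

    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        struct mm *expected = mm;

        if (atomic_compare_exchange_strong(&lazy_slot[cpu], &expected,
                                           (struct mm *)NULL))
            cleared++;
    }
    return cleared;
}
```

In this sketch, scheme A costs one atomic read-modify-write per grab or
drop on a cache line shared by all users, while scheme B moves the cost
to teardown, where it is proportional to NR_CPUS. Which one is cheaper
in practice depends on the machine and workload; the sketch makes no
claim about that.
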
Now, on x86 we maintain that mm_cpumask, and as a result that overhead
is much lower - but we maintain that mm_cpumask exactly *because* we
do that IPI thing, so I don't think you can use that argument in favor
of your patch. When we do the IPI thing, your patch is worthless
overhead.

See?

Btw, you don't even need to really solve the arm64 TLB invalidate
thing - we could make the rule be that we only do the mmgrab/mmdrop at
all on platforms that don't do that IPI flush.

I think that's basically exactly what Nick Piggin wanted to do on
powerpc, no?

But you hated that patch, for non-obvious reasons, and are now
introducing this new patch that is clearly non-optimal on x86.

So I think there's some intellectual dishonesty on your part here.

                Linus