On Sun, Jan 9, 2022 at 12:20 PM Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>
> Are you *sure*? The ASID management code on x86 is (as mentioned
> before) completely unaware of whether an ASID is actually in use
> anywhere.

Right. But the ASID situation on x86 is very very different, exactly
because x86 doesn't have cross-CPU TLB invalidates.

Put another way: x86 TLB hardware is fundamentally per-cpu. As such,
any ASID management is also per-cpu.

That's fundamentally not true on arm64. And that's not some "arm64
implementation detail". That's fundamental to doing cross-CPU TLB
invalidates in hardware.

If your TLB invalidates act across CPU's, then the state they act on
is also obviously across CPU's.

So the ASID situation is fundamentally different depending on the
hardware usage. On x86, TLB's are per-core, and on arm64 they are not,
and that's reflected in our code too.

As a result, on x86, each mm has a per-cpu ASID, and there's a small
per-cpu array of "mm->asid" mappings.

On arm, each mm has an asid, and it's allocated from a global asid
space - so there is no need for that "mm->asid" mapping, because the
asid is there in the mm, and it's shared across cpus.

That said, I still don't actually know the arm64 ASID management code.

The thing about TLB flushes is that it's ok to do them spuriously (as
long as you don't do _too_ many of them and tank performance), so two
different mm's can have the same hw ASID on two different cores, and
that just makes cross-CPU TLB invalidates overly aggressive.

You can't share an ASID on the _same_ core without flushing in between
context switches, because then the TLB on that core might be re-used
for a different mm. So the flushing rules aren't necessarily 100% 1:1
with the "in use" rules, and who knows whether the arm64 ASID
management actually ends up matching that whole "this lazy TLB is
still in use on another CPU" condition.

So I don't really know the arm64 situation. And it's possible that
lazy TLB isn't even worth it on arm64 in the first place.

> > So I think that even for that hardware TLB shootdown case, your patch
> > only adds overhead.
>
> The overhead is literally:
>
> exit_mmap();
> for each cpu still in mm_cpumask:
>   smp_load_acquire
>
> That's it, unless the mm is actually in use

Ok, now do this for a machine with 1024 CPU's.

And tell me it is "scalable".

> On a very large arm64 system, I would believe there could be real
> overhead. But these very large systems are exactly the systems that
> currently ping-pong mm_count.

Right.

But I think your arguments against mm_count are questionable.

I'd much rather have a *much* smaller patch that says "on x86 and
powerpc, we don't need this overhead at all".

And then the arm64 people can look at it and say "Yeah, we'll still do
the mm_count thing", or maybe say "Yeah, we can solve it another way".

And maybe the arm64 people actually say "Yeah, this hazard pointer
thing is perfect for us". That still doesn't necessarily argue for it
on an architecture that ends up serializing with an IPI anyway.

                Linus
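
PS: if it helps, here's a toy userspace model of the two ASID schemes
above. To be very clear: this is NOT the actual kernel code on either
architecture - every name in it (toy_mm, percpu_asids, x86_style_asid,
arm64_style_asid) is made up, and victim selection, flushing and ASID
rollover are all waved away. It only exists to show the structural
difference: a private per-cpu mm->asid table vs. one global asid
stored in the mm itself.

/*
 * Toy model of the two ASID schemes.  Purely illustrative: not the
 * kernel code, all names invented.  Build with: cc -std=c11 asid.c
 */
#include <stdio.h>

#define NR_CPUS        4
#define NR_PERCPU_ASID 6   /* small, like x86's handful of dynamic ASIDs */

struct toy_mm {
	int id;			/* stand-in for the mm itself */
	int global_asid;	/* arm64-style: one ASID, stored in the mm */
};

/* x86-style: each CPU has its own private "which mm has which asid" table. */
static struct toy_mm *percpu_asids[NR_CPUS][NR_PERCPU_ASID];
static int percpu_next[NR_CPUS];

static int x86_style_asid(int cpu, struct toy_mm *mm)
{
	/* Reuse the slot if this CPU has seen the mm recently ... */
	for (int i = 0; i < NR_PERCPU_ASID; i++)
		if (percpu_asids[cpu][i] == mm)
			return i;
	/* ... otherwise take the next slot round-robin.  The real code
	 * picks a victim and flushes it - on *this CPU only*. */
	int slot = percpu_next[cpu]++ % NR_PERCPU_ASID;
	percpu_asids[cpu][slot] = mm;
	return slot;
}

/* arm64-style: the ASID lives in the mm and is valid on every CPU,
 * because the hardware broadcasts TLB invalidates by ASID. */
static int next_global_asid = 1;

static int arm64_style_asid(struct toy_mm *mm)
{
	if (!mm->global_asid)
		mm->global_asid = next_global_asid++;	/* rollover ignored */
	return mm->global_asid;
}

int main(void)
{
	struct toy_mm a = { .id = 1 }, b = { .id = 2 };

	/* cpu1 runs b first, so its slot 0 is taken ... */
	x86_style_asid(1, &b);
	/* ... and the *same* mm ends up with *different* ASIDs per cpu: */
	printf("x86-ish:   mm a -> asid %d on cpu0, asid %d on cpu1\n",
	       x86_style_asid(0, &a), x86_style_asid(1, &a));

	/* ... while the global scheme gives one answer for all cpus: */
	printf("arm64-ish: mm a -> asid %d, mm b -> asid %d (everywhere)\n",
	       arm64_style_asid(&a), arm64_style_asid(&b));
	return 0;
}

The same mm showing up with different ASIDs on different cpus is
exactly why the x86 side has no global notion of an ASID being "in
use" anywhere.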
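
PS2: and here is the 1024-CPU scan in the same toy style. Again, this
is not the real patch - lazy_mm_hazard and mm_still_in_use are
invented stand-ins for the hazard pointers and the exit_mmap()-time
check - but it shows where the cost lives: one acquire load of a
remote per-cpu cache line for every CPU still set in mm_cpumask.

/*
 * Toy model of the exit-time scan quoted above.  Not the real patch;
 * names are invented stand-ins.  Build with: cc -std=c11 hazard.c
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 1024	/* the "now do this for 1024 CPU's" case */

struct toy_mm { int id; };

/* Per-cpu hazard pointer: the mm this CPU may still be using lazily. */
static _Atomic(struct toy_mm *) lazy_mm_hazard[NR_CPUS];

/* Which CPUs ever ran this mm (stand-in for mm_cpumask). */
static bool mm_cpumask[NR_CPUS];

static bool mm_still_in_use(struct toy_mm *mm)
{
	/* This loop is the overhead in question: up to nr_cpus acquire
	 * loads, each touching a different CPU's cache line. */
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (!mm_cpumask[cpu])
			continue;
		if (atomic_load_explicit(&lazy_mm_hazard[cpu],
					 memory_order_acquire) == mm)
			return true;
	}
	return false;
}

int main(void)
{
	struct toy_mm mm = { .id = 1 };

	/* Pretend the mm ran everywhere and cpu 937 still holds it lazily. */
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		mm_cpumask[cpu] = true;
	atomic_store_explicit(&lazy_mm_hazard[937], &mm, memory_order_release);

	printf("mm still in use: %s\n", mm_still_in_use(&mm) ? "yes" : "no");
	return 0;
}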
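
PS3: for completeness, the mm_count ping-pong Andy refers to, modeled
with threads standing in for CPUs. With refcounted lazy TLB, the
mmgrab()/mmdrop() pair on every lazy switch becomes atomic ops on one
shared counter, so the cache line bounces around the machine - which
is the cost his hazard pointers are trying to avoid. fake_cpu and
friends are, once more, made up for illustration.

/*
 * Toy model of the mm_count cache-line ping-pong.  Invented for
 * illustration; build with: cc -std=c11 -pthread pingpong.c
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NR_FAKE_CPUS 8
#define NR_SWITCHES  100000

static atomic_long mm_count = 1;	/* stand-in for mm->mm_count */

static void *fake_cpu(void *arg)
{
	(void)arg;
	for (int i = 0; i < NR_SWITCHES; i++) {
		/* "mmgrab()" when switching to the lazy mm ... */
		atomic_fetch_add_explicit(&mm_count, 1, memory_order_relaxed);
		/* ... "mmdrop()" when switching away.  Every pair bounces
		 * the mm_count cache line between the fake CPUs. */
		atomic_fetch_sub_explicit(&mm_count, 1, memory_order_relaxed);
	}
	return NULL;
}

int main(void)
{
	pthread_t t[NR_FAKE_CPUS];

	for (int i = 0; i < NR_FAKE_CPUS; i++)
		pthread_create(&t[i], NULL, fake_cpu, NULL);
	for (int i = 0; i < NR_FAKE_CPUS; i++)
		pthread_join(t[i], NULL);

	printf("final mm_count: %ld\n", atomic_load(&mm_count));
	return 0;
}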