On Tue, Jan 11, 2022, at 2:39 AM, Will Deacon wrote:
> Hi Andy, Linus,
>
> On Sun, Jan 09, 2022 at 12:48:42PM -0800, Linus Torvalds wrote:
>> On Sun, Jan 9, 2022 at 12:20 PM Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>> That said, I still don't actually know the arm64 ASID management code.
>
> That appears to be a common theme in this thread, so hopefully I can shed
> some light on the arm64 side of things:

Thanks!

> FWIW, we have a TLA+ model of some of this, which may (or may not) be easier
> to follow than my text:

Yikes.  Your fine hardware engineers should consider 64-bit ASIDs :)  I
suppose x86-on-AMD could copy this, but eww.  OTOH x86 can easily have more
CPUs than ASIDs, so maybe not.

>   https://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/kernel-tla.git/tree/asidalloc.tla
>
> although the syntax is pretty hard going :(
>
>> The thing about TLB flushes is that it's ok to do them spuriously (as
>> long as you don't do _too_ many of them and tank performance), so two
>> different mm's can have the same hw ASID on two different cores and
>> that just makes cross-CPU TLB invalidates too aggressive.  You can't
>> share an ASID on the _same_ core without flushing in between context
>> switches, because then the TLB on that core might be re-used for a
>> different mm.  So the flushing rules aren't necessarily 100% 1:1 with
>> the "in use" rules, and who knows if the arm64 ASID management
>> actually ends up just matching that whole "this lazy TLB is still
>> in use on another CPU" tracking.
>
> The shared TLBs (Arm calls this "Common-not-private") make this problematic,
> as the TLB is no longer necessarily per-core.
>
>> So I don't really know the arm64 situation.  And it's possible that lazy
>> TLB isn't even worth it on arm64 in the first place.
>
> ASID allocation aside, I think there are a few useful things to point out
> for arm64:
>
>  - We only have "local" or "all" TLB invalidation; nothing targeted
>    (and for KVM guests this is always "all").
>
>  - Most mms end up running on more than one CPU (at least, when I
>    last looked at this, a fork+exec would end up with the mm having
>    been installed on two CPUs)
>
>  - We don't track mm_cpumask, as it showed up as a bottleneck in the
>    past and, because of the earlier points, it wasn't very useful
>    anyway
>
>  - mmgrab() should be fast for us (it's a posted atomic add),
>    although mmdrop() will be slower as it has to return data to
>    check against the count going to zero.
>
> So it doesn't feel like an obvious win to me for us to scan these new hazard
> pointers on arm64.  At least, I would love to see some numbers if we're going
> to make changes here.

I will table the hazard pointer scheme, then, and adjust the series to do
shootdowns instead.

I would guess that once arm64 hits a few hundred CPUs, you'll start finding
workloads where mmdrop() at least starts to hurt.  But we can cross that
bridge when we get to it.