On Tue, Jan 11, 2022, at 2:39 AM, Will Deacon wrote:
> Hi Andy, Linus,
>
> On Sun, Jan 09, 2022 at 12:48:42PM -0800, Linus Torvalds wrote:
>> On Sun, Jan 9, 2022 at 12:20 PM Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>> That said, I still don't actually know the arm64 ASID management code.
>
> That appears to be a common theme in this thread, so hopefully I can shed
> some light on the arm64 side of things:

Thanks!

> FWIW, we have a TLA+ model of some of this, which may (or may not) be easier
> to follow than my text:

Yikes.  Your fine hardware engineers should consider 64-bit ASIDs :)  I
suppose x86-on-AMD could copy this, but eww.  OTOH x86 can easily have more
CPUs than ASIDs, so maybe not.

>   https://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/kernel-tla.git/tree/asidalloc.tla
>
> although the syntax is pretty hard going :(
>
>> The thing about TLB flushes is that it's ok to do them spuriously (as
>> long as you don't do _too_ many of them and tank performance), so two
>> different mm's can have the same hw ASID on two different cores and
>> that just makes cross-CPU TLB invalidates too aggressive.  You can't
>> share an ASID on the _same_ core without flushing in between context
>> switches, because then the TLB on that core might be re-used for a
>> different mm.  So the flushing rules aren't necessarily 100% 1:1 with
>> the "in use" rules, and who knows if the arm64 ASID management
>> actually ends up just matching that whole "this lazy TLB is still
>> in use on another CPU" tracking.
>
> The shared TLBs (Arm calls this "Common-not-private") make this problematic,
> as the TLB is no longer necessarily per-core.
>
>> So I don't really know the arm64 situation.  And it's possible that lazy
>> TLB isn't even worth it on arm64 in the first place.
>
> ASID allocation aside, I think there are a few useful things to point out
> for arm64:
>
>  - We only have "local" or "all" TLB invalidation; nothing targeted
>    (and for KVM guests this is always "all").
>
>  - Most mms end up running on more than one CPU (at least, when I
>    last looked at this, a fork+exec would end up with the mm having
>    been installed on two CPUs)
>
>  - We don't track mm_cpumask, as it showed up as a bottleneck in the
>    past and, because of the earlier points, it wasn't very useful
>    anyway
>
>  - mmgrab() should be fast for us (it's a posted atomic add),
>    although mmdrop() will be slower as it has to return data to
>    check against the count going to zero.
>
> So it doesn't feel like an obvious win to me for us to scan these new hazard
> pointers on arm64.  At least, I would love to see some numbers if we're going
> to make changes here.

I will table the hazard pointer scheme, then, and adjust the series to do
shootdowns instead.

I would guess that once arm64 hits a few hundred CPUs, you'll start finding
workloads where mmdrop() at least starts to hurt.  But we can cross that
bridge when we get to it.