Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms

On Sun, Jan 9, 2022 at 12:20 PM Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>
> Are you *sure*? The ASID management code on x86 is (as mentioned
> before) completely unaware of whether an ASID is actually in use
> anywhere.

Right.

But the ASID situation on x86 is very very different, exactly because
x86 doesn't have cross-CPU TLB invalidates.

Put another way: x86 TLB hardware is fundamentally per-cpu. As such,
any ASID management is also per-cpu.

That's fundamentally not true on arm64.  And that's not some "arm64
implementation detail". That's fundamental to doing cross-CPU TLB
invalidates in hardware.

If your TLB invalidates act across CPU's, then the state they act on
is also obviously cross-CPU.

So the ASID situation is fundamentally different depending on the
hardware usage. On x86, TLB's are per-core, and on arm64 they are not,
and that's reflected in our code too.

As a result, on x86, each mm has a per-cpu ASID, and there's a small
array of per-cpu "mm->asid" mappings.

On arm64, each mm has an ASID allocated from a global ASID space - so
there is no need for that "mm->asid" mapping, because the ASID is
right there in the mm, and it's shared across CPUs.

That said, I still don't actually know the arm64 ASID management code.

The thing about TLB flushes is that it's ok to do them spuriously (as
long as you don't do _too_ many of them and tank performance), so two
different mm's can have the same hw ASID on two different cores and
that just makes cross-CPU TLB invalidates too aggressive. You can't
share an ASID on the _same_ core without flushing in between context
switches, because then the TLB entries on that core might be re-used
for a different mm. So the flushing rules aren't necessarily 100% 1:1
with the "in use" rules, and who knows whether the arm64 ASID
management actually ends up matching that whole "this lazy TLB is
still in use on another CPU" tracking at all.

So I don't really know the arm64 situation. And it's possible that lazy
TLB isn't even worth it on arm64 in the first place.

> > So I think that even for that hardware TLB shootdown case, your patch
> > only adds overhead.
>
> The overhead is literally:
>
> exit_mmap();
> for each cpu still in mm_cpumask:
>   smp_load_acquire
>
> That's it, unless the mm is actually in use

Ok, now do this for a machine with 1024 CPU's.

And tell me it is "scalable".

> On a very large arm64 system, I would believe there could be real overhead.  But these very large systems are exactly the systems that currently ping-pong mm_count.

Right.

But I think your arguments against mm_count are questionable.

I'd much rather have a *much* smaller patch that says "on x86 and
powerpc, we don't need this overhead at all".

And then the arm64 people can look at it and say "Yeah, we'll still do
the mm_count thing", or maybe say "Yeah, we can solve it another way".

And maybe the arm64 people actually say "Yeah, this hazard pointer
thing is perfect for us". That still doesn't necessarily argue for it
on an architecture that ends up serializing with an IPI anyway.

                Linus



