Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms

On Sat, Jan 8, 2022, at 12:22 PM, Linus Torvalds wrote:
> On Sat, Jan 8, 2022 at 8:44 AM Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>>
>> To improve scalability, this patch adds a percpu hazard pointer scheme to
>> keep lazily-used mms alive.  Each CPU has a single pointer to an mm that
>> must not be freed, and __mmput() checks the pointers belonging to all CPUs
>> that might be lazily using the mm in question.
>
> Ugh. This feels horribly fragile to me, and also looks like it makes
> some common cases potentially quite expensive for machines with large
> CPU counts if they don't do that mm_cpumask optimization - which in
> turn feels quite fragile as well.
>
> IOW, this just feels *complicated*.
>
> And I think it's overly so. I get the strong feeling that we could
> make the rules much simpler and more straightforward.
>
> For example, how about we make the rules be

There, there, Linus, not everything is as simple^Wincapable as x86 bare metal, and mm_cpumask does not have useful cross-arch semantics.  Is that good?

>
>  - a lazy TLB mm reference requires that there's an actual active user
> of that mm (ie "mm_users > 0")
>
>  - the last mm_users decrement (ie __mmput) forces a TLB flush, and
> that TLB flush must make sure that no lazy users exist (which I think
> it does already anyway).

It does, on x86 bare metal, in exit_mmap().  It’s implicit, but it could be made explicit, as below.

>
> Doesn't that seem like a really simple set of rules?
>
> And the nice thing about it is that we *already* do that required TLB
> flush in all normal circumstances. __mmput() already calls
> exit_mmap(), and exit_mm() already forces that TLB flush in every
> normal situation.

Exactly. On x86 bare metal and similar architectures, this flush is done by IPI, which involves a loop over all CPUs that might be using the mm.  Other patches in this series add the core ability to shoot down the lazy TLB cleanly (so that the core drops its reference) and wire it up for x86.

>
> So we might have to make sure that every architecture really does that
> "drop lazy mms on TLB flush", and maybe add a flag to the existing
> 'struct mmu_gather tlb' to make sure that flush actually always
> happens (even if the process somehow managed to unmap all vma's even
> before exiting).

So this requires that all architectures actually walk all relevant CPUs to see if an IPI is needed and send that IPI.  On architectures that actually need an IPI anyway (x86 bare metal, powerpc (I think), and others), fine. But on architectures with a broadcast-to-all-CPUs flush (arm64, IIUC), the extra IPI will be much, much slower than a simple load-acquire in a loop.

In fact, arm64 doesn’t even track mm_cpumask at all, last time I checked, so even an IPI-based lazy shootdown would require looping over *all* CPUs, doing a load-acquire, and possibly sending an IPI. I much prefer just doing a load-acquire and possibly a cmpxchg.

(And x86 PV can do hypercall flushes.  If a bunch of vCPUs are not running, an IPI shootdown will end up sleeping until they run, whereas this patch lets the hypervisor leave them asleep, so __mmput can finish without waking them. This only matters on a CPU-oversubscribed host, but still.  And it kind of looks like hardware remote flushes are coming in AMD land eventually.)

But yes, I fully agree that this patch is complicated and subtle.

>
> Is there something silly I'm missing? Somebody pat me on the head, and
> say "There, there, Linus, don't try to get involved with things you
> don't understand.." and explain to me in small words.
>
>                   Linus



