Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms

On Sat, Jan 8, 2022, at 4:53 PM, Linus Torvalds wrote:
> [ Let's try this again, without the html crud this time. Apologies to
> the people who see this reply twice ]
>
> On Sat, Jan 8, 2022 at 2:04 PM Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>>
>> So this requires that all architectures actually walk all relevant
>> CPUs to see if an IPI is needed and send that IPI. On architectures
>> that actually need an IPI anyway (x86 bare metal, powerpc (I think)
>> and others), fine. But on architectures with a broadcast-to-all-CPUs
>> flush (ARM64 IIUC), then the extra IPI will be much much slower than a
>> simple load-acquire in a loop.
>
> ... hmm. How about a hybrid scheme?
>
>  (a) architectures that already require that IPI anyway for TLB
> invalidation (ie x86, but others too), just make the rule be that the
> TLB flush by exit_mmap() get rid of any lazy TLB mm references. Which
> they already do.
>
>  (b) architectures like arm64 that do hw-assisted TLB shootdown will
> have an ASID allocator model, and what you do is to use that to either
>     (b') increment/decrement the mm_count at mm ASID allocation/freeing time
>     (b'') use the existing ASID tracking data to find the CPU's that
> have that ASID
>
>  (c) can you really imagine hardware TLB shootdown without ASID
> allocation? That doesn't seem to make sense. But if it exists, maybe
> that kind of crazy case would do the percpu array walking.
>

So I can go over a handful of TLB flush schemes:

1. x86 bare metal.  As noted, just plain shootdown would work.  (Unless we switch to inexact mm_cpumask() tracking, which might be enough of a win that it's worth it.)  Right now, "ASID" (i.e. PCID, thanks Intel) is allocated per CPU.  PCIDs are never explicitly freed -- they just expire off a percpu LRU.  The data structures have no idea whether an mm still exists -- instead they track mm->context.ctx_id, which is 64 bits and never reused.

2. x86 paravirt.  This is just like bare metal except there's a hypercall to flush a specific target cpu.  (I think this is mutually exclusive with PCID, but I'm not sure.  I haven't looked that hard.  I'm not sure exactly what is implemented right now.  It could be an operation to flush (cpu, pcid), but that gets awkward for reasons that aren't too relevant to this discussion.)  In this model, the exit_mmap() shootdown would either need to switch to a non-paravirt flush or we need a fancy mm_count solution of some sort.

3. Hypothetical better x86.  AMD has INVLPGB, which is almost useless right now.  But it's *so* close to being very useful, and I've asked engineers at AMD and Intel to improve this.  Specifically, I want PCID to be widened to 64 bits.  (This would, as I understand it, not affect the TLB hardware at all.  It would affect the tiny table that sits in front of the real PCID and maintains the illusion that PCID is 12 bits, and it would affect the MOV CR3 instruction.  The latter makes it complicated.)  And INVLPGB would invalidate a given 64-bit PCID system-wide.  In this model, there would be no such thing as freeing an ASID.  So I think we would want something very much like this patch.

4. ARM64.  I only barely understand it, but I think it's an intermediate scheme with ASIDs that are wide enough to be useful but narrow enough to run out on occasion.  I don't think they're tracked -- I think the whole world just gets invalidated when they overflow.  I could be wrong.

In any event, ASID lifetimes aren't a magic solution -- how do we know when to expire an ASID?  Presumably it would be when an mm is fully freed (__mmdrop), which gets us right back to square one.

In any case, what I particularly like about my patch is that, while it's subtle, it's subtle just once.  I think it can handle all the interesting arch cases by merely redefining for_each_possible_lazymm_cpu() to do the right thing.

> (Honesty in advertising: I don't know the arm64 ASID code - I used to
> know the old alpha version I wrote in a previous lifetime - but afaik
> any ASID allocator has to be able to track CPU's that have a
> particular ASID in use and be able to invalidate it).
>
> Hmm. The x86 maintainers are on this thread, but they aren't even the
> problem. Adding Catalin and Will to this, I think they should know
> if/how this would fit with the arm64 ASID allocator.
>

Well, I am an x86 mm maintainer, and there is definitely a performance problem on large x86 systems right now. :)

> Will/Catalin, background here:
>
>    
> https://lore.kernel.org/all/CAHk-=wj4LZaFB5HjZmzf7xLFSCcQri-WWqOEJHwQg0QmPRSdQA@xxxxxxxxxxxxxx/
>
> for my objection to that special "keep non-refcounted magic per-cpu
> pointer to lazy tlb mm".
>
>            Linus

--Andy
