Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms

On Sat, Jan 8, 2022, at 8:38 PM, Linus Torvalds wrote:
> On Sat, Jan 8, 2022 at 7:59 PM Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>>
>> > Hmm. The x86 maintainers are on this thread, but they aren't even the
>> > problem. Adding Catalin and Will to this, I think they should know
>> > if/how this would fit with the arm64 ASID allocator.
>> >
>>
>> Well, I am an x86 mm maintainer, and there is definitely a performance problem on large x86 systems right now. :)
>
> Well, my point was that on x86, the complexities of the patch you
> posted are completely pointless.
>
> So on x86, you can just remove the mmgrab/mmdrop reference counts from
> the lazy mm use entirely, and voila, that performance problem is gone.
> We don't _need_ reference counting on x86 at all, if we just say that
> the rule is that a lazy mm is always associated with a
> honest-to-goodness live mm.
>
> So on x86 - and any platform with the IPI model - there is no need for
> hundreds of lines of complexity at all.
>
> THAT is my point. Your patch adds complexity that buys you ABSOLUTELY NOTHING.
>
> You then saying that the mmgrab/mmdrop is a performance problem is
> just trying to muddy the water. You can just remove it entirely.
>
> Now, I do agree that that depends on the whole "TLB IPI will get rid
> of any lazy mm users on other cpus". So I agree that if you have
> hardware TLB invalidation that then doesn't have that software
> component to it, you need something else.
>
> But my argument _then_ was that hardware TLB invalidation then needs
> the hardware ASID thing to be useful, and the ASID management code
> already effectively keeps track of "this ASID is used on other CPU's".
> And that's exactly the same kind of information that your patch
> basically added a separate percpu array for.
>

Are you *sure*?  The ASID management code on x86 is (as mentioned before) completely unaware of whether an ASID is actually in use anywhere.  The x86 ASID code is a per-cpu LRU -- it tracks whether an mm has been recently used on a cpu, not whether the mm exists.  If an mm hasn't been used recently, the ASID gets recycled.  If we had more bits, we wouldn't even recycle it.  An ASID can and does get recycled while the mm still exists.
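
For concreteness, here's a simplified, self-contained sketch of that per-cpu LRU, modeled loosely on choose_new_asid() in arch/x86/mm/tlb.c (per-cpu-ness flattened, struct layout abbreviated, PCID feature checks elided).  The point is that nothing in it knows or cares whether an evicted mm still exists:

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t u64;
typedef uint16_t u16;

#define TLB_NR_DYN_ASIDS 6

struct tlb_slot { u64 ctx_id; u64 tlb_gen; };

/* Really per-cpu in the kernel; flattened for the sketch. */
static struct tlb_slot slots[TLB_NR_DYN_ASIDS];
static u16 next_asid;

static u16 choose_asid(u64 ctx_id, u64 next_tlb_gen, bool *need_flush)
{
        u16 asid;

        /* Recently used on this CPU?  Reuse the slot; flush only if stale. */
        for (asid = 0; asid < TLB_NR_DYN_ASIDS; asid++) {
                if (slots[asid].ctx_id != ctx_id)
                        continue;
                *need_flush = slots[asid].tlb_gen < next_tlb_gen;
                return asid;
        }

        /* Not seen recently: steal the next slot round-robin, no matter
         * whether the previous owner's mm is still alive somewhere. */
        asid = next_asid++ % TLB_NR_DYN_ASIDS;
        slots[asid].ctx_id = ctx_id;
        slots[asid].tlb_gen = next_tlb_gen;
        *need_flush = true;
        return asid;
}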

> So I think that even for that hardware TLB shootdown case, your patch
> only adds overhead.

The overhead is literally:

exit_mmap();
for each cpu still in mm_cpumask:
  smp_load_acquire

That's it, unless the mm is actually in use on that CPU, in which case there's also a cmpxchg.
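
In pseudo-C, with every name here invented for illustration (this is a sketch of the cost structure under my scheme, not the actual patch):

#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS 64                      /* illustrative */

struct mm_struct;
extern struct mm_struct init_mm;

/* Hypothetical per-cpu hazard slots: the mm each CPU is using lazily. */
static _Atomic(struct mm_struct *) lazy_mm_hazard[NR_CPUS];

/* Invented stand-in for the mm_cpumask test. */
bool cpu_in_mm_cpumask(int cpu, struct mm_struct *mm);

/* Called at exit_mmap() time: evict any remaining lazy users of @mm. */
static void mm_unlazy(struct mm_struct *mm)
{
        for (int cpu = 0; cpu < NR_CPUS; cpu++) {
                struct mm_struct *seen;

                if (!cpu_in_mm_cpumask(cpu, mm))
                        continue;       /* CPU never ran this mm */

                /* The common case: one acquire load per candidate CPU. */
                seen = atomic_load_explicit(&lazy_mm_hazard[cpu],
                                            memory_order_acquire);
                if (seen != mm)
                        continue;

                /* Only a CPU actually holding the mm lazily costs a
                 * cmpxchg, repointing its lazy reference at init_mm. */
                atomic_compare_exchange_strong(&lazy_mm_hazard[cpu],
                                               &seen, &init_mm);
        }
}

That loop is the entire price the exit side pays.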

>
> And it potentially adds a *LOT* of overhead, if you replace an atomic
> refcount with a "for_each_possible_cpu()" loop that has to do cmpxchg
> things too.

The cmpxchg is only in the case in which the mm is actually in use on that CPU.  I'm having trouble imagining a workload in which the loop is even measurable unless the bit scan itself is somehow problematic.

On a very large arm64 system, I would believe there could be real overhead.  But these very large systems are exactly the systems that currently ping-pong mm_count.

>
> Btw, you don't even need to really solve the arm64 TLB invalidate
> thing - we could make the rule be that we only do the mmgrab/mmput at
> all on platforms that don't do that IPI flush.
>
> I think that's basically exactly what Nick Piggin wanted to do on powerpc, no?

More or less, but...

>
> But you hated that patch, for non-obvious reasons, and are now
> introducing this new patch that is clearly non-optimal on x86.

I hated that patch because it's messy and it leaves the core lazy handling in an IMO quite regrettable state, not because I'm particularly opposed to shooting down lazies on platforms where it makes sense (powerpc and mostly x86).

As just the most obvious issue, note the kasan_check_byte() in this patch, which verifies that ->active_mm doesn't point to freed memory when the scheduler is entered.  If we flipped shoot-lazies on for x86, KASAN would blow up on exactly that access.
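
Roughly like this; a paraphrase of the idea, not the exact hunk from patch 16:

#include <linux/kasan.h>
#include <linux/sched.h>

/* Poke one byte of the lazy ->active_mm on the way into the
 * scheduler, so that with KASAN enabled a freed mm is caught right
 * here rather than showing up as corruption somewhere later. */
static inline void check_lazy_mm_alive(struct task_struct *tsk)
{
        if (tsk->active_mm)
                kasan_check_byte(tsk->active_mm);
}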

For perspective, this whole series is 23 patches.  Exactly two of them are directly related to my hazard pointer scheme: patches 16 and 22.  The rest of them are, in my opinion, cleanups and some straight-up bugfixes that are worthwhile no matter what we do with lazy mm handling per se.

>
> So I think there's some intellectual dishonesty on your part here.

I never said I hated shoot-lazies.  I didn't like the *code*.  I thought I could do better, and I still think my hazard pointer scheme is nifty and, aside from some complexity, quite nice.  It even reduces to shoot-lazies if for_each_possible_lazymm_cpu() is defined to do nothing (sketched below).  But mainly I wanted the code to be better, so I went and did it.  I could respin this series without the hazard pointers quite easily.
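
The degenerate case, as an illustration of my own (this empty definition is not the one in the series):

/* An architecture whose IPI flush already evicts lazy users can say
 * that no CPU ever needs the hazard-pointer walk; the loop then
 * disappears at compile time, i.e. plain shoot-lazies. */
#ifndef for_each_possible_lazymm_cpu
#define for_each_possible_lazymm_cpu(cpu, mm) \
        for ((cpu) = 0; 0; )    /* never iterates */
#endif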

--Andy


