On 9/2/21 5:46 PM, Nicholas Piggin wrote:
> Excerpts from Andrew Morton's message of September 3, 2021 8:53 am:
>> On Thu, 2 Sep 2021 15:50:03 -0700 Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>>
>>> On Thu, Sep 2, 2021 at 3:29 PM Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>>>>
>>>> This pile is:
>>>>
>>>> Nacked-by: Andy Lutomirski <luto@xxxxxxxxxx>
>>>
>>> Can you specify exactly the range you want me to drop?
>>>
>>> I assume it's the four patches 117-120, ie
>>>
>>>   lazy tlb: introduce lazy mm refcount helper functions
>>>   lazy tlb: allow lazy tlb mm refcounting to be configurable
>>>   lazy tlb: shoot lazies, a non-refcounting lazy tlb option
>>>   powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN
>>>
>>> but I just want to double-check before I do surgery on that series.
>>
>> Yes, those 4.
>>
>> Sorry, I missed that email thread...
>
> That's not reasonable. Andy has had complete misunderstandings about the
> series which seems to stem from x86's horrible hacks that have gone in
> has confused him.

The horrible hacks in question are almost exclusively in core code. Here's a brief summary of the situation.

There's a messy interaction between mmget()/mmdrop() and membarrier: membarrier currently depends on certain mmget() and mmdrop() calls being full barriers. You keep membarrier working by putting an ifdef'd smp_mb() in the core scheduler. I clean up the code so that it works independently of that smp_mb(), which saves the cost of the unconditional barrier for programs that don't use membarrier.

Your series adds an option, MMU_LAZY_TLB_REFCOUNT=n, for architectures to opt out of lazy TLB refcounting. This is simply wrong. Right now, the core scheduler provides current->active_mm and guarantees that it always points to a live (possibly mm_users == 0, but definitely not freed) mm_struct. With MMU_LAZY_TLB_REFCOUNT=n, current->active_mm still exists and is still updated, but it may point to freed memory (see the sketch below). I consider this unacceptable. A comment says "This can be disabled if the architecture ensures no CPUs are using an mm as a "lazy tlb" beyond its final refcount" -- that's nice, but saying "well, if you do this, you have to make sure you don't accidentally dereference that juicy dangling pointer we give you" is, in my book, a poor justification.

I have no particular objection to the actual shoot-lazies part, except insofar as I think we can do even better (e.g. my patch). But 90% of the complexity of my WIP series is cleaning up the mess so that we can have a maintainable lazy mm mechanism, instead of expanding the current hard-to-maintain code into three separate possible modes. Maybe I'm holding my own patches to an excessively high standard.

> My series doesn't affect x86 at all and it's no reason why Andy's series
> to improve x86 can't be merged later. But that half finished series he
> keeps threatening with has been sitting there for almost a year now and
> it's gone nowhere, while there have been no unresolved technical
> objections to mine, it works, it's simple and small.

My series barely touches x86. The only "hack" is that x86 may have a CPU that has ->mm == NULL, ->active_mm != NULL, CR3 pointing to the init pgd, and its bit in mm_cpumask clear. I don't see why this is a problem other than being somewhat unusual.
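To make the dangling-pointer objection concrete, the shape of the change is roughly this (a simplified sketch, not the actual patch; the helper names and config symbol are my reading of the series): when the refcount option is off, the lazy-TLB grab/drop helpers compile to no-ops, so nothing pins the mm_struct that current->active_mm still points at.

#include <linux/sched/mm.h>	/* mmgrab(), mmdrop() */

static inline void mmgrab_lazy_tlb(struct mm_struct *mm)
{
	/* A lazy user keeps mm_count elevated only if refcounting is on. */
	if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
		mmgrab(mm);
	/* Otherwise the arch promises to evict lazy users before the mm is freed. */
}

static inline void mmdrop_lazy_tlb(struct mm_struct *mm)
{
	if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
		mmdrop(mm);	/* may free the mm_struct right here */
}

The context-switch path then calls these instead of raw mmgrab()/mmdrop() on the lazy active_mm, and that is exactly where a stale pointer ends up parked in current->active_mm on an architecture that opts out.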
But x86 bare metal, like every architecture that can only flush the TLB with an IPI, can shoot lazies very efficiently, since it already shoots the lazies when tearing down page tables (a sketch of what I mean is at the bottom of this mail). Yet actually enabling the config option with this series applied will result in ->active_mm pointing to freed memory. Ick.

> I've kept trying to offer to help Andy with reviewing his stuff or fix
> the horrible x86 hacks, but nothing.

I haven't finished it yet. Sorry.
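(For reference, the shoot-lazies mechanism I'm referring to above is roughly the following -- a sketch of the idea as I understand it, not the actual patch: before the mm is finally freed, IPI every CPU that might still hold it as a lazy TLB mm and make it switch to init_mm.)

#include <linux/sched.h>
#include <linux/sched/mm.h>
#include <linux/smp.h>
#include <asm/mmu_context.h>	/* switch_mm() */

static void do_shoot_lazy_tlb(void *arg)
{
	struct mm_struct *mm = arg;

	/* Runs in IPI context on each CPU that may be using mm lazily. */
	if (current->active_mm == mm) {
		WARN_ON_ONCE(current->mm);	/* a lazy user has no real ->mm */
		current->active_mm = &init_mm;
		switch_mm(mm, &init_mm, current);
	}
}

static void shoot_lazy_tlbs(struct mm_struct *mm)
{
	/*
	 * On an IPI-flush architecture these CPUs were interrupted anyway
	 * when the page tables were torn down, so the extra IPI is cheap.
	 */
	on_each_cpu_mask(mm_cpumask(mm), do_shoot_lazy_tlb, mm, 1);
}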