On 6/14/21 5:55 PM, Nicholas Piggin wrote:
> Excerpts from Andy Lutomirski's message of June 15, 2021 2:20 am:
>> Replying to several emails at once...
>>
>
> So the only documentation relating to the current active_mm value or
> refcounting is that it may not match what the x86 specific code is
> doing?
>
> All this complexity you accuse me of adding is entirely in x86 code.
> On other architectures, it's very simple and understandable, and
> documented. I don't know how else to explain this.

And the docs you referred me to will be *wrong* with your patches
applied. They are your patches, and they break the semantics.

>
>>>>>
>>>>>> With your patch applied:
>>>>>>
>>>>>> To support all that, the "struct mm_struct" now has two counters: a
>>>>>> "mm_users" counter that is how many "real address space users" there are,
>>>>>> and a "mm_count" counter that is the number of "lazy" users (ie anonymous
>>>>>> users) plus one if there are any real users.
>>>>>>
>>>>>> isn't even true any more.
>>>>>
>>>>> Well yeah but the active_mm concept hasn't changed. The refcounting
>>>>> change is hopefully reasonably documented?
>>
>> active_mm is *only* refcounting in the core code. See below.
>
> It's just not. It's passed in to switch_mm. Most architectures except
> for x86 require this.
>

Sorry, I was obviously blatantly wrong. Let me say it differently.
active_mm does two things:

1. It keeps an mm alive via a refcounting scheme.

2. It passes a parameter to switch_mm() to remind the arch code what
   the most recently switched-to mm was.

#2 is basically useless. An architecture can handle *that* with a
percpu variable and two lines of code.

If you are getting rid of functionality #1 in the core code via a new
arch opt-out, please get rid of #2 as well. *Especially* because, when
the arch asks the core code to stop refcounting active_mm, there is
absolutely nothing guaranteeing that the parameter that the core code
will pass to switch_mm() points to memory that hasn't been freed and
reused for another purpose.

>>>>> I might not have been clear. Core code doesn't need active_mm if
>>>>> active_mm somehow goes away. I'm saying active_mm can't go away because
>>>>> it's needed to support (most) archs that do lazy tlb mm switching.
>>>>>
>>>>> The part I don't understand is when you say it can just go away. How?
>>
>> #ifdef CONFIG_MMU_TLB_REFCOUNT
>> struct mm_struct *active_mm;
>> #endif
>
> Thanks for returning the snark.

That wasn't intended to be snark. It was a literal suggestion, and, in
fact, it's *exactly* what I'm asking you to do to fix your patches.

>> I don't understand what contract you're talking about. The core code
>> maintains an active_mm counter and keeps that mm_struct from
>> disappearing. That's *it*. The core code does not care that active_mm
>> is active, and x86 provides evidence of that -- on x86,
>> current->active_mm may well be completely unused.
>
> I already acknowledged archs can do their own thing under the covers if
> they want.

No. I am *not* going to write x86 patches that use your feature in a
way that will cause the core code to pass around a complete garbage
pointer to an mm_struct that is completely unreferenced and may well be
deleted. Doing something private in arch code is one thing. Doing
something that causes basic common sense to be violated in core code is
another thing entirely.
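For what it's worth, here is roughly the "percpu variable and two lines
of code" I have in mind above. This is an untested sketch only, and the
names arch_prev_lazy_mm and arch_exchange_lazy_mm are made up for
illustration:

#include <linux/percpu.h>
#include <linux/mm_types.h>

/* Arch-private; core code never needs to know this exists. */
static DEFINE_PER_CPU(struct mm_struct *, arch_prev_lazy_mm);

/*
 * Called from the arch context-switch path (preemption already
 * disabled) instead of having core code pass prev->active_mm in.
 */
static inline struct mm_struct *arch_exchange_lazy_mm(struct mm_struct *next)
{
	struct mm_struct *prev = this_cpu_read(arch_prev_lazy_mm);

	this_cpu_write(arch_prev_lazy_mm, next);
	return prev;
}

Whether the pointer stored there is refcounted or merely shot down
before the mm goes away is then entirely the arch's problem.
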
>>
>> static inline void do_switch_mm(struct task_struct *prev_task, ...)
>> {
>> #ifdef CONFIG_MMU_TLB_REFCOUNT
>> 	switch_mm(...);
>> #else
>> 	switch_mm(fewer parameters);
>> 	/* or pass NULL or whatever. */
>> #endif
>> }
>
> And prev_task comes from active_mm, ergo core code requires the concept
> of active_mm.

I don't see why this concept is hard. We are literally quibbling about
this single line of core code in kernel/sched/core.c:

	switch_mm_irqs_off(prev->active_mm, next->mm, next);

This is not rocket science. There are any number of ways to fix it.
For example:

#ifdef CONFIG_MMU_TLB_REFCOUNT
	switch_mm_irqs_off(prev->active_mm, next->mm, next);
#else
	switch_mm_irqs_off(NULL, next->mm, next);
#endif

If you don't like the NULL, then make the signature depend on the
config option.

What you may not do is what your patch actually does:

	switch_mm_irqs_off(random invalid pointer, next->mm, next);

Now maybe it works because powerpc's lifecycle rules happen to keep
active_mm alive, but I haven't verified it. x86's lifecycle rules *do
not*.

>>>
>>> That's understandable, but please redirect your objections to the proper
>>> place. git blame suggests 3d28ebceaffab.
>>
>> Thanks for the snark.
>
> Is it not true? I don't mean that one patch causing all the x86
> complexity or even making the situation worse itself. But you seem to be
> asking my series to do things that really apply to the x86 changes over
> the past few years that got us here.

With just my patch from 4.15 applied, task->active_mm points to an
actual mm_struct, and that mm_struct will not be freed early. If I opt
x86 into your patch's new behavior, then task->active_mm may be freed.

akpm, please drop this series until it's fixed. It's a core change to
better support arch usecases, but it's unnecessarily fragile, and there
is already an arch maintainer pointing out that it's inadequate to
robustly support arch usecases. There is no reason to merge it in its
present state.
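
P.S. Since I mention it above: here is roughly what "make the signature
depend on the config option" could look like. This is an untested
sketch of a possible declaration, not something from the actual series:

/* In some shared header; the types only need forward declarations here. */
struct mm_struct;
struct task_struct;

#ifdef CONFIG_MMU_TLB_REFCOUNT
/* Core code keeps prev->active_mm alive, so it is safe to pass it down. */
void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
			struct task_struct *tsk);
#else
/* No refcounted lazy mm: there is no prev mm pointer to pass at all. */
void switch_mm_irqs_off(struct mm_struct *next, struct task_struct *tsk);
#endif

The point being that, under the opt-out, a stale prev pointer can't
even be expressed at the call site.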