Re: [PATCH 0/3] oom: TIF_MEMDIE/PF_EXITING fixes

Hugh Dickins <hughd@xxxxxxxxxx> · Sat, 12 Mar 2011 11:40:11 -0800

Hi Oleg,

On Sat, Mar 12, 2011 at 5:43 AM, Oleg Nesterov <oleg@xxxxxxxxxx> wrote:
> On 03/12, Oleg Nesterov wrote:
>>
>> On 03/11, David Rientjes wrote:
>> >
>> > On Wed, 9 Mar 2011, Andrew Morton wrote:
>> >
>> > > If Oleg's test program cause a hang with
>> > > oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch and doesn't
>> > > cause a hang without
>> > > oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch then that's a
>> > > big problem for
>> > > oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch, no?
>> > >
>> >
>> > It's a problem, but not because of
>> > oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch.
>>
>> It is, afaics. oom-killer can't ussume that a single PF_EXITING && p->mm
>> thread is going to free the memory.
>>
>> > If we don't
>> > have this patch, then we have a trivial panic when an oom kill occurs in a
>> > cpuset with no other eligible processes, the oom killed thread group
>> > leader exits
>>
>> It is not clear what "leader exits" actually mean. OK, perhaps you mean
>> its ->mm == NULL.
>>
>> > but its other threads do not and they trigger oom kills
>> > themselves. Âfor_each_process() does not iterate over these threads and so
>> > it finds no eligible threads to kill and then panics
>>
>> Could you explain what do you mean? No need to kill these threads, they
>> are already killed, we should wait until they all exit.
>>
>> > I'll look at Oleg's test case
>> > and see what can be done to fix that condition, but the answer isn't to
>> > ignore eligible threads that can be killed.
>>
>> Once again, they are already killed. Or I do not understand what you meant.
>>
>> Could you please explain the problem in more details?
>>
>>
>> Also. Could you please look at the patches I sent?
>>
>> Â Â Â [PATCH 1/1] oom_kill_task: mark every thread as TIF_MEMDIE
>> Â Â Â [PATCH v2 1/1] select_bad_process: improve the PF_EXITING check
>
> Cough. And both were not right, while_each_thread(p, t) needs the properly
> initialized "t". At least I warned they were not tested ;)
>
>> Note also the note about "p == current" check. it should be fixed too.
>
> I am resending the fixes above plus the new one.

I've spent much of the week building up to join in, but the more I
look around, the more I find to say or investigate, and therefore
never quite get to write the mail.  Let this be a placeholder, that I
probably disagree (in an amicable way!) with all of you, and maybe
I'll finally manage to collect my thoughts into mail later today.

I guess my main point will be that TIF_MEMDIE serves a number of
slightly different, even conflicting, purposes; and one of those
purposes, which present company seems to ignore repeatedly, is to
serialize access to final reserves of memory - as a comment by Nick in
select_bad_process() makes clear. (This is distinct from the
serialization to avoid OOM-killing rampage.)

We _might_ choose to abandon that, but if so, it should be a decision,
not an oversight.  So I cannot blindly agree with just setting
TIF_MEMDIE on more and more tasks, even if they share the same mm.  I
wonder if use of your find_lock_task_mm() in   select_bad_process()
might bring together my wish to continue serialization, David's wish
to avoid stupid panics, and your wish to avoid deadlocks.

Though any serialization has a risk of deadlock: we probably need to
weigh up how realistic different cases are.   Which brings me neatly
to your little pthread_create, ptrace proggy... I dare say you and
Kosaki and David know exactly what it's doing and why it's a problem,
but even after repeated skims of the ptrace manpage,  I'll admit to
not having a clue, nor the inclination to run and then debug it to
find out the answer.  Please, Oleg, would you mind very much
explaining it to me? I don't even know if the double pthread_create is
a vital part of the scheme, or just a typo.  I see it doesn't even
allocate any special memory, so I assume it leaves a PF_EXITING around
forever, but I couldn't quite see how (with PF_EXITING being set after
the tracehook_report_exit).  And I wonder if a similar case can be
constructed to deadlock the for_each_process version of
select_bad_process().

I worry more about someone holding a reference to the mm via /proc (I
see memory allocations after getting the mm).

Thanks; until later,
Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxxx  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href