Hi Oleg, On Sat, Mar 12, 2011 at 5:43 AM, Oleg Nesterov <oleg@xxxxxxxxxx> wrote: > On 03/12, Oleg Nesterov wrote: >> >> On 03/11, David Rientjes wrote: >> > >> > On Wed, 9 Mar 2011, Andrew Morton wrote: >> > >> > > If Oleg's test program cause a hang with >> > > oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch and doesn't >> > > cause a hang without >> > > oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch then that's a >> > > big problem for >> > > oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch, no? >> > > >> > >> > It's a problem, but not because of >> > oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch. >> >> It is, afaics. oom-killer can't ussume that a single PF_EXITING && p->mm >> thread is going to free the memory. >> >> > If we don't >> > have this patch, then we have a trivial panic when an oom kill occurs in a >> > cpuset with no other eligible processes, the oom killed thread group >> > leader exits >> >> It is not clear what "leader exits" actually mean. OK, perhaps you mean >> its ->mm == NULL. >> >> > but its other threads do not and they trigger oom kills >> > themselves. Âfor_each_process() does not iterate over these threads and so >> > it finds no eligible threads to kill and then panics >> >> Could you explain what do you mean? No need to kill these threads, they >> are already killed, we should wait until they all exit. >> >> > I'll look at Oleg's test case >> > and see what can be done to fix that condition, but the answer isn't to >> > ignore eligible threads that can be killed. >> >> Once again, they are already killed. Or I do not understand what you meant. >> >> Could you please explain the problem in more details? >> >> >> Also. Could you please look at the patches I sent? >> >> Â Â Â [PATCH 1/1] oom_kill_task: mark every thread as TIF_MEMDIE >> Â Â Â [PATCH v2 1/1] select_bad_process: improve the PF_EXITING check > > Cough. And both were not right, while_each_thread(p, t) needs the properly > initialized "t". At least I warned they were not tested ;) > >> Note also the note about "p == current" check. it should be fixed too. > > I am resending the fixes above plus the new one. I've spent much of the week building up to join in, but the more I look around, the more I find to say or investigate, and therefore never quite get to write the mail. Let this be a placeholder, that I probably disagree (in an amicable way!) with all of you, and maybe I'll finally manage to collect my thoughts into mail later today. I guess my main point will be that TIF_MEMDIE serves a number of slightly different, even conflicting, purposes; and one of those purposes, which present company seems to ignore repeatedly, is to serialize access to final reserves of memory - as a comment by Nick in select_bad_process() makes clear. (This is distinct from the serialization to avoid OOM-killing rampage.) We _might_ choose to abandon that, but if so, it should be a decision, not an oversight. So I cannot blindly agree with just setting TIF_MEMDIE on more and more tasks, even if they share the same mm. I wonder if use of your find_lock_task_mm() in select_bad_process() might bring together my wish to continue serialization, David's wish to avoid stupid panics, and your wish to avoid deadlocks. Though any serialization has a risk of deadlock: we probably need to weigh up how realistic different cases are. Which brings me neatly to your little pthread_create, ptrace proggy... I dare say you and Kosaki and David know exactly what it's doing and why it's a problem, but even after repeated skims of the ptrace manpage, I'll admit to not having a clue, nor the inclination to run and then debug it to find out the answer. Please, Oleg, would you mind very much explaining it to me? I don't even know if the double pthread_create is a vital part of the scheme, or just a typo. I see it doesn't even allocate any special memory, so I assume it leaves a PF_EXITING around forever, but I couldn't quite see how (with PF_EXITING being set after the tracehook_report_exit). And I wonder if a similar case can be constructed to deadlock the for_each_process version of select_bad_process(). I worry more about someone holding a reference to the mm via /proc (I see memory allocations after getting the mm). Thanks; until later, Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxxx For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href