On Tue 02-08-16 19:32:45, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > > > > It is possible that a user creates a process with 10000 threads
> > > > > and let that process be OOM-killed. Then, this patch allows 10000 threads
> > > > > to start consuming memory reserves after they left exit_mm(). OOM victims
> > > > > are not the only threads who need to allocate memory for termination. Non
> > > > > OOM victims might need to allocate memory at exit_task_work() in order to
> > > > > allow OOM victims to make forward progress.
> > > >
> > > > this might be possible but unlike the regular exiting tasks we do
> > > > reclaim oom victim's memory in the background. So while they can consume
> > > > memory reserves we should also give some (and arguably much more) memory
> > > > back. The reserves are there to expedite the exit.
> > >
> > > Background reclaim does not occur on CONFIG_MMU=n kernels. But this patch
> > > also affects CONFIG_MMU=n kernels. If a process with two threads was
> > > OOM-killed and one thread consumed too much memory after it left exit_mm()
> > > before the other thread sets MMF_OOM_SKIP on their mm by returning from
> > > exit_aio() etc. in __mmput() from mmput() from exit_mm(), this patch
> > > introduces a new possibility to OOM livelock. I think it is wild to assume
> > > that "CONFIG_MMU=n kernels can OOM livelock even without this patch. Thus,
> > > let's apply this patch even though this patch might break the balance of
> > > OOM handling in CONFIG_MMU=n kernels."
> >
> > As I've said if you have strong doubts about the patch I can drop it for
> > now. I do agree that nommu really matters here, though.
>
> OK. Then, for now let's postpone only the oom_killer_disable() to later
> rather than postpone the exit_oom_victim() to later.

That would require other changes (basically making oom_killer_disable()
independent of TIF_MEMDIE), which I think don't belong to this pile.
So I would rather sacrifice this patch instead, and it will not be part of
the v2.

[...]
> > > > > I think that allocations from
> > > > > do_exit() are important for terminating cleanly (from the point of view of
> > > > > filesystem integrity and kernel object management) and such allocations
> > > > > should not be given up simply because ALLOC_NO_WATERMARKS allocations
> > > > > failed.
> > > >
> > > > We are talking about a fatal condition when OOM killer forcefully kills
> > > > a task. Chances are that the userspace leaves so much state behind that
> > > > a manual cleanup would be necessary anyway. Depleting the memory
> > > > reserves is not nice but I really believe that this particular patch
> > > > doesn't make the situation really much worse than before.
> > >
> > > I'm not talking about inconsistency in userspace programs. I'm talking
> > > about inconsistency of objects managed by kernel (e.g. failing to drop
> > > references) caused by allocation failures.
> >
> > That would be a bug on its own, no?
>
> Right, but memory allocations after exit_mm() from do_exit() (e.g.
> exit_task_work()) might assume (or depend on) the "too small to fail"
> memory-allocation rule where small GFP_FS allocations won't fail unless
> TIF_MEMDIE is set, but this patch can unexpectedly break that rule if
> they assume (or depend on) that rule.

A silent dependency on nofail semantics without GFP_NOFAIL is still a bug.
Full stop. I really fail to see why you are still arguing about that.

[...]
-- 
Michal Hocko
SUSE Labs