On 2018/10/23 21:10, Michal Hocko wrote:
> On Tue 23-10-18 13:42:46, Michal Hocko wrote:
>> On Tue 23-10-18 10:01:08, Tetsuo Handa wrote:
>>> Michal Hocko wrote:
>>>> On Mon 22-10-18 20:45:17, Tetsuo Handa wrote:
>>>>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>>>>> index e79cb59552d9..a9dfed29967b 100644
>>>>>> --- a/mm/memcontrol.c
>>>>>> +++ b/mm/memcontrol.c
>>>>>> @@ -1380,10 +1380,22 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>>>>>>  		.gfp_mask = gfp_mask,
>>>>>>  		.order = order,
>>>>>>  	};
>>>>>> -	bool ret;
>>>>>> +	bool ret = true;
>>>>>>  
>>>>>>  	mutex_lock(&oom_lock);
>>>>>> +
>>>>>> +	/*
>>>>>> +	 * Multi-threaded tasks might race with the oom_reaper and gain
>>>>>> +	 * MMF_OOM_SKIP before reaching out_of_memory, which can lead
>>>>>> +	 * to an out_of_memory failure if the task is the last one in
>>>>>> +	 * the memcg, which would be a false positive failure reported.
>>>>>> +	 */
>>>>>> +	if (tsk_is_oom_victim(current))
>>>>>> +		goto unlock;
>>>>>> +
>>>>>
>>>>> This is not wrong, but it is strange. We can use mutex_lock_killable(&oom_lock)
>>>>> so that any killed threads no longer wait for oom_lock.
>>>>
>>>> tsk_is_oom_victim is stronger because it doesn't depend on
>>>> fatal_signal_pending, which might be cleared throughout the exit process.
>>>>
>>>
>>> I still want to propose this. There is no need to be memcg-OOM specific.
>>
>> Well, I maintain what I've said [1] about simplicity and a specific fix
>> for a specific issue, especially in tricky code like this, where all
>> the consequences are far more subtle than they seem to be.
>>
>> This is obviously a matter of taste, but I don't see much point in discussing
>> this back and forth forever. Unless there is general agreement that
>> the above is less appropriate, I am willing to consider a different
>> change, but I simply do not have the energy to nitpick forever.
>>
>> [1] http://lkml.kernel.org/r/20181022134315.GF18839@xxxxxxxxxxxxxx
>
> In other words.
> Having a memcg specific fix means, well, a memcg
> maintenance burden. Like any other memcg specific oom decisions we
> already have. So are you OK with that Johannes or you would like to see
> a more generic fix which might turn out to be more complex?
>

I don't know what "that Johannes" refers to.

If you don't want to affect the SysRq-OOM and pagefault-OOM cases, are you
OK with having a global-OOM-specific fix?

 mm/page_alloc.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e2ef1c1..f59f029 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3518,6 +3518,17 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 	if (gfp_mask & __GFP_THISNODE)
 		goto out;
 
+	/*
+	 * It is possible that multi-threaded OOM victims get
+	 * task_will_free_mem(current) == false when the OOM reaper quickly
+	 * sets MMF_OOM_SKIP. But since we know that tsk_is_oom_victim() == true
+	 * tasks won't loop forever (unless it is a __GFP_NOFAIL allocation
+	 * request), we don't need to select the next OOM victim.
+	 */
+	if (tsk_is_oom_victim(current) && !(gfp_mask & __GFP_NOFAIL)) {
+		*did_some_progress = 1;
+		goto out;
+	}
 	/* Exhausted what can be done so it's blame time */
 	if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
 		*did_some_progress = 1;
-- 
1.8.3.1