On Tue, Jul 14, 2020 at 8:37 PM Michal Hocko <mhocko@xxxxxxxxxx> wrote:
>
> On Tue 14-07-20 08:08:32, Yafang Shao wrote:
> > Commit 7775face2079 ("memcg: killed threads should not invoke memcg OOM
> > killer") resolves the problem of different threads in a multi-threaded
> > task doing parallel memcg oom, but it doesn't solve the problem of
> > different tasks doing parallel memcg oom.
> >
> > It may happen that many different tasks in the same memcg are waiting
> > on the oom_lock at the same time. If one of them has already made
> > progress and freed enough available memory, the others don't need to
> > trigger the oom killer again. Checking the memcg margin after taking
> > the oom_lock helps achieve that.
>
> While the changelog makes sense, I believe it can be improved. I would
> use something like the following. Feel free to use its parts.
>
> "
> Memcg oom killer invocation is synchronized by the global oom_lock.
> Tasks sleep on the lock while somebody is selecting the victim, and they
> potentially race with the oom_reaper releasing the victim's memory.
> This can result in a pointless oom killer invocation because a waiter
> might be racing with the oom_reaper:
>
>   P1                    oom_reaper              P2
>                         oom_reap_task           mutex_lock(oom_lock)
>                                                 out_of_memory
>                                                 # no victim because we
>                                                 # have one already
>                         __oom_reap_task_mm      mutex_unlock(oom_lock)
>   mutex_lock(oom_lock)
>                         set MMF_OOM_SKIP
>   select_bad_process
>   # finds a new victim
>
> The page allocator prevents this race by trying to allocate again after
> the lock can be acquired (in __alloc_pages_may_oom), which acts as a
> last minute check. Moreover, the page allocator doesn't block on the
> oom_lock at all and simply retries the whole reclaim process.
>
> The memcg oom killer should do the last minute check as well. Call
> mem_cgroup_margin to do that. A trylock on the oom_lock could be done
> as well, but this doesn't seem to be necessary at this stage.
> "
>
> > Suggested-by: Michal Hocko <mhocko@xxxxxxxxxx>
> > Signed-off-by: Yafang Shao <laoar.shao@xxxxxxxxx>
> > Cc: Michal Hocko <mhocko@xxxxxxxxxx>
> > Cc: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
> > Cc: David Rientjes <rientjes@xxxxxxxxxx>
> > ---
> >  mm/memcontrol.c | 19 +++++++++++++++++--
> >  1 file changed, 17 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 1962232..df141e1 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1560,16 +1560,31 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >  		.gfp_mask = gfp_mask,
> >  		.order = order,
> >  	};
> > -	bool ret;
> > +	bool ret = true;
> >
> >  	if (mutex_lock_killable(&oom_lock))
> >  		return true;
> > +
> >  	/*
> >  	 * A few threads which were not waiting at mutex_lock_killable() can
> >  	 * fail to bail out. Therefore, check again after holding oom_lock.
> >  	 */
> > -	ret = should_force_charge() || out_of_memory(&oc);
> > +	if (should_force_charge())
> > +		goto out;
> > +
> > +	/*
> > +	 * Different tasks may be doing parallel oom, so after taking the
> > +	 * oom_lock each task should check the memcg margin again to see
> > +	 * whether another task has already made progress.
> > +	 */
> > +	if (mem_cgroup_margin(memcg) >= (1 << order))
> > +		goto out;
>
> Is there any reason why you simply haven't done this? (+ your comment
> which is helpful).
>

No strong reason. I just think that threads of a multi-threaded task are
more likely to do parallel OOM, so I checked that first.
I can change it as you suggested below, as it is simpler.
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 248e6cad0095..2c176825efe3 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1561,15 +1561,21 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  		.order = order,
>  		.chosen_points = LONG_MIN,
>  	};
> -	bool ret;
> +	bool ret = true;
>
> -	if (mutex_lock_killable(&oom_lock))
> +	if (!mutex_trylock(&oom_lock))
>  		return true;
> +
> +	if (mem_cgroup_margin(memcg) >= (1 << order))
> +		goto unlock;
> +
>  	/*
>  	 * A few threads which were not waiting at mutex_lock_killable() can
>  	 * fail to bail out. Therefore, check again after holding oom_lock.
>  	 */
>  	ret = should_force_charge() || out_of_memory(&oc);
> +
> +unlock:
>  	mutex_unlock(&oom_lock);
>  	return ret;
> }

> > +
> > +	ret = out_of_memory(&oc);
> > +
> > +out:
> >  	mutex_unlock(&oom_lock);
> > +
> >  	return ret;
> >  }
> >
> > --
> > 1.8.3.1
>
> --
> Michal Hocko
> SUSE Labs


--
Thanks
Yafang
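
[For readers following the thread: both diffs above implement the same
"last minute check" pattern -- take oom_lock, then re-check whether a
previous lock holder already made enough progress before invoking the
oom killer. The userspace model below is a minimal sketch of just that
pattern, not kernel code: margin, fake_oom_kill, charge, NTHREADS and
ORDER are made-up names for illustration. Run with: gcc -pthread; the
kill message should print exactly once no matter how many threads race.]

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NTHREADS 4
#define ORDER    0                 /* single-page charges, as in (1 << order) */

static pthread_mutex_t oom_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_long margin;         /* free pages left in the fake "memcg" */

/* Pretend the victim's death returned plenty of pages. */
static void fake_oom_kill(void)
{
	atomic_fetch_add(&margin, 512);
	printf("oom killer invoked once\n");
}

static void *charge(void *arg)
{
	(void)arg;

	/* Fast path: the charge fit below the limit, nothing to do. */
	if (atomic_fetch_sub(&margin, 1 << ORDER) >= (1 << ORDER))
		return NULL;

	pthread_mutex_lock(&oom_lock);
	/*
	 * Last minute check: a previous lock holder may already have
	 * freed enough memory, so don't invoke the killer again. This
	 * mirrors the mem_cgroup_margin(memcg) >= (1 << order) bail-out.
	 */
	if (atomic_load(&margin) < (1 << ORDER))
		fake_oom_kill();
	pthread_mutex_unlock(&oom_lock);
	return NULL;
}

int main(void)
{
	pthread_t tids[NTHREADS];

	atomic_store(&margin, 2);  /* nearly exhausted: force contention */

	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&tids[i], NULL, charge, NULL);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(tids[i], NULL);
	return 0;
}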