On Tue, Jul 14, 2020 at 8:37 PM Michal Hocko <mhocko@xxxxxxxxxx> wrote:
>
> On Tue 14-07-20 08:08:32, Yafang Shao wrote:
> > Commit 7775face2079 ("memcg: killed threads should not invoke memcg OOM
> > killer") resolves the problem of different threads in a multi-threaded
> > task doing parallel memcg oom, but it doesn't solve the problem of
> > different tasks doing parallel memcg oom.
> >
> > It may happen that many different tasks in the same memcg are waiting
> > on the oom_lock at the same time. If one of them has already made
> > progress and freed enough available memory, the others don't need to
> > trigger the oom killer again. Checking the memcg margin after taking
> > the oom_lock helps achieve that.
>
> While the changelog makes sense, I believe it can be improved. I would
> use something like the following. Feel free to use its parts.
>
> "
> Memcg oom killer invocation is synchronized by the global oom_lock.
> Tasks sleep on the lock while somebody is selecting the victim, and they
> potentially race with the oom_reaper releasing the victim's memory.
> This can result in a pointless oom killer invocation because a waiter
> might be racing with the oom_reaper:
>
>   P1                    oom_reaper              P2
>                         oom_reap_task           mutex_lock(oom_lock)
>                                                 out_of_memory
>                                                 # no victim because we
>                                                 # have one already
>                         __oom_reap_task_mm      mutex_unlock(oom_lock)
>   mutex_lock(oom_lock)
>                         set MMF_OOM_SKIP
>   select_bad_process
>   # finds a new victim
>
> The page allocator prevents this race by trying to allocate again after
> the lock can be acquired (in __alloc_pages_may_oom), which acts as a
> last minute check. Moreover, the page allocator doesn't block on the
> oom_lock at all and simply retries the whole reclaim process.
>
> The memcg oom killer should do the last minute check as well. Call
> mem_cgroup_margin to do that. A trylock on the oom_lock could be done
> as well, but this doesn't seem to be necessary at this stage.
> "
>
> > Suggested-by: Michal Hocko <mhocko@xxxxxxxxxx>
> > Signed-off-by: Yafang Shao <laoar.shao@xxxxxxxxx>
> > Cc: Michal Hocko <mhocko@xxxxxxxxxx>
> > Cc: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
> > Cc: David Rientjes <rientjes@xxxxxxxxxx>
> > ---
> >  mm/memcontrol.c | 19 +++++++++++++++++--
> >  1 file changed, 17 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 1962232..df141e1 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1560,16 +1560,31 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >  		.gfp_mask = gfp_mask,
> >  		.order = order,
> >  	};
> > -	bool ret;
> > +	bool ret = true;
> >
> >  	if (mutex_lock_killable(&oom_lock))
> >  		return true;
> > +
> >  	/*
> >  	 * A few threads which were not waiting at mutex_lock_killable() can
> >  	 * fail to bail out. Therefore, check again after holding oom_lock.
> >  	 */
> > -	ret = should_force_charge() || out_of_memory(&oc);
> > +	if (should_force_charge())
> > +		goto out;
> > +
> > +	/*
> > +	 * Different tasks may be doing parallel oom, so after taking the
> > +	 * oom_lock each task should check the memcg margin again to see
> > +	 * whether another task has already made progress.
> > +	 */
> > +	if (mem_cgroup_margin(memcg) >= (1 << order))
> > +		goto out;
>
> Is there any reason why you simply haven't done this? (+ your comment
> which is helpful).
>

No strong reason. I just think that threads of a multi-threaded task are
more likely to do parallel OOM, so I checked that first.
I can change it as you suggested below, as it is simpler.
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 248e6cad0095..2c176825efe3 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1561,15 +1561,21 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  		.order = order,
>  		.chosen_points = LONG_MIN,
>  	};
> -	bool ret;
> +	bool ret = true;
>
> -	if (mutex_lock_killable(&oom_lock))
> +	if (!mutex_trylock(&oom_lock))
>  		return true;
> +
> +	if (mem_cgroup_margin(memcg) >= (1 << order))
> +		goto unlock;
> +
>  	/*
>  	 * A few threads which were not waiting at mutex_lock_killable() can
>  	 * fail to bail out. Therefore, check again after holding oom_lock.
>  	 */
>  	ret = should_force_charge() || out_of_memory(&oc);
> +
> +unlock:
>  	mutex_unlock(&oom_lock);
>  	return ret;
> }

> > +
> > +	ret = out_of_memory(&oc);
> > +
> > +out:
> >  	mutex_unlock(&oom_lock);
> > +
> >  	return ret;
> >  }
> >
> > --
> > 1.8.3.1
>
> --
> Michal Hocko
> SUSE Labs


--
Thanks
Yafang
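
[For readers following the thread: both diffs above implement the same
"last minute check" pattern -- take oom_lock, then re-check whether a
previous lock holder already made enough progress before invoking the
oom killer. The userspace model below is a minimal sketch of just that
pattern, not kernel code: margin, fake_oom_kill, charge, NTHREADS and
ORDER are made-up names for illustration. Run with: gcc -pthread; the
kill message should print exactly once no matter how many threads race.]

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NTHREADS 4
#define ORDER    0                 /* single-page charges, as in (1 << order) */

static pthread_mutex_t oom_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_long margin;         /* free pages left in the fake "memcg" */

/* Pretend the victim's death returned plenty of pages. */
static void fake_oom_kill(void)
{
	atomic_fetch_add(&margin, 512);
	printf("oom killer invoked once\n");
}

static void *charge(void *arg)
{
	(void)arg;

	/* Fast path: the charge fit below the limit, nothing to do. */
	if (atomic_fetch_sub(&margin, 1 << ORDER) >= (1 << ORDER))
		return NULL;

	pthread_mutex_lock(&oom_lock);
	/*
	 * Last minute check: a previous lock holder may already have
	 * freed enough memory, so don't invoke the killer again. This
	 * mirrors the mem_cgroup_margin(memcg) >= (1 << order) bail-out.
	 */
	if (atomic_load(&margin) < (1 << ORDER))
		fake_oom_kill();
	pthread_mutex_unlock(&oom_lock);
	return NULL;
}

int main(void)
{
	pthread_t tids[NTHREADS];

	atomic_store(&margin, 2);  /* nearly exhausted: force contention */

	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&tids[i], NULL, charge, NULL);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(tids[i], NULL);
	return 0;
}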