Re: [PATCH 3/3] memcg oom: bail out from the charge path if no victim found

Yafang Shao <laoar.shao@xxxxxxxxx> · Mon, 20 Apr 2020 16:52:05 +0800

On Mon, Apr 20, 2020 at 4:13 PM Michal Hocko <mhocko@xxxxxxxxxx> wrote:
>
> On Sat 18-04-20 11:13:11, Yafang Shao wrote:
> > Setting aside manually triggered OOM: if no victim is found in a
> > system OOM, the system will be deadlocked on memory; however, if no
> > victim is found in a memcg OOM, the charge succeeds and the task keeps
> > running. This behavior in memcg OOM is not proper because it can
> > prevent the memcg from being limited.
> >
> > Take a simple example:
> >       $ cd /sys/fs/cgroup/foo/
> >       $ echo $$ > cgroup.procs
> >       $ echo 200M > memory.max
> >       $ cat memory.max
> >       209715200
> >       $ echo -1000 > /proc/$$/oom_score_adj
> > Then, let's run a memhog task in memcg foo, which allocates 1G of
> > memory and keeps running.
> >       $ /home/yafang/test/memhog &
>
> Well, echo -1000 is a privileged operation. And it has to be used with
> extreme care because you know that you are creating an unkillable
> task. So the above test is a clear example of misconfiguration.
>

Right. This issue is really triggered by the misconfiguration.
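
For reference, a memhog-like task can be sketched as below. This is a
hypothetical stand-in, not the /home/yafang/test/memhog used above (its
source isn't shown), and hog_memory() is an illustrative name. It writes
one byte per page so that every page is actually faulted in and charged
to the memcg the task belongs to.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical memhog helper: allocate 'bytes' and write one byte per
 * page so every page is faulted in (and, inside a memcg, charged to it).
 * Returns the buffer (caller frees it) or NULL if malloc fails;
 * *pages_touched receives the number of pages written. */
char *hog_memory(size_t bytes, size_t *pages_touched)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0 && pages_touched != NULL);

    char *buf = malloc(bytes);
    if (buf == NULL)
        return NULL;

    size_t n = 0;
    for (size_t off = 0; off < bytes; off += (size_t)page) {
        buf[off] = 1; /* the first write to a page faults it in */
        n++;
    }
    *pages_touched = n;
    return buf;
}
```

A main() would call hog_memory(1UL << 30, &n) and then pause() to keep
running, as in the test above. With oom_score_adj at -1000 such a task
is never selected as an OOM victim.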

> > Then memory.current will be greater than memory.max. Run the command
> > below in another shell:
> >       $ cat /sys/fs/cgroup/foo/memory.current
> >       1097228288
> > The tasks which have already allocated memory and won't allocate new
> > memory still run well. This behavior makes no sense.
> >
> > This patch improves it.
> > If no victim is found in a memcg OOM, we should force the current task
> > to wait until pages are available. That is similar to the behavior of
> > memcg1 when oom_kill_disable is set.
>
> The primary reason why we force the charge is because we _cannot_ wait
> indefinitely in the charge path because the current call chain might
> hold locks or other resources which could block a large part of the
> system. You are essentially reintroducing that behavior.
>

It seems my poor English has misled you.
The task is NOT waiting in the charge path; it is waiting at the end of
the page fault, so it doesn't hold any locks.
See the comment above mem_cgroup_oom_synchronize():

/*
 *  ...
 * Memcg supports userspace OOM handling where failed allocations must
 * sleep on a waitqueue until the userspace task resolves the
 * situation.  Sleeping directly in the charge context with all kinds
 * of locks held is not a good idea, instead we remember an OOM state
 * in the task and mem_cgroup_oom_synchronize() has to be called at
 * the end of the page fault to complete the OOM handling.
 * ...
 */
bool mem_cgroup_oom_synchronize(bool handle)
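
To make the difference concrete, here is a small user-space model of the
two choices in the charge path when the OOM killer finds no victim. It
is not kernel code: model_try_charge(), model_oom_synchronize() and the
structs are hypothetical names, and reclaim and victim selection are
left out. The forced charge lets usage grow past max; the deferred
variant only records the OOM state in the task and leaves the waiting to
the end of the page fault, as the comment above describes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model of a memory cgroup; not kernel code. */
struct model_memcg {
    unsigned long usage; /* pages currently charged */
    unsigned long max;   /* limit in pages, like memory.max */
};

/* Hypothetical per-task OOM state, remembered in the charge path. */
struct model_task {
    bool in_memcg_oom;
    struct model_memcg *oom_memcg;
};

enum charge_result { CHARGE_OK, CHARGE_FORCED, CHARGE_DEFERRED };

/* Model of the charge path. Assumes that whenever the charge would
 * exceed max, reclaim failed and the OOM killer found no victim.
 * force_on_no_victim picks between the two behaviors discussed here. */
enum charge_result model_try_charge(struct model_memcg *memcg,
                                    struct model_task *task,
                                    unsigned long nr_pages,
                                    bool force_on_no_victim)
{
    assert(memcg != NULL && task != NULL);

    if (memcg->usage + nr_pages <= memcg->max) {
        memcg->usage += nr_pages;
        return CHARGE_OK;
    }
    if (force_on_no_victim) {
        /* Forced charge: usage is allowed to exceed max. */
        memcg->usage += nr_pages;
        return CHARGE_FORCED;
    }
    /* Never sleep here (locks may be held); only record the OOM state
     * so the task can be handled once the fault has unwound. */
    task->in_memcg_oom = true;
    task->oom_memcg = memcg;
    return CHARGE_DEFERRED;
}

/* Model of the end of the page fault, where no locks are held.
 * Returns true if the task would have to wait for pages (the comment
 * above describes sleeping on a waitqueue); clears the recorded state. */
bool model_oom_synchronize(struct model_task *task)
{
    assert(task != NULL);

    if (!task->in_memcg_oom)
        return false;
    task->in_memcg_oom = false;
    task->oom_memcg = NULL;
    return true;
}
```

Charging 60 pages twice against a 100-page limit returns CHARGE_OK then
CHARGE_FORCED (usage 120) in the forced model, or CHARGE_OK then
CHARGE_DEFERRED (usage stays 60) in the deferred one.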

> Is the above example a real usecase or you have just tried a test case
> that would trigger the problem?

On my server I found that the memory usage of a container was greater
than its limit.
From dmesg I learned that there were no killable tasks because their
oom_score_adj was set to -1000.
Then I wrote this test case to reproduce the issue.
This issue can be triggered by a misconfigured oom_score_adj, and can
also be triggered by a memory leak in a task with oom_score_adj -1000.
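
A minimal sketch of how such a container could be spotted
automatically, assuming the cgroup v2 files used in the example above;
read_cgroup_u64() and memcg_over_limit() are hypothetical helpers.
memory.max reads the literal "max" when no limit is set, which is mapped
to UINT64_MAX here.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Read a single cgroup v2 value such as memory.current or memory.max.
 * The literal "max" (no limit) is mapped to UINT64_MAX.
 * Returns 0 on success, -1 on error. */
int read_cgroup_u64(const char *path, uint64_t *out)
{
    char buf[64];
    char *end;
    unsigned long long v;
    FILE *f;

    assert(path != NULL && out != NULL);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fgets(buf, sizeof(buf), f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0'; /* drop the trailing newline */

    if (strcmp(buf, "max") == 0) {
        *out = UINT64_MAX;
        return 0;
    }
    errno = 0;
    v = strtoull(buf, &end, 10);
    if (errno == ERANGE || end == buf || *end != '\0')
        return -1;
    *out = (uint64_t)v;
    return 0;
}

/* Returns 1 if the value in current_path exceeds the one in max_path,
 * 0 if not, -1 if either file cannot be read or parsed. */
int memcg_over_limit(const char *current_path, const char *max_path)
{
    uint64_t current, max;

    if (read_cgroup_u64(current_path, &current) != 0)
        return -1;
    if (read_cgroup_u64(max_path, &max) != 0)
        return -1;
    return current > max;
}
```

For the example above, memcg_over_limit() on
/sys/fs/cgroup/foo/memory.current and /sys/fs/cgroup/foo/memory.max
would return 1, since 1097228288 > 209715200.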

Thanks
Yafang