Re: [PATCH 3/3] memcg oom: bail out from the charge path if no victim found

Michal Hocko <mhocko@xxxxxxxxxx> · Mon, 20 Apr 2020 11:14:52 +0200

On Mon 20-04-20 16:52:05, Yafang Shao wrote:
> On Mon, Apr 20, 2020 at 4:13 PM Michal Hocko <mhocko@xxxxxxxxxx> wrote:
> >
> > On Sat 18-04-20 11:13:11, Yafang Shao wrote:
[...]
> > > This patch is to improve it.
> > > If no victim found in memcg oom, we should force the current task to
> > > wait until there's available pages. That is similar with the behavior in
> > > memcg1 when oom_kill_disable is set.
> >
> > The primary reason why we force the charge is because we _cannot_ wait
> > indefinitely in the charge path because the current call chain might
> > hold locks or other resources which could block a large part of the
> > system. You are essentially reintroducing that behavior.
> >
> 
> Seems my poor English misleads you ?
> The task is NOT waiting in the charge path, while it is really waiting
> at the the end of the page fault, so it doesn't hold any locks.

How is that supposed to work? Sorry I didn't really study your patch
very closely because it doesn't apply on the current Linus' tree and
your previous 2 patches have reshuffled the code so it is not really
trivial to have a good picture of the overall logic change.

> See the comment above mem_cgroup_oom_synchronize()

Anyway mem_cgroup_oom_synchronize shouldn't really trigger unless the
oom handling is disabled (aka handed over to the userspace). All other
paths should handle the oom in the charge path. Please have a look at 
29ef680ae7c2 ("memcg, oom: move out_of_memory back to the charge path")
for more background and motivation.

mem_cgroup_oom_synchronize was a workaround for deadlocks and the side
effect was that all other charge paths outside of #PF were failing
allocations prematurely and that had an effect to user space.

> > Is the above example a real usecase or you have just tried a test case
> > that would trigger the problem?
> 
> On my server I found the memory usage of a container was greater than
> the limit of it.
> From the dmesg I know there's no killable tasks becasue the
> oom_score_adj is set with -1000.

I would really recommend to address this problem in the userspace
configuration. Either by increasing the memory limit or fixing the
oom disabled userspace to not consume that much of a memory.

> Then I tried this test case to produce this issue.
> This issue can be triggerer by the misconfiguration of oom_score_adj,
> and can also be tiggered by a memoy leak in the task  with
> oom_score_adj -1000.

Please note that there is not much the system can do about oom disabled
tasks that leak memory. Even the global case would slowly kill all other
userspace until it panics due to no eligible tasks. The oom_score_adj
has a very strong consequences. Do not use it without a very careful
consideration.
-- 
Michal Hocko
SUSE Labs