On Thu, Jul 16, 2020 at 3:04 PM David Rientjes <rientjes@xxxxxxxxxx> wrote:
>
> On Thu, 16 Jul 2020, Yafang Shao wrote:
>
> > > But yes, we store vital
> > > information about the memcg at the time of the first oom event when the
> > > oom killer is disabled (to allow userspace to determine what the best
> > > course of action is).
> > >
> >
> > It would be better if you could upstream the features in your kernel,
> > and I think it could also help the other users.
> >
>
> Everything we've discussed so far has been proposed in the past, actually.
> I think we stress the oom killer and use it at a scale that others do not,
> so only a subset of users would find it interesting.  You are very likely
> one of that subset of users.
>
> We should certainly talk about other issues that we have run into that
> make the upstream oom killer unusable.  Are there other areas that you're
> currently focused on or having trouble with?  I'd be happy to have a
> discussion on how we have resolved a lot of its issues.
>
> > I understand what you mean by "point of no return", but that seems like
> > a workaround rather than a fix.
> > If you don't want to kill unnecessary processes, then checking the
> > memcg margin before sending SIGKILL is better, because as I said
> > before the race is most likely to happen in dump_header().
> > If you don't want to show strange OOM information like "your process
> > was oom killed and it shows usage is 60MB in a memcg limited
> > to 100MB", it is better to take a snapshot of the OOM state when it is
> > triggered and then show it later, and I think that could also apply to
> > the global OOM.
> >
>
> It's likely a misunderstanding: I wasn't necessarily concerned about
> showing 60MB in a memcg limited to 100MB, that part we can deal with; the
> concern was that after dumping all of that great information, instead of
> getting a "Killed process..." we get a "Oh, there's memory now, just
> kidding about everything we just dumped" ;)
>

Actually the kernel is doing that now, see below:

    dump_header();                      <<<< dumps lots of information
    __oom_kill_process()
        p = find_lock_task_mm(victim);
        if (!p)
            return;                     <<<< returns without killing any process

> We could likely enlighten userspace about that so that we don't consider
> that to be an actual oom kill.  But I also very much agree that after
> dump_header() would be appropriate as well since the goal is to prevent
> unnecessary oom killing.
>
> Would you mind sending a patch to check mem_cgroup_margin() on
> is_memcg_oom() prior to sending the SIGKILL to the victim and printing the
> "Killed process..." line?  We'd need a line that says "xKB of memory now
> available -- suppressing oom kill" or something along those lines so
> userspace understands what happened.  But the memory info that it emits,
> both for the state of the memcg and system RAM, may also be helpful to
> understand why we got to the oom kill in the first place, which is also a
> good thing.
>
> I'd happily ack that patch since it would be a comprehensive solution that
> avoids oom kill of user processes at all costs, which is a goal I think we
> can all rally behind.

I'd prefer to put dump_header() behind do_send_sig_info(), for example:

    __oom_kill_process()
        do_send_sig_info();
        dump_header();      <<<< it may be better to put this behind
                                 wake_oom_reaper(), but then it may lose
                                 some information to dump...
        pr_err("%s: Killed process %d (%s)...");

Because the main goal of the OOM killer is to kill a process to free pages
ASAP, to avoid a system stall or memcg stall.  We all find that
dump_header() may take a long time to finish, especially if there is a slow
console, and that long stall can badly affect the whole system, so we'd
better defer the dump.  But that should be another topic.

-- 
Thanks
Yafang
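The margin check proposed in this thread can be sketched in userspace C.
This is a simulation only, not kernel code: memcg_margin_kb() and
should_suppress_kill() are hypothetical names standing in for the kernel's
mem_cgroup_margin() and the check that would run inside __oom_kill_process()
for is_memcg_oom() cases before the SIGKILL is sent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the kernel's mem_cgroup_margin():
 * how many kB the memcg currently sits below its limit. */
long memcg_margin_kb(long limit_kb, long usage_kb)
{
	return limit_kb - usage_kb;
}

/* Model of the proposed check: if enough memory was freed between the
 * OOM detection and the kill, emit the "suppressing oom kill" line and
 * skip the SIGKILL.  wanted_kb is the size of the failed allocation. */
bool should_suppress_kill(long limit_kb, long usage_kb, long wanted_kb)
{
	long margin = memcg_margin_kb(limit_kb, usage_kb);

	if (margin >= wanted_kb) {
		printf("%ldkB of memory now available -- suppressing oom kill\n",
		       margin);
		return true;
	}
	return false;
}
```

For the 60MB-in-a-100MB-memcg example above, should_suppress_kill(102400,
61440, 4) finds a 40960kB margin and skips the kill, while a memcg still at
its limit proceeds to SIGKILL as before.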