Re: [PATCH v2] memcg, oom: check memcg margin for parallel oom

Yafang Shao <laoar.shao@xxxxxxxxx> · Fri, 17 Jul 2020 09:35:36 +0800

On Fri, Jul 17, 2020 at 3:53 AM David Rientjes <rientjes@xxxxxxxxxx> wrote:
>
> On Thu, 16 Jul 2020, Yafang Shao wrote:
>
> > > It's likely a misunderstanding: I wasn't necessarily concerned about
> > > showing 60MB in a memcg limited to 100MB, that part we can deal with, the
> > > concern was after dumping all of that great information that instead of
> > > getting a "Killed process..." we get a "Oh, there's memory now, just
> > > kidding about everything we just dumped" ;)
> > >
> >
> > Actually the kernel is doing it now, see bellow,
> >
> > dump_header() <<<< dump lots of information
> > __oom_kill_process
> >     p = find_lock_task_mm(victim);
> >     if (!p)
> >        return;   <<<< without killing any process.
> >
>
> Ah, this is catching an instance where the chosen process has already done
> exit_mm(), good catch -- I can find examples of this by scraping kernel
> logs from our fleet.
>
> So it appears there is precedence for dumping all the oom info but not
> actually performing any action for it and I made the earlier point that
> diagnostic information in the kernel log here is still useful.  I think it
> is still preferable that the kernel at least tell us why it didn't do
> anything, but as you mention that already happens today.
>
> Would you like to send a patch that checks for mem_cgroup_margin() here as
> well?  A second patch could make the possible inaction more visibile,
> something like "Process ${pid} (${comm}) is already exiting" for the above
> check or "Memcg ${memcg} is no longer out of memory".
>
> Another thing that these messages indicate, beyond telling us why the oom
> killer didn't actually SIGKILL anything, is that we can expect some skew
> in the memory stats that shows an availability of memory.
>

Agreed, these messages would be helpful.
I will send a patch for it.

>  >
> > > We could likely enlighten userspace about that so that we don't consider
> > > that to be an actual oom kill.  But I also very much agree that after
> > > dump_header() would be appropriate as well since the goal is to prevent
> > > unnecessary oom killing.
> > >
> > > Would you mind sending a patch to check mem_cgroup_margin() on
> > > is_memcg_oom() prior to sending the SIGKILL to the victim and printing the
> > > "Killed process..." line?  We'd need a line that says "xKB of memory now
> > > available -- suppressing oom kill" or something along those lines so
> > > userspace understands what happened.  But the memory info that it emits
> > > both for the state of the memcg and system RAM may also be helpful to
> > > understand why we got to the oom kill in the first place, which is also a
> > > good thing.
> > >
> > > I'd happy ack that patch since it would be a comprehensive solution that
> > > avoids oom kill of user processes at all costs, which is a goal I think we
> > > can all rally behind.
> >
> > I'd prefer to put dump_header() behind do_send_sig_info(), for example,
> >
> > __oom_kill_process()
> >     do_send_sig_info()
> >     dump_header() <<<< may better put it behind wake_oom_reaper(), but
> > it may loses some information to dump...
> >     pr_err("%s: Killed process %d (%s)....")
> >
>
> I agree with Michal here that dump_header() after the actual kill would no
> longer represent the state of the system (or cpuset or memcg, depending on
> context) at the time of the oom kill so it's best to dump relevant
> information before the actual kill.

-- 
Thanks
Yafang