Re: [PATCH v2] memcg, oom: check memcg margin for parallel oom

David Rientjes <rientjes@xxxxxxxxxx> · Thu, 16 Jul 2020 12:53:14 -0700 (PDT)

On Thu, 16 Jul 2020, Yafang Shao wrote:

> > It's likely a misunderstanding: I wasn't necessarily concerned about
> > showing 60MB in a memcg limited to 100MB, that part we can deal with, the
> > concern was after dumping all of that great information that instead of
> > getting a "Killed process..." we get a "Oh, there's memory now, just
> > kidding about everything we just dumped" ;)
> >
> 
> Actually the kernel is doing it now, see bellow,
> 
> dump_header() <<<< dump lots of information
> __oom_kill_process
>     p = find_lock_task_mm(victim);
>     if (!p)
>        return;   <<<< without killing any process.
> 

Ah, this is catching an instance where the chosen process has already done 
exit_mm(), good catch -- I can find examples of this by scraping kernel 
logs from our fleet.

So it appears there is precedence for dumping all the oom info but not 
actually performing any action for it and I made the earlier point that 
diagnostic information in the kernel log here is still useful.  I think it 
is still preferable that the kernel at least tell us why it didn't do 
anything, but as you mention that already happens today.

Would you like to send a patch that checks for mem_cgroup_margin() here as 
well?  A second patch could make the possible inaction more visibile, 
something like "Process ${pid} (${comm}) is already exiting" for the above 
check or "Memcg ${memcg} is no longer out of memory".

Another thing that these messages indicate, beyond telling us why the oom 
killer didn't actually SIGKILL anything, is that we can expect some skew 
in the memory stats that shows an availability of memory.

 > 
> > We could likely enlighten userspace about that so that we don't consider
> > that to be an actual oom kill.  But I also very much agree that after
> > dump_header() would be appropriate as well since the goal is to prevent
> > unnecessary oom killing.
> >
> > Would you mind sending a patch to check mem_cgroup_margin() on
> > is_memcg_oom() prior to sending the SIGKILL to the victim and printing the
> > "Killed process..." line?  We'd need a line that says "xKB of memory now
> > available -- suppressing oom kill" or something along those lines so
> > userspace understands what happened.  But the memory info that it emits
> > both for the state of the memcg and system RAM may also be helpful to
> > understand why we got to the oom kill in the first place, which is also a
> > good thing.
> >
> > I'd happy ack that patch since it would be a comprehensive solution that
> > avoids oom kill of user processes at all costs, which is a goal I think we
> > can all rally behind.
> 
> I'd prefer to put dump_header() behind do_send_sig_info(), for example,
> 
> __oom_kill_process()
>     do_send_sig_info()
>     dump_header() <<<< may better put it behind wake_oom_reaper(), but
> it may loses some information to dump...
>     pr_err("%s: Killed process %d (%s)....")
> 

I agree with Michal here that dump_header() after the actual kill would no 
longer represent the state of the system (or cpuset or memcg, depending on 
context) at the time of the oom kill so it's best to dump relevant 
information before the actual kill.