On Thu, 16 Jul 2020, Yafang Shao wrote: > > It's likely a misunderstanding: I wasn't necessarily concerned about > > showing 60MB in a memcg limited to 100MB, that part we can deal with, the > > concern was after dumping all of that great information that instead of > > getting a "Killed process..." we get a "Oh, there's memory now, just > > kidding about everything we just dumped" ;) > > > > Actually the kernel is doing it now, see bellow, > > dump_header() <<<< dump lots of information > __oom_kill_process > p = find_lock_task_mm(victim); > if (!p) > return; <<<< without killing any process. > Ah, this is catching an instance where the chosen process has already done exit_mm(), good catch -- I can find examples of this by scraping kernel logs from our fleet. So it appears there is precedence for dumping all the oom info but not actually performing any action for it and I made the earlier point that diagnostic information in the kernel log here is still useful. I think it is still preferable that the kernel at least tell us why it didn't do anything, but as you mention that already happens today. Would you like to send a patch that checks for mem_cgroup_margin() here as well? A second patch could make the possible inaction more visibile, something like "Process ${pid} (${comm}) is already exiting" for the above check or "Memcg ${memcg} is no longer out of memory". Another thing that these messages indicate, beyond telling us why the oom killer didn't actually SIGKILL anything, is that we can expect some skew in the memory stats that shows an availability of memory. > > > We could likely enlighten userspace about that so that we don't consider > > that to be an actual oom kill. But I also very much agree that after > > dump_header() would be appropriate as well since the goal is to prevent > > unnecessary oom killing. > > > > Would you mind sending a patch to check mem_cgroup_margin() on > > is_memcg_oom() prior to sending the SIGKILL to the victim and printing the > > "Killed process..." line? We'd need a line that says "xKB of memory now > > available -- suppressing oom kill" or something along those lines so > > userspace understands what happened. But the memory info that it emits > > both for the state of the memcg and system RAM may also be helpful to > > understand why we got to the oom kill in the first place, which is also a > > good thing. > > > > I'd happy ack that patch since it would be a comprehensive solution that > > avoids oom kill of user processes at all costs, which is a goal I think we > > can all rally behind. > > I'd prefer to put dump_header() behind do_send_sig_info(), for example, > > __oom_kill_process() > do_send_sig_info() > dump_header() <<<< may better put it behind wake_oom_reaper(), but > it may loses some information to dump... > pr_err("%s: Killed process %d (%s)....") > I agree with Michal here that dump_header() after the actual kill would no longer represent the state of the system (or cpuset or memcg, depending on context) at the time of the oom kill so it's best to dump relevant information before the actual kill.