On Fri, 17 Jul 2020, Yafang Shao wrote:

> > > Actually the kernel is doing it now, see below,
> > >
> > > dump_header()                    <<<< dumps lots of information
> > > __oom_kill_process
> > >     p = find_lock_task_mm(victim);
> > >     if (!p)
> > >         return;                  <<<< without killing any process
> > >
> >
> > Ah, this is catching an instance where the chosen process has already
> > done exit_mm(), good catch -- I can find examples of this by scraping
> > kernel logs from our fleet.
> >
> > So it appears there is precedent for dumping all the oom info but not
> > actually performing any action for it, and I made the earlier point
> > that diagnostic information in the kernel log here is still useful.  I
> > think it is still preferable that the kernel at least tell us why it
> > didn't do anything, but as you mention that already happens today.
> >
> > Would you like to send a patch that checks for mem_cgroup_margin() here
> > as well?  A second patch could make the possible inaction more visible,
> > something like "Process ${pid} (${comm}) is already exiting" for the
> > above check, or "Memcg ${memcg} is no longer out of memory".
> >
> > Another thing that these messages indicate, beyond telling us why the
> > oom killer didn't actually SIGKILL anything, is that we can expect some
> > skew in the memory stats that show an availability of memory.
> >
>
> Agreed, these messages would be helpful.
> I will send a patch for it.
>

Thanks Yafang.  We should also continue talking, in a separate thread,
about the challenges you encounter with the oom killer, either at the
system level or for memcg limit ooms.  It's clear that you are hitting
several of the issues that we have previously seen ourselves.

I could do a full audit of all our oom killer changes that may be
interesting to you, but off the top of my head:

 - A means of triggering a memcg oom through the kernel: think of sysrq+f
   but scoped to the processes attached to a memcg hierarchy.  This
   allows userspace to reliably oom kill processes on overcommitted
   systems (SIGKILL can be insufficient if we depend on oom reaping, for
   example, to make forward progress).

 - Storing the state of a memcg's memory at the time reclaim has failed
   and we must oom kill: when the memcg oom killer is disabled so that
   userspace can handle it, and userspace then triggers an oom kill
   through the kernel because it prefers an oom kill on an overcommitted
   system, we need to dump the state of the memory as it was at oom
   rather than with the stack of the explicit trigger.

 - Supplementing the memcg oom notification with an additional
   notification event on kernel oom kill: this allows users to register
   for an event that triggers when the kernel oom killer kills something
   (and keeps a count of these events available for read).

 - Adding a notion of an oom delay: on overcommitted systems, userspace
   may become unreliable or unresponsive despite our best efforts, so
   this supplements the ability to disable the oom killer for a memcg
   hierarchy with the ability to disable it only for a set period of
   time, after which the oom killer intervenes and kills something (a
   last-ditch effort).

I'd be happy to discuss any of these topics if you are interested.
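
Regarding the mem_cgroup_margin() check above, here is roughly what I
have in mind -- completely untested, just a sketch, and the exact
placement and message wording are of course up to you.  One natural
place for the re-check is under oom_lock in mem_cgroup_out_of_memory(),
before we commit to a kill, since a task that raced with us may have
already uncharged enough memory while we waited on the lock; the second
hunk adds the "already exiting" message to the bail-out you pointed at
in __oom_kill_process():

	/* mm/memcontrol.c (sketch) */
	static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg,
					     gfp_t gfp_mask, int order)
	{
		struct oom_control oc = {
			.memcg = memcg,
			.gfp_mask = gfp_mask,
			.order = order,
		};
		bool ret = true;

		if (mutex_lock_killable(&oom_lock))
			return true;

		/*
		 * A task that raced with us may have already uncharged
		 * enough memory while we waited on oom_lock; re-check
		 * the margin before killing anything.
		 */
		if (mem_cgroup_margin(memcg) >= (1 << order))
			goto unlock;

		ret = should_force_charge() || out_of_memory(&oc);
	unlock:
		mutex_unlock(&oom_lock);
		return ret;
	}

	/* mm/oom_kill.c: __oom_kill_process() (sketch) */
	p = find_lock_task_mm(victim);
	if (!p) {
		/* Say why nothing was killed instead of bailing silently */
		pr_info("%s: OOM victim %d (%s) is already exiting, skipping the kill\n",
			message, task_pid_nr(victim), victim->comm);
		put_task_struct(victim);
		return;
	}

Note that mem_cgroup_margin() is currently static to mm/memcontrol.c,
which is another reason to do the re-check there rather than in
mm/oom_kill.c.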