On Thu, Jul 16, 2020 at 3:04 PM David Rientjes <rientjes@xxxxxxxxxx> wrote:
>
> On Thu, 16 Jul 2020, Yafang Shao wrote:
>
> > > But yes, we store vital
> > > information about the memcg at the time of the first oom event when the
> > > oom killer is disabled (to allow userspace to determine what the best
> > > course of action is).
> > >
> >
> > It would be better if you could upstream the features in your kernel,
> > and I think it could also help the other users.
> >
>
> Everything we've discussed so far has been proposed in the past, actually.
> I think we stress the oom killer and use it at a scale that others do not,
> so only a subset of users would find it interesting.  You are very likely
> one of that subset of users.
>
> We should certainly talk about other issues that we have run into that
> make the upstream oom killer unusable.  Are there other areas that you're
> currently focused on or having trouble with?  I'd be happy to have a
> discussion on how we have resolved a lot of its issues.
>
> > I understand what you mean by "point of no return", but that seems like
> > a workaround rather than a fix.
> > If you don't want to kill unnecessary processes, then checking the
> > memcg margin before sending SIGKILL is better, because as I said
> > before the race is most likely to happen in dump_header().
> > If you don't want to show strange OOM information like "your process
> > was oom killed and it shows usage is 60MB in a memcg limited
> > to 100MB", it is better to take a snapshot of the OOM state when it is
> > triggered and then show it later, and I think that could also apply to
> > the global OOM.
> >
>
> It's likely a misunderstanding: I wasn't necessarily concerned about
> showing 60MB in a memcg limited to 100MB, that part we can deal with; the
> concern was that after dumping all of that great information, instead of
> getting a "Killed process..." we get a "Oh, there's memory now, just
> kidding about everything we just dumped" ;)
>

Actually the kernel is doing that now, see below:

    dump_header();                      <<<< dumps lots of information
    __oom_kill_process()
        p = find_lock_task_mm(victim);
        if (!p)
            return;                     <<<< returns without killing any process

> We could likely enlighten userspace about that so that we don't consider
> that to be an actual oom kill.  But I also very much agree that after
> dump_header() would be appropriate as well since the goal is to prevent
> unnecessary oom killing.
>
> Would you mind sending a patch to check mem_cgroup_margin() on
> is_memcg_oom() prior to sending the SIGKILL to the victim and printing the
> "Killed process..." line?  We'd need a line that says "xKB of memory now
> available -- suppressing oom kill" or something along those lines so
> userspace understands what happened.  But the memory info that it emits,
> both for the state of the memcg and system RAM, may also be helpful to
> understand why we got to the oom kill in the first place, which is also a
> good thing.
>
> I'd happily ack that patch since it would be a comprehensive solution that
> avoids oom kill of user processes at all costs, which is a goal I think we
> can all rally behind.

I'd prefer to put dump_header() behind do_send_sig_info(), for example:

    __oom_kill_process()
        do_send_sig_info();
        dump_header();      <<<< it may be better to put this behind
                                 wake_oom_reaper(), but then it may lose
                                 some information to dump...
        pr_err("%s: Killed process %d (%s)...");

Because the main goal of the OOM killer is to kill a process to free pages
ASAP, to avoid a system stall or memcg stall.  We all find that
dump_header() may take a long time to finish, especially if there is a slow
console, and that long stall can badly affect the whole system, so we'd
better defer the dump.  But that should be another topic.

-- 
Thanks
Yafang
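The margin check proposed in this thread can be sketched in userspace C.
This is a simulation only, not kernel code: memcg_margin_kb() and
should_suppress_kill() are hypothetical names standing in for the kernel's
mem_cgroup_margin() and the check that would run inside __oom_kill_process()
for is_memcg_oom() cases before the SIGKILL is sent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the kernel's mem_cgroup_margin():
 * how many kB the memcg currently sits below its limit. */
long memcg_margin_kb(long limit_kb, long usage_kb)
{
	return limit_kb - usage_kb;
}

/* Model of the proposed check: if enough memory was freed between the
 * OOM detection and the kill, emit the "suppressing oom kill" line and
 * skip the SIGKILL.  wanted_kb is the size of the failed allocation. */
bool should_suppress_kill(long limit_kb, long usage_kb, long wanted_kb)
{
	long margin = memcg_margin_kb(limit_kb, usage_kb);

	if (margin >= wanted_kb) {
		printf("%ldkB of memory now available -- suppressing oom kill\n",
		       margin);
		return true;
	}
	return false;
}
```

For the 60MB-in-a-100MB-memcg example above, should_suppress_kill(102400,
61440, 4) finds a 40960kB margin and skips the kill, while a memcg still at
its limit proceeds to SIGKILL as before.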