On Thu, 16 Jul 2020, Michal Hocko wrote: > > I don't think moving the mem_cgroup_margin() check to out_of_memory() > > right before printing the oom info and killing the process is a very > > invasive patch. Any strong preference against doing it that way? I think > > moving the check as late as possible to save a process from being killed > > when racing with an exiter or killed process (including perhaps current) > > has a pretty clear motivation. > > We have been through this discussion several times in the past IIRC > The conclusion has been that the allocator (charging path for > the memcg) is the one to define OOM situation. This is an inherently > racy situation as long as we are not synchronizing oom with the world, > which I believe we agree, we do not want to do. There are few exceptions > to bail out early from the oom under certain situations and the trend > was to remove some of the existing ones rather than adding new because > they had subtle side effects and were prone to lockups. > > As much as it might sound attractive to move mem_cgroup_margin resp. > last allocation attempt closer to the actual oom killing I haven't seen > any convincing data that would support that such a change would make a > big difference. select_bad_process is not a free operation as it scales > with the number of tasks in the oom domain but it shouldn't be a super > expensive. The oom reporting is by far the most expensive part of the > operation. > > That being said, really convincing data should be presented in order > to do such a change. I do not think we want to do that just in case. It's not possible to present data because we've had such a check for years in our fleet so I can't say that it has prevented X unnecessary oom kills compared to doing the check prior to calling out_of_memory(). I'm hoping that can be understood. Since Yafang is facing the same issue, and there is no significant downside to doing the mem_cgroup_margin() check prior to oom_kill_process() (or checking task_will_free_mem(current)), and it's acknowledged that it *can* prevent unnecessary oom killing, which is a very good thing, I'd like to understand why such resistance to it. Killing a user process is a serious matter. I would fully agree if the margin is only one page: it's still better to kill something off. But when a process has uncharged memory by means induced by a process waiting on oom notication, such as a userspace kill or dropping of caches from your malloc implementation, that uncharge can be quite substantial and oom killing is then unnecessary. I can refresh the patch and send it formally.