Re: [PATCH v2] memcg, oom: check memcg margin for parallel oom

David Rientjes <rientjes@xxxxxxxxxx> · Wed, 15 Jul 2020 23:56:11 -0700 (PDT)

On Thu, 16 Jul 2020, Michal Hocko wrote:

> > I don't think moving the mem_cgroup_margin() check to out_of_memory() 
> > right before printing the oom info and killing the process is a very 
> > invasive patch.  Any strong preference against doing it that way?  I think 
> > moving the check as late as possible to save a process from being killed 
> > when racing with an exiter or killed process (including perhaps current) 
> > has a pretty clear motivation.
> 
> We have been through this discussion several times in the past IIRC
> The conclusion has been that the allocator (charging path for
> the memcg) is the one to define OOM situation. This is an inherently
> racy situation as long as we are not synchronizing oom with the world,
> which I believe we agree, we do not want to do. There are few exceptions
> to bail out early from the oom under certain situations and the trend
> was to remove some of the existing ones rather than adding new because
> they had subtle side effects and were prone to lockups.
> 
> As much as it might sound attractive to move mem_cgroup_margin resp.
> last allocation attempt closer to the actual oom killing I haven't seen
> any convincing data that would support that such a change would make a
> big difference. select_bad_process is not a free operation as it scales
> with the number of tasks in the oom domain but it shouldn't be a super
> expensive. The oom reporting is by far the most expensive part of the
> operation.
> 
> That being said, really convincing data should be presented in order
> to do such a change. I do not think we want to do that just in case.

It's not possible to present data because we've had such a check for years 
in our fleet so I can't say that it has prevented X unnecessary oom kills 
compared to doing the check prior to calling out_of_memory().  I'm hoping 
that can be understood.

Since Yafang is facing the same issue, and there is no significant 
downside to doing the mem_cgroup_margin() check prior to 
oom_kill_process() (or checking task_will_free_mem(current)), and it's 
acknowledged that it *can* prevent unnecessary oom killing, which is a 
very good thing, I'd like to understand why such resistance to it.

Killing a user process is a serious matter.  I would fully agree if the 
margin is only one page: it's still better to kill something off.  But 
when a process has uncharged memory by means induced by a process waiting 
on oom notication, such as a userspace kill or dropping of caches from 
your malloc implementation, that uncharge can be quite substantial and oom 
killing is then unnecessary.

I can refresh the patch and send it formally.