Re: [PATCH] memcg: Do not hang on OOM when killed by userspace OOM access to memory reserves

David Rientjes <rientjes@xxxxxxxxxx> · Mon, 20 Jan 2014 22:13:21 -0800 (PST)

On Thu, 16 Jan 2014, Michal Hocko wrote:

> > The heuristic may have existed for ages, but the proposed memcg 
> > configuration for preserving memory such that userspace oom handlers may 
> > run such as
> > 
> > 			 _____root______
> > 			/		\
> > 		    user		 oom
> > 		   /	\		/   \
> > 		   A	B	 	a   b
> > 
> > where user/memory.limit_in_bytes == [amount of present RAM] + 
> > oom/memory.limit_in_bytes - [some fudge] causes all bypasses to be 
> > problematic, including Johannes' buggy bypass for charges in memcgs with 
> > pending memcgs that has since been fixed after I identified it.  This 
> > bypass is included.  Processes attached to "a" and "b" are userspace oom 
> > handlers for processes attached to "A" and "B", respectively.
> > 
> > The amount of memory you're talking about is proportional to the number of 
> > processes that have pending SIGKILLs (and now those with PF_EXITING set), 
> > the former of which is obviously more concerning since they could be 
> > charging memory at any point in the kernel that would succeed. 
> 
> I understand your concerns. Yes, excessive charges might be dangerous. I
> haven't dismissed that when you mentioned it earlier. I am just
> repeatedly asking how much memory are we talking about, how real is the
> issue and what are all the other conseqeunces. And for some reason you
> are not providing that information (or maybe I am just not seeing that
> in your responses) and that is why we are stuck in circle.
> 

Wtf are you talking about?  You're adding a bypass in this patch and then 
you're asking me to go and see how much memory it could potentially bypass 
and take away from oom handlers under the above memcg configuration?  This 
seems like something you should provide before throwing out patches that 
nobody has tested if you want to make the argument that the above memcg 
configuration is valid for handling userspace oom notifications.

And you certainly have dismissed what I've mentioned earlier when I said 
that anybody can add memory allocation to the exit path later on and 
nobody is going to think about how much memory this is going to bypass to 
the root memcg and potentially take away from userspace oom handlers.

There's two possible ways to forward this:

 - avoid bypass to the root memcg in every possible case such that the
   above memcg configuration actually makes a guarantee to userspace oom
   handlers attached to it, or

 - provide per-memcg memory reserves such that userspace oom handlers can
   allocate and charge memory without the above memcg configuration so 
   there is a guarantee.

What's not acceptable, now or ever, is suggesting a solution to a problem 
that is supposed to guarantee some resource and then allow under some 
circumstances that resource to be completely depleted such that the 
solution never works.

> Yes, and apart from GFP_NOFAIL we are allowing to bypass only those that
> should terminate in a short time. I think that having a setup with a
> guarantee of never triggering the global OOM is too ambitious and I am
> even skeptical it would be achievable.
> 

"Short time" is meaningless if the memory allocation causes memory to not 
be available to userspace oom handlers.  If allocations are allowed to be 
charged because you're in the exit() path or because you have SIGKILL, 
that can result in a system oom condition that would prevent userspace 
from being able to handle them.

> > I'm debating both fatal_signal_pending() and PF_EXITING here since they 
> > are now both bypasses, we need to remove fatal_signal_pending().  My 
> > simple question with your patch: how do you guarantee memory to processes 
> > attached to "a" and "b"?
> 
> The only way you can get that _guarantee_ is to account all the memory
> allocations. And that is not implemented and I would even question
> whether it is worthwhile. So we still have to live with a possibility
> of triggering the global OOM killer. That's why I believe we need to be
> able to tell the kernel what is the user policy for oom killer (that is
> a different discussion though).
> 

So you're saying that Tejun's suggested userspace oom handler 
configuration is pointless, correct?  We can certainly provide a guarantee 
if memory is reserved specifically for userspace oom handling like I 
proposed, the same way that memory reserves are guaranteed for oom killed 
processes.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>