Re: [patch 7/8] mm, memcg: allow processes handling oom notifications to access reserves

David Rientjes <rientjes@xxxxxxxxxx> · Mon, 9 Dec 2013 12:10:44 -0800 (PST)

On Fri, 6 Dec 2013, Tejun Heo wrote:

> > Tejun, how are you?
> 
> Doing pretty good.  How's yourself? :)
> 

Not bad, busy with holidays and all that.

> > I agree that we wouldn't need such support if we are only addressing memcg 
> > oom conditions.  We could do things like A/memory.limit_in_bytes == 128M 
> > and A/b/memory.limit_in_bytes == 126MB and then attach the process waiting 
> > on A/b/memory.oom_control to A and that would work perfect.
> 
> Or even just create a separate parallel cgroup A/memory.limit_in_bytes
> == 126M A-oom/memory.limit_in_bytes = 2M and avoid the extra layer of
> nesting.
> 

Indeed.  The setup I'm specifically trying to attack is where the sum of 
the limits of all non-oom handling memcgs (A/b in my model, A in yours) 
exceed the amount of RAM.  If the system has 256MB,

				/=256MB
	A=126MB		A-oom=2MB	B=188MB		B-oom=4MB

or

			/=256MB
	C=128MB				D=192MB
	C/a=126M			D/a=188MB

then it's possible for A + B or C/a + D/a to cause a system oom condition 
and meanwhile A-oom/tasks, B-oom/tasks, C/tasks, and D/tasks cannot 
allocate memory to handle it.

> > However, we also need to discuss system oom handling.  We have an interest 
> > in being able to allow userspace to handle system oom conditions since the 
> > policy will differ depending on machine and we can't encode every possible 
> > mechanism into the kernel.  For example, on system oom we want to kill a 
> > process from the lowest priority top-level memcg.  We lack that ability 
> > entirely in the kernel and since the sum of our top-level memcgs 
> > memory.limit_in_bytes exceeds the amount of present RAM, we run into these 
> > oom conditions a _lot_.
> > 
> > So the first step, in my opinion, is to add a system oom notification on 
> > the root memcg's memory.oom_control which currently allows registering an 
> > eventfd() notification but never actually triggers.  I did that in a patch 
> > and it is was merged into -mm but was pulled out for later discussion.
> 
> Hmmm... this seems to be a different topic.  You're saying that it'd
> be beneficial to add userland oom handling at the sytem level and if
> that happens having per-memcg oom reserve would be consistent with the
> system-wide one, right?

Right, and apologies for not discussing the system oom handling here since 
its notification on the root memcg is currently being debated as well.  
The idea is that admins and users aren't going to be concerned about 
memory allocation through the page allocator vs memory charging through 
the memory controller; they simply want memory for their userspace oom 
handling.  And since the notification would be tied to the root memcg, it 
makes sense to make the amount of memory allowed to allocate exclusively 
for these handlers a memcg interface.  So the cleanest solution, in my 
opinion, was to add the interface as part of memcg.

> While I can see some merit in that argument,
> the whole thing is predicated on system level userland oom handling
> being justified && even then I'm not quite sure whether "consistent
> interface" is enough to have oom reserve in all memory cgroups.  It
> feels a bit backwards because, here, the root memcg is the exception,
> not the other way around.  Root is the only one which can't put oom
> handler in a separate cgroup, so it could make more sense to special
> case that rather than spreading the interface for global userland oom
> to everyone else.
> 

It's really the same thing, though, from the user perspective.  They don't 
care about page allocation failure vs memcg charge failure, they simply 
want to ensure that the memory set aside for memory.oom_reserve_in_bytes 
is available in oom conditions.  With the suggested alternatives:

				/=256MB
	A=126MB		A-oom=2MB	B=188MB		B-oom=4MB

or

			/=256MB
	C=128MB				D=192MB
	C/a=126M			D/a=188MB

we can't distinguish between what is able to allocate below per-zone min 
watermarks in the page allocator as the oom reserve.  The key point is 
that the root memcg is not the only memcg concerned with page allocator 
memory reserves, it's any oom reserve.  If A's usage is 124MB and B's 
usage is 132MB, we can't specify that processes attached to B-oom should 
be able to bypass per-zone min watermarks without an interface such as 
that being proposed.

> But, before that, system level userland OOM handling sounds scary to
> me.  I thought about userland OOM handling for memcgs and it does make
> some sense.  ie. there is a different action that userland oom handler
> can take which kernel oom handler can't - it can expand the limit of
> the offending cgroup, effectively using OOM handler as a sizing
> estimator.  I'm not sure whether that in itself is a good idea but
> then again it might not be possible to clearly separate out sizing
> from oom conditions.
> 
> Anyways, but for system level OOM handling, there's no other action
> userland handler can take.  It's not like the OOM handler paging the
> admin to install more memory is a reasonable mode of operation to
> support.  The *only* action userland OOM handler can take is killing
> something.  Now, if that's the case and we have kernel OOM handler
> anyway, I think the best course of action is improving kernel OOM
> handler and teach it to make the decisions that the userland handler
> would consider good.  That should be doable, right?
> 

It's much more powerful than that; you're referring to the mechanism to 
guarantee future memory freeing so the system or memcg is no longer oom, 
and that's only one case of possible handling.  I have a customer who 
wants to save heap profiles at the time of oom as well, for example, and 
their sole desire is to be able to capture memory statistics before the 
oom kill takes place.  The sine qua non is that memory reserves allow 
something to be done in such conditions: if you try to do a "ps" or "ls" 
or cat a file in an oom memcg, you hang.  We need better functionality to 
ensure that we can do some action prior to the oom kill itself, whether 
that comes from userspace or the kernel.  We simply cannot rely on things 
like memory thresholds or vmpressure to grab these heap profiles, there is 
no guarantee that memory will not be exhausted and the oom kill would 
already have taken place before the process handling the notification 
wakes up.  (And any argument that it is possible by simply making the 
threshold happen early enough is a non-starter: it does not guarantee the 
heaps are collected for oom conditions and the oom kill can still occur 
prematurely in machines that overcommit their memcg limits, as we do.)

> The thing is OOM handling in userland is an inherently fragile thing
> and it can *never* replace kernel OOM handling.  You may reserve any
> amount of memory you want but there would still be cases that it may
> fail.  It's not like we have owner-based allocation all through the
> kernel or are willing to pay overhead for such thing.  Even if that
> part can be guaranteed somehow (no idea how), the kernel still can
> NEVER trust the userland OOM handler.  No matter what we do, we need a
> kernel OOM handler with no resource dependency.
> 

I was never an advocate for the current memory.oom_control behavior that 
allows you to disable the oom killer indefinitely for a memcg and I agree 
that it is dangerous if userspace will not cause future memory freeing or 
toggle the value such that the kernel will kill something.  So I agree 
with you with today's functionality, not with the functionality that this 
patchset, and the notification on the root memcg for system oom 
conditions, provides.  I also proposed a memory.oom_delay_millisecs that 
we have used for several years dating back to even cpusets that simply 
delays the oom kill such that userspace can do "something" like send a 
kill itself, collect heap profiles, send a signal to our malloc() 
implementation to free arena memory, etc. prior to the kernel oom kill.

> So, there isn't anything userland OOM handler can inherently do better
> and we can't do away with kernel handler no matter what.  On both
> accounts, it seems like the best course of action is making
> system-wide kernel OOM handler to make better decisions if possible at
> all.  If that's impossible, let's first think about why that's the
> case before hastly opening this new can of worms.
> 

We certainly can get away with the kernel oom killer in 99% of cases with 
this functionality for users who choose to have their own oom handling 
implementations.  We also can't possibly code every single handling policy 
into the kernel: we can't guarantee that our version of malloc() is 
guaranteed to be able to free memory back to the kernel when waking up on 
a memory.oom_control notification prior to the memcg oom killer killing 
something, for example, without this functionality.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>