Re: [patch 7/8] mm, memcg: allow processes handling oom notifications to access reserves

We have hierarchical "containers".  Jobs exist in these containers.  The containers can hold sub-containers.

In case of a system OOM we want to kill in strict priority order.  From the root of the hierarchy, choose the lowest-priority entity.  This could be a task or a memcg.  If it's a memcg, recurse.
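A minimal sketch of that selection walk (the tree shape and tuple encoding are illustrative only, not our actual interface):

```python
def pick_victim(node):
    """Recursively pick an OOM victim in strict priority order.

    `node` is either a task, ("task", pid, priority), or a memcg,
    ("memcg", children, priority), where `children` is a list of
    tasks and sub-memcgs.  Lower priority number = killed first.
    """
    kind = node[0]
    if kind == "task":
        return node[1]  # leaf task: this PID is the victim
    # memcg: descend into the lowest-priority child at this level
    children = node[1]
    lowest = min(children, key=lambda child: child[2])
    return pick_victim(lowest)
```

For example, with a root memcg holding a task at priority 5 and a sub-memcg at priority 2, the walk descends into the sub-memcg and kills its task, even if that task's own priority number is higher than 2 relative to siblings elsewhere: comparison only happens between siblings at the same level.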

We CAN do it in kernel (in fact we do, and I argued for that, and David acquiesced).  But doing it in kernel means changes are slow and risky.

What we really have is a bunch of features that we offer to our users that need certain OOM-time behaviors and guarantees to be implemented.  I don't expect that most of our changes are useful for anyone outside of Google, really. They come with a lot of environmental assumptions.  This is why David finally convinced me it was easier to release changes, to fix bugs, and to update kernels if we do this in userspace.

I apologize if I am not giving you what you want.  I am typing on a phone at the moment.  If this still doesn't help I can try from a computer later.

Tim

On Dec 7, 2013 11:07 AM, "Johannes Weiner" <hannes@xxxxxxxxxxx> wrote:
On Sat, Dec 07, 2013 at 10:12:19AM -0800, Tim Hockin wrote:
> You more or less described the fundamental change - a score per memcg, with
> a recursive OOM killer which evaluates scores between siblings at the same
> level.
>
> It gets a bit complicated because we need wider scoring ranges than
> are provided by default

If so, I'm sure you can make a convincing case to widen the internal
per-task score ranges.  The per-memcg score ranges have not even been
defined, so this is even easier.

> and because we score PIDs against memcgs at a given scope.

You are describing bits of a solution, not a problem.  And I can't
possibly infer a problem from this.

> We also have some tiebreaker heuristic (age).

Either periodically update the per-memcg score from userspace or
implement this in the kernel.  We have considered CPU usage
history/runtime etc. in the past when picking an OOM victim task.

But I'm again just speculating what your problem is, so this may or
may not be a feasible solution.

> We also have a handful of features that depend on OOM handling, like the
> aforementioned automatic growing, and changing the actual OOM score
> depending on usage in relation to various thresholds (e.g. we sold you X,
> and we allow you to go over X, but if you do, your likelihood of death in
> case of a system OOM goes up).

You can trivially monitor threshold events from userspace with the
existing infrastructure and accordingly update the per-memcg score.
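For instance, a userspace handler woken by a threshold event could recompute the score from the overage above the sold limit X. The curve below is a made-up example, and the per-memcg score knob it would feed is exactly the part that hasn't been defined yet:

```python
def overage_score(usage, sold_limit, base_score=0, max_score=1000):
    """Map memory usage relative to the sold limit X to an OOM score.

    At or below the limit the score stays at base_score; above it,
    the score rises linearly with the overage fraction, capped at
    max_score.  The linear curve is purely illustrative.
    """
    if usage <= sold_limit:
        return base_score
    overage = (usage - sold_limit) / sold_limit
    return min(max_score, base_score + int(overage * max_score))
```

The handler would call this each time a threshold notification fires and write the result into whatever per-memcg score interface ends up existing.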

> Do you really want us to teach the kernel policies like this?  It would be
> way easier to do and test in userspace.

Maybe.  Providing fragments of your solution is not an efficient way
to communicate the problem.  And you have to sell the problem before
anybody can be expected to even consider your proposal as one of the
possible solutions.
