Re: [patch 7/8] mm, memcg: allow processes handling oom notifications to access reserves

We have hierarchical "containers".  Jobs exist in these containers.  The containers can hold sub-containers.

In case of a system OOM we want to kill in strict priority order.  From the root of the hierarchy, choose the lowest-priority entity.  This could be a task or a memcg.  If it's a memcg, recurse.
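A minimal sketch of that selection walk (the tree shape and tuple encoding are illustrative only, not our actual interface):

```python
def pick_victim(node):
    """Recursively pick an OOM victim in strict priority order.

    `node` is either a task, ("task", pid, priority), or a memcg,
    ("memcg", children, priority), where `children` is a list of
    tasks and sub-memcgs.  Lower priority number = killed first.
    """
    kind = node[0]
    if kind == "task":
        return node[1]  # leaf task: this PID is the victim
    # memcg: descend into the lowest-priority child at this level
    children = node[1]
    lowest = min(children, key=lambda child: child[2])
    return pick_victim(lowest)
```

For example, with a root memcg holding a task at priority 5 and a sub-memcg at priority 2, the walk descends into the sub-memcg and kills its task, even if that task's own priority number is higher than 2 relative to siblings elsewhere: comparison only happens between siblings at the same level.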

We CAN do it in kernel (in fact we do, and I argued for that, and David acquiesced).  But doing it in kernel means changes are slow and risky.

What we really have is a bunch of features that we offer to our users that need certain OOM-time behaviors and guarantees to be implemented.  I don't expect that most of our changes are useful for anyone outside of Google, really. They come with a lot of environmental assumptions.  This is why David finally convinced me it was easier to release changes, to fix bugs, and to update kernels if we do this in userspace.

I apologize if I am not giving you what you want.  I am typing on a phone at the moment.  If this still doesn't help I can try from a computer later.

Tim

On Dec 7, 2013 11:07 AM, "Johannes Weiner" <hannes@xxxxxxxxxxx> wrote:
On Sat, Dec 07, 2013 at 10:12:19AM -0800, Tim Hockin wrote:
> You more or less described the fundamental change - a score per memcg, with
> a recursive OOM killer which evaluates scores between siblings at the same
> level.
>
> It gets a bit complicated because we need wider scoring ranges than
> are provided by default

If so, I'm sure you can make a convincing case to widen the internal
per-task score ranges.  The per-memcg score ranges have not even been
defined, so this is even easier.

> and because we score PIDs against memcgs at a given scope.

You are describing bits of a solution, not a problem.  And I can't
possibly infer a problem from this.

> We also have some tiebreaker heuristic (age).

Either periodically update the per-memcg score from userspace or
implement this in the kernel.  We have considered CPU usage
history/runtime etc. in the past when picking an OOM victim task.

But I'm again just speculating what your problem is, so this may or
may not be a feasible solution.

> We also have a handful of features that depend on OOM handling, like the
> aforementioned automatic growing, and changing the actual OOM score
> depending on usage in relation to various thresholds (e.g. we sold you X,
> and we allow you to go over X, but if you do, your likelihood of death in
> case of a system OOM goes up).

You can trivially monitor threshold events from userspace with the
existing infrastructure and accordingly update the per-memcg score.
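For instance, a userspace handler woken by a threshold event could recompute the score from the overage above the sold limit X. The curve below is a made-up example, and the per-memcg score knob it would feed is exactly the part that hasn't been defined yet:

```python
def overage_score(usage, sold_limit, base_score=0, max_score=1000):
    """Map memory usage relative to the sold limit X to an OOM score.

    At or below the limit the score stays at base_score; above it,
    the score rises linearly with the overage fraction, capped at
    max_score.  The linear curve is purely illustrative.
    """
    if usage <= sold_limit:
        return base_score
    overage = (usage - sold_limit) / sold_limit
    return min(max_score, base_score + int(overage * max_score))
```

The handler would call this each time a threshold notification fires and write the result into whatever per-memcg score interface ends up existing.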

> Do you really want us to teach the kernel policies like this?  It would be
> way easier to do and test in userspace.

Maybe.  Providing fragments of your solution is not an efficient way
to communicate the problem.  And you have to sell the problem before
anybody can be expected to even consider your proposal as one of the
possible solutions.
