Re: [patch 2/2] mm: memcontrol: default hierarchy interface for memory

Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> · Mon, 12 Jan 2015 15:37:16 -0800

On Thu,  8 Jan 2015 23:15:04 -0500 Johannes Weiner <hannes@xxxxxxxxxxx> wrote:

> Introduce the basic control files to account, partition, and limit
> memory using cgroups in default hierarchy mode.
> 
> This interface versioning allows us to address fundamental design
> issues in the existing memory cgroup interface, further explained
> below.  The old interface will be maintained indefinitely, but a
> clearer model and improved workload performance should encourage
> existing users to switch over to the new one eventually.
> 
> The control files are thus:
> 
>   - memory.current shows the current consumption of the cgroup and its
>     descendants, in bytes.
> 
>   - memory.low configures the lower end of the cgroup's expected
>     memory consumption range.  The kernel considers memory below that
>     boundary to be a reserve - the minimum that the workload needs in
>     order to make forward progress - and generally avoids reclaiming
>     it, unless there is an imminent risk of entering an OOM situation.

The code appears to be ascribing a special meaning to low==0: you can
write "none" to set this.  But I'm not seeing any description of this?

>   - memory.high configures the upper end of the cgroup's expected
>     memory consumption range.  A cgroup whose consumption grows beyond
>     this threshold is forced into direct reclaim, to work off the
>     excess and to throttle new allocations heavily, but is generally
>     allowed to continue and the OOM killer is not invoked.
> 
>   - memory.max configures the hard maximum amount of memory that the
>     cgroup is allowed to consume before the OOM killer is invoked.
> 
>   - memory.events shows event counters that indicate how often the
>     cgroup was reclaimed while below memory.low, how often it was
>     forced to reclaim excess beyond memory.high, how often it hit
>     memory.max, and how often it entered OOM due to memory.max.  This
>     allows users to identify configuration problems when observing a
>     degradation in workload performance.  An overcommitted system will
>     have an increased rate of low boundary breaches, whereas increased
>     rates of high limit breaches, maximum hits, or even OOM situations
>     will indicate internally overcommitted cgroups.
> 
> For existing users of memory cgroups, the following deviations from
> the current interface are worth pointing out and explaining:
> 
>   - The original lower boundary, the soft limit, is defined as a limit
>     that is per default unset.  As a result, the set of cgroups that
>     global reclaim prefers is opt-in, rather than opt-out.  The costs
>     for optimizing these mostly negative lookups are so high that the
>     implementation, despite its enormous size, does not even provide
>     the basic desirable behavior.  First off, the soft limit has no
>     hierarchical meaning.  All configured groups are organized in a
>     global rbtree and treated like equal peers, regardless where they
>     are located in the hierarchy.  This makes subtree delegation
>     impossible.  Second, the soft limit reclaim pass is so aggressive
>     that it not just introduces high allocation latencies into the
>     system, but also impacts system performance due to overreclaim, to
>     the point where the feature becomes self-defeating.
> 
>     The memory.low boundary on the other hand is a top-down allocated
>     reserve.  A cgroup enjoys reclaim protection when it and all its
>     ancestors are below their low boundaries, which makes delegation
>     of subtrees possible.  Secondly, new cgroups have no reserve per
>     default and in the common case most cgroups are eligible for the
>     preferred reclaim pass.  This allows the new low boundary to be
>     efficiently implemented with just a minor addition to the generic
>     reclaim code, without the need for out-of-band data structures and
>     reclaim passes.  Because the generic reclaim code considers all
>     cgroups except for the ones running low in the preferred first
>     reclaim pass, overreclaim of individual groups is eliminated as
>     well, resulting in much better overall workload performance.
> 
>   - The original high boundary, the hard limit, is defined as a strict
>     limit that can not budge, even if the OOM killer has to be called.
>     But this generally goes against the goal of making the most out of
>     the available memory.  The memory consumption of workloads varies
>     during runtime, and that requires users to overcommit.  But doing
>     that with a strict upper limit requires either a fairly accurate
>     prediction of the working set size or adding slack to the limit.
>     Since working set size estimation is hard and error prone, and
>     getting it wrong results in OOM kills, most users tend to err on
>     the side of a looser limit and end up wasting precious resources.
> 
>     The memory.high boundary on the other hand can be set much more
>     conservatively.  When hit, it throttles allocations by forcing
>     them into direct reclaim to work off the excess, but it never
>     invokes the OOM killer.  As a result, a high boundary that is
>     chosen too aggressively will not terminate the processes, but
>     instead it will lead to gradual performance degradation.  The user
>     can monitor this and make corrections until the minimal memory
>     footprint that still gives acceptable performance is found.
> 
>     In extreme cases, with many concurrent allocations and a complete
>     breakdown of reclaim progress within the group, the high boundary
>     can be exceeded.  But even then it's mostly better to satisfy the
>     allocation from the slack available in other groups or the rest of
>     the system than killing the group.  Otherwise, memory.max is there
>     to limit this type of spillover and ultimately contain buggy or
>     even malicious applications.
> 
>   - The existing control file names are unwieldy and inconsistent in
>     many different ways.  For example, the upper boundary hit count is
>     exported in the memory.failcnt file, but an OOM event count has to
>     be manually counted by listening to memory.oom_control events, and
>     lower boundary / soft limit events have to be counted by first
>     setting a threshold for that value and then counting those events.
>     Also, usage and limit files encode their units in the filename.
>     That makes the filenames very long, even though this is not
>     information that a user needs to be reminded of every time they
>     type out those names.
> 
>     To address these naming issues, as well as to signal clearly that
>     the new interface carries a new configuration model, the naming
>     conventions in it necessarily differ from the old interface.

This all sounds pretty major.  How much trouble is this change likely to
cause existing memcg users?

> include/linux/memcontrol.h |  32 ++++++
> mm/memcontrol.c            | 247 +++++++++++++++++++++++++++++++++++++++++++--
> mm/vmscan.c                |  22 +++-

No Documentation/cgroups/memory.txt?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>