On Tue 20-01-15 10:31:55, Johannes Weiner wrote:
> Introduce the basic control files to account, partition, and limit
> memory using cgroups in default hierarchy mode.
>
> This interface versioning allows us to address fundamental design
> issues in the existing memory cgroup interface, further explained
> below. The old interface will be maintained indefinitely, but a
> clearer model and improved workload performance should encourage
> existing users to switch over to the new one eventually.
>
> The control files are:
>
> - memory.current shows the current consumption of the cgroup and its
>   descendants, in bytes.
>
> - memory.low configures the lower end of the cgroup's expected
>   memory consumption range. The kernel considers memory below that
>   boundary to be a reserve - the minimum that the workload needs in
>   order to make forward progress - and generally avoids reclaiming
>   it, unless there is an imminent risk of entering an OOM situation.
>
> - memory.high configures the upper end of the cgroup's expected
>   memory consumption range. A cgroup whose consumption grows beyond
>   this threshold is forced into direct reclaim, to work off the
>   excess and to throttle new allocations heavily, but is generally
>   allowed to continue and the OOM killer is not invoked.
>
> - memory.max configures the hard maximum amount of memory that the
>   cgroup is allowed to consume before the OOM killer is invoked.
>
> - memory.events shows event counters that indicate how often the
>   cgroup was reclaimed while below memory.low, how often it was
>   forced to reclaim excess beyond memory.high, how often it hit
>   memory.max, and how often it entered OOM due to memory.max. This
>   allows users to identify configuration problems when observing a
>   degradation in workload performance. An overcommitted system will
>   have an increased rate of low boundary breaches, whereas increased
>   rates of high limit breaches, maximum hits, or even OOM situations
>   will indicate internally overcommitted cgroups.
>
> For existing users of memory cgroups, the following deviations from
> the current interface are worth pointing out and explaining:
>
> - The original lower boundary, the soft limit, is defined as a limit
>   that is unset by default. As a result, the set of cgroups that
>   global reclaim prefers is opt-in, rather than opt-out. The costs
>   for optimizing these mostly negative lookups are so high that the
>   implementation, despite its enormous size, does not even provide
>   the basic desirable behavior. First off, the soft limit has no
>   hierarchical meaning. All configured groups are organized in a
>   global rbtree and treated like equal peers, regardless of where
>   they are located in the hierarchy. This makes subtree delegation
>   impossible. Second, the soft limit reclaim pass is so aggressive
>   that it not only introduces high allocation latencies into the
>   system, but also impacts system performance due to overreclaim,
>   to the point where the feature becomes self-defeating.
>
>   The memory.low boundary on the other hand is a top-down allocated
>   reserve. A cgroup enjoys reclaim protection when it and all its
>   ancestors are below their low boundaries, which makes delegation
>   of subtrees possible. Second, new cgroups have no reserve by
>   default and in the common case most cgroups are eligible for the
>   preferred reclaim pass. This allows the new low boundary to be
>   efficiently implemented with just a minor addition to the generic
>   reclaim code, without the need for out-of-band data structures
>   and reclaim passes. Because the generic reclaim code considers
>   all cgroups except for the ones running low in the preferred
>   first reclaim pass, overreclaim of individual groups is
>   eliminated as well, resulting in much better overall workload
>   performance.
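To restate the protection rule above in code: a group is protected only
while it and every ancestor up to the reclaim root stay within their
reserves. The struct and helper below are made up for illustration; the
real check is mem_cgroup_low() further down in the patch:

struct group {
	struct group *parent;
	unsigned long usage;	/* current consumption, bytes */
	unsigned long low;	/* configured reserve, bytes */
};

/* Protected only if the whole path up to root is within its reserve. */
static int group_is_protected(struct group *root, struct group *g)
{
	for (; g; g = g->parent) {
		if (g->usage > g->low)
			return 0;	/* one breach voids the protection */
		if (g == root)
			break;
	}
	return 1;
}
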
>
> - The original high boundary, the hard limit, is defined as a strict
>   limit that cannot budge, even if the OOM killer has to be called.
>   But this generally goes against the goal of making the most out of
>   the available memory. The memory consumption of workloads varies
>   during runtime, and that requires users to overcommit. But doing
>   that with a strict upper limit requires either a fairly accurate
>   prediction of the working set size or adding slack to the limit.
>   Since working set size estimation is hard and error prone, and
>   getting it wrong results in OOM kills, most users tend to err on
>   the side of a looser limit and end up wasting precious resources.
>
>   The memory.high boundary on the other hand can be set much more
>   conservatively. When hit, it throttles allocations by forcing
>   them into direct reclaim to work off the excess, but it never
>   invokes the OOM killer. As a result, a high boundary that is
>   chosen too aggressively will not terminate the processes, but
>   instead it will lead to gradual performance degradation. The user
>   can monitor this and make corrections until the minimal memory
>   footprint that still gives acceptable performance is found.
>
>   In extreme cases, with many concurrent allocations and a complete
>   breakdown of reclaim progress within the group, the high boundary
>   can be exceeded. But even then it's mostly better to satisfy the
>   allocation from the slack available in other groups or the rest of
>   the system than killing the group. Otherwise, memory.max is there
>   to limit this type of spillover and ultimately contain buggy or
>   even malicious applications.
>
> - The original control file names are unwieldy and inconsistent in
>   many different ways. For example, the upper boundary hit count is
>   exported in the memory.failcnt file, but an OOM event count has to
>   be manually counted by listening to memory.oom_control events, and
>   lower boundary / soft limit events have to be counted by first
>   setting a threshold for that value and then counting those events.
>   Also, usage and limit files encode their units in the filename.
>   That makes the filenames very long, even though this is not
>   information that a user needs to be reminded of every time they
>   type out those names.
>
>   To address these naming issues, as well as to signal clearly that
>   the new interface carries a new configuration model, the naming
>   conventions in it necessarily differ from the old interface.
>
> - The original limit files indicate the state of an unset limit with
>   a very high number, and a configured limit can be unset by echoing
>   -1 into those files. But that very high number is implementation
>   and architecture dependent and not very descriptive. And while -1
>   can be understood as an underflow into the highest possible value,
>   -2 or -10M etc. do not work, so it's not consistent.
>
>   memory.low, memory.high, and memory.max will use the string
>   "infinity" to indicate and set the highest possible value.
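For illustration, the "infinity" convention as seen from userspace; the
mount point and group name below are hypothetical, and error handling
is minimal:

#include <stdio.h>

int main(void)
{
	const char *path = "/sys/fs/cgroup/workload/memory.high";
	char buf[32];
	FILE *f;

	f = fopen(path, "w");
	if (!f)
		return 1;
	fputs("infinity\n", f);	/* unset the boundary */
	fclose(f);

	f = fopen(path, "r");
	if (!f)
		return 1;
	if (fgets(buf, sizeof(buf), f))
		printf("memory.high = %s", buf);	/* prints "infinity" */
	fclose(f);
	return 0;
}

Writing a regular size string (e.g. "10G") should work the same way, as
the write handlers go through page_counter_memparse().
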
>
> [akpm@xxxxxxxxxxxxxxxxxxxx: use seq_puts() for basic strings]
> Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>
> Cc: Michal Hocko <mhocko@xxxxxxx>
> Cc: Vladimir Davydov <vdavydov@xxxxxxxxxxxxx>
> Cc: Greg Thelen <gthelen@xxxxxxxxxx>
> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>

Acked-by: Michal Hocko <mhocko@xxxxxxx>

> ---
>  Documentation/cgroups/unified-hierarchy.txt |  79 ++++++++++
>  include/linux/memcontrol.h                  |  32 ++++
>  mm/memcontrol.c                             | 229 ++++++++++++++++++++++++++--
>  mm/vmscan.c                                 |  22 ++-
>  4 files changed, 348 insertions(+), 14 deletions(-)
>
> diff --git a/Documentation/cgroups/unified-hierarchy.txt b/Documentation/cgroups/unified-hierarchy.txt
> index 4f4563277864..71daa35ec2d9 100644
> --- a/Documentation/cgroups/unified-hierarchy.txt
> +++ b/Documentation/cgroups/unified-hierarchy.txt
> @@ -327,6 +327,85 @@ supported and the interface files "release_agent" and
>  - use_hierarchy is on by default and the cgroup file for the flag is
>    not created.
>
> +- The original lower boundary, the soft limit, is defined as a limit
> +  that is unset by default. As a result, the set of cgroups that
> +  global reclaim prefers is opt-in, rather than opt-out. The costs
> +  for optimizing these mostly negative lookups are so high that the
> +  implementation, despite its enormous size, does not even provide
> +  the basic desirable behavior. First off, the soft limit has no
> +  hierarchical meaning. All configured groups are organized in a
> +  global rbtree and treated like equal peers, regardless of where
> +  they are located in the hierarchy. This makes subtree delegation
> +  impossible. Second, the soft limit reclaim pass is so aggressive
> +  that it not only introduces high allocation latencies into the
> +  system, but also impacts system performance due to overreclaim, to
> +  the point where the feature becomes self-defeating.
> +
> +  The memory.low boundary on the other hand is a top-down allocated
> +  reserve. A cgroup enjoys reclaim protection when it and all its
> +  ancestors are below their low boundaries, which makes delegation of
> +  subtrees possible. Second, new cgroups have no reserve by default
> +  and in the common case most cgroups are eligible for the preferred
> +  reclaim pass. This allows the new low boundary to be efficiently
> +  implemented with just a minor addition to the generic reclaim code,
> +  without the need for out-of-band data structures and reclaim
> +  passes. Because the generic reclaim code considers all cgroups
> +  except for the ones running low in the preferred first reclaim
> +  pass, overreclaim of individual groups is eliminated as well,
> +  resulting in much better overall workload performance.
> +
> +- The original high boundary, the hard limit, is defined as a strict
> +  limit that cannot budge, even if the OOM killer has to be called.
> +  But this generally goes against the goal of making the most out of
> +  the available memory. The memory consumption of workloads varies
> +  during runtime, and that requires users to overcommit. But doing
> +  that with a strict upper limit requires either a fairly accurate
> +  prediction of the working set size or adding slack to the limit.
> +  Since working set size estimation is hard and error prone, and
> +  getting it wrong results in OOM kills, most users tend to err on
> +  the side of a looser limit and end up wasting precious resources.
> +
> +  The memory.high boundary on the other hand can be set much more
> +  conservatively. When hit, it throttles allocations by forcing them
> +  into direct reclaim to work off the excess, but it never invokes
> +  the OOM killer. As a result, a high boundary that is chosen too
> +  aggressively will not terminate the processes, but instead it will
> +  lead to gradual performance degradation. The user can monitor this
> +  and make corrections until the minimal memory footprint that still
> +  gives acceptable performance is found.
> +
> +  In extreme cases, with many concurrent allocations and a complete
> +  breakdown of reclaim progress within the group, the high boundary
> +  can be exceeded. But even then it's mostly better to satisfy the
> +  allocation from the slack available in other groups or the rest of
> +  the system than killing the group. Otherwise, memory.max is there
> +  to limit this type of spillover and ultimately contain buggy or
> +  even malicious applications.
> +
> +- The original control file names are unwieldy and inconsistent in
> +  many different ways. For example, the upper boundary hit count is
> +  exported in the memory.failcnt file, but an OOM event count has to
> +  be manually counted by listening to memory.oom_control events, and
> +  lower boundary / soft limit events have to be counted by first
> +  setting a threshold for that value and then counting those events.
> +  Also, usage and limit files encode their units in the filename.
> +  That makes the filenames very long, even though this is not
> +  information that a user needs to be reminded of every time they
> +  type out those names.
> +
> +  To address these naming issues, as well as to signal clearly that
> +  the new interface carries a new configuration model, the naming
> +  conventions in it necessarily differ from the old interface.
> +
> +- The original limit files indicate the state of an unset limit with
> +  a very high number, and a configured limit can be unset by echoing
> +  -1 into those files. But that very high number is implementation
> +  and architecture dependent and not very descriptive. And while -1
> +  can be understood as an underflow into the highest possible value,
> +  -2 or -10M etc. do not work, so it's not consistent.
> +
> +  memory.low, memory.high, and memory.max will use the string
> +  "infinity" to indicate and set the highest possible value.
>
>  5. Planned Changes
>
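As a usage sketch for the event counters: memory_events_show() below
emits four "name count" lines, which a monitor might parse like this
(the cgroup path is hypothetical):

#include <stdio.h>
#include <string.h>

int main(void)
{
	FILE *f = fopen("/sys/fs/cgroup/workload/memory.events", "r");
	char key[16];
	unsigned long val;

	if (!f)
		return 1;
	/* Four "name count" lines: low, high, max, oom. */
	while (fscanf(f, "%15s %lu", key, &val) == 2) {
		if (!strcmp(key, "oom") && val)
			fprintf(stderr, "memory.max too tight: %lu OOMs\n", val);
		else
			printf("%s events: %lu\n", key, val);
	}
	fclose(f);
	return 0;
}
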
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 76f489fad640..72dff5fb0d0c 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -52,7 +52,27 @@ struct mem_cgroup_reclaim_cookie {
>  	unsigned int generation;
>  };
>
> +enum mem_cgroup_events_index {
> +	MEM_CGROUP_EVENTS_PGPGIN,	/* # of pages paged in */
> +	MEM_CGROUP_EVENTS_PGPGOUT,	/* # of pages paged out */
> +	MEM_CGROUP_EVENTS_PGFAULT,	/* # of page-faults */
> +	MEM_CGROUP_EVENTS_PGMAJFAULT,	/* # of major page-faults */
> +	MEM_CGROUP_EVENTS_NSTATS,
> +	/* default hierarchy events */
> +	MEMCG_LOW = MEM_CGROUP_EVENTS_NSTATS,
> +	MEMCG_HIGH,
> +	MEMCG_MAX,
> +	MEMCG_OOM,
> +	MEMCG_NR_EVENTS,
> +};
> +
>  #ifdef CONFIG_MEMCG
> +void mem_cgroup_events(struct mem_cgroup *memcg,
> +		       enum mem_cgroup_events_index idx,
> +		       unsigned int nr);
> +
> +bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg);
> +
>  int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
>  			  gfp_t gfp_mask, struct mem_cgroup **memcgp);
>  void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
> @@ -175,6 +195,18 @@ void mem_cgroup_split_huge_fixup(struct page *head);
>  #else /* CONFIG_MEMCG */
>  struct mem_cgroup;
>
> +static inline void mem_cgroup_events(struct mem_cgroup *memcg,
> +				     enum mem_cgroup_events_index idx,
> +				     unsigned int nr)
> +{
> +}
> +
> +static inline bool mem_cgroup_low(struct mem_cgroup *root,
> +				  struct mem_cgroup *memcg)
> +{
> +	return false;
> +}
> +
>  static inline int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
>  					gfp_t gfp_mask,
>  					struct mem_cgroup **memcgp)
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a3592a756ad9..5730886e3b0e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -97,14 +97,6 @@ static const char * const mem_cgroup_stat_names[] = {
>  	"swap",
>  };
>
> -enum mem_cgroup_events_index {
> -	MEM_CGROUP_EVENTS_PGPGIN,	/* # of pages paged in */
> -	MEM_CGROUP_EVENTS_PGPGOUT,	/* # of pages paged out */
> -	MEM_CGROUP_EVENTS_PGFAULT,	/* # of page-faults */
> -	MEM_CGROUP_EVENTS_PGMAJFAULT,	/* # of major page-faults */
> -	MEM_CGROUP_EVENTS_NSTATS,
> -};
> -
>  static const char * const mem_cgroup_events_names[] = {
>  	"pgpgin",
>  	"pgpgout",
> @@ -138,7 +130,7 @@ enum mem_cgroup_events_target {
>
>  struct mem_cgroup_stat_cpu {
>  	long count[MEM_CGROUP_STAT_NSTATS];
> -	unsigned long events[MEM_CGROUP_EVENTS_NSTATS];
> +	unsigned long events[MEMCG_NR_EVENTS];
>  	unsigned long nr_page_events;
>  	unsigned long targets[MEM_CGROUP_NTARGETS];
>  };
> @@ -284,6 +276,10 @@ struct mem_cgroup {
>  	struct page_counter memsw;
>  	struct page_counter kmem;
>
> +	/* Normal memory consumption range */
> +	unsigned long low;
> +	unsigned long high;
> +
>  	unsigned long soft_limit;
>
>  	/* vmpressure notifications */
> @@ -2327,6 +2323,8 @@ retry:
>  	if (!(gfp_mask & __GFP_WAIT))
>  		goto nomem;
>
> +	mem_cgroup_events(mem_over_limit, MEMCG_MAX, 1);
> +
>  	nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
>  						    gfp_mask, may_swap);
>
> @@ -2368,6 +2366,8 @@ retry:
>  	if (fatal_signal_pending(current))
>  		goto bypass;
>
> +	mem_cgroup_events(mem_over_limit, MEMCG_OOM, 1);
> +
>  	mem_cgroup_oom(mem_over_limit, gfp_mask, get_order(nr_pages));
>  nomem:
>  	if (!(gfp_mask & __GFP_NOFAIL))
> @@ -2379,6 +2379,16 @@ done_restock:
>  	css_get_many(&memcg->css, batch);
>  	if (batch > nr_pages)
>  		refill_stock(memcg, batch - nr_pages);
> +	/*
> +	 * If the hierarchy is above the normal consumption range,
> +	 * make the charging task trim their excess contribution.
> +	 */
> +	do {
> +		if (page_counter_read(&memcg->memory) <= memcg->high)
> +			continue;
> +		mem_cgroup_events(memcg, MEMCG_HIGH, 1);
> +		try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true);
> +	} while ((memcg = parent_mem_cgroup(memcg)));
>  done:
>  	return ret;
>  }
>
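The ancestor walk above makes the charging task work off a high breach
anywhere up the hierarchy, counting a MEMCG_HIGH event per breached
level. Behaviorally, per the changelog, a toy load like the one below -
run in a group whose memory.high is assumed to be set under 64M, with
memory.max left above it - should get throttled into direct reclaim
rather than OOM-killed:

#include <stdlib.h>
#include <string.h>

int main(void)
{
	size_t sz = 64UL << 20;	/* 64M, assumed above memory.high */
	char *buf = malloc(sz);

	if (!buf)
		return 1;
	memset(buf, 1, sz);	/* fault pages in; charging and reclaim happen here */
	free(buf);
	return 0;
}
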
> @@ -4304,7 +4314,7 @@ out_kfree:
>  	return ret;
>  }
>
> -static struct cftype mem_cgroup_files[] = {
> +static struct cftype mem_cgroup_legacy_files[] = {
>  	{
>  		.name = "usage_in_bytes",
>  		.private = MEMFILE_PRIVATE(_MEM, RES_USAGE),
> @@ -4580,6 +4590,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>  	if (parent_css == NULL) {
>  		root_mem_cgroup = memcg;
>  		page_counter_init(&memcg->memory, NULL);
> +		memcg->high = PAGE_COUNTER_MAX;
>  		memcg->soft_limit = PAGE_COUNTER_MAX;
>  		page_counter_init(&memcg->memsw, NULL);
>  		page_counter_init(&memcg->kmem, NULL);
> @@ -4625,6 +4636,7 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
>
>  	if (parent->use_hierarchy) {
>  		page_counter_init(&memcg->memory, &parent->memory);
> +		memcg->high = PAGE_COUNTER_MAX;
>  		memcg->soft_limit = PAGE_COUNTER_MAX;
>  		page_counter_init(&memcg->memsw, &parent->memsw);
>  		page_counter_init(&memcg->kmem, &parent->kmem);
> @@ -4635,6 +4647,7 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
>  		 */
>  	} else {
>  		page_counter_init(&memcg->memory, NULL);
> +		memcg->high = PAGE_COUNTER_MAX;
>  		memcg->soft_limit = PAGE_COUNTER_MAX;
>  		page_counter_init(&memcg->memsw, NULL);
>  		page_counter_init(&memcg->kmem, NULL);
> @@ -4710,6 +4723,8 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css)
>  	mem_cgroup_resize_limit(memcg, PAGE_COUNTER_MAX);
>  	mem_cgroup_resize_memsw_limit(memcg, PAGE_COUNTER_MAX);
>  	memcg_update_kmem_limit(memcg, PAGE_COUNTER_MAX);
> +	memcg->low = 0;
> +	memcg->high = PAGE_COUNTER_MAX;
>  	memcg->soft_limit = PAGE_COUNTER_MAX;
>  }
>
> @@ -5296,6 +5311,147 @@ static void mem_cgroup_bind(struct cgroup_subsys_state *root_css)
>  	mem_cgroup_from_css(root_css)->use_hierarchy = true;
>  }
>
> +static u64 memory_current_read(struct cgroup_subsys_state *css,
> +			       struct cftype *cft)
> +{
> +	return mem_cgroup_usage(mem_cgroup_from_css(css), false);
> +}
> +
> +static int memory_low_show(struct seq_file *m, void *v)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
> +	unsigned long low = ACCESS_ONCE(memcg->low);
> +
> +	if (low == PAGE_COUNTER_MAX)
> +		seq_puts(m, "infinity\n");
> +	else
> +		seq_printf(m, "%llu\n", (u64)low * PAGE_SIZE);
> +
> +	return 0;
> +}
> +
> +static ssize_t memory_low_write(struct kernfs_open_file *of,
> +				char *buf, size_t nbytes, loff_t off)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> +	unsigned long low;
> +	int err;
> +
> +	buf = strstrip(buf);
> +	err = page_counter_memparse(buf, "infinity", &low);
> +	if (err)
> +		return err;
> +
> +	memcg->low = low;
> +
> +	return nbytes;
> +}
> +
> +static int memory_high_show(struct seq_file *m, void *v)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
> +	unsigned long high = ACCESS_ONCE(memcg->high);
> +
> +	if (high == PAGE_COUNTER_MAX)
> +		seq_puts(m, "infinity\n");
> +	else
> +		seq_printf(m, "%llu\n", (u64)high * PAGE_SIZE);
> +
> +	return 0;
> +}
> +
> +static ssize_t memory_high_write(struct kernfs_open_file *of,
> +				 char *buf, size_t nbytes, loff_t off)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> +	unsigned long high;
> +	int err;
> +
> +	buf = strstrip(buf);
> +	err = page_counter_memparse(buf, "infinity", &high);
> +	if (err)
> +		return err;
> +
> +	memcg->high = high;
> +
> +	return nbytes;
> +}
> +
> +static int memory_max_show(struct seq_file *m, void *v)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
> +	unsigned long max = ACCESS_ONCE(memcg->memory.limit);
> +
> +	if (max == PAGE_COUNTER_MAX)
> +		seq_puts(m, "infinity\n");
> +	else
> +		seq_printf(m, "%llu\n", (u64)max * PAGE_SIZE);
> +
> +	return 0;
> +}
> +
> +static ssize_t memory_max_write(struct kernfs_open_file *of,
> +				char *buf, size_t nbytes, loff_t off)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> +	unsigned long max;
> +	int err;
> +
> +	buf = strstrip(buf);
> +	err = page_counter_memparse(buf, "infinity", &max);
> +	if (err)
> +		return err;
> +
> +	err = mem_cgroup_resize_limit(memcg, max);
> +	if (err)
> +		return err;
> +
> +	return nbytes;
> +}
> +
> +static int memory_events_show(struct seq_file *m, void *v)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
> +
> +	seq_printf(m, "low %lu\n", mem_cgroup_read_events(memcg, MEMCG_LOW));
> +	seq_printf(m, "high %lu\n", mem_cgroup_read_events(memcg, MEMCG_HIGH));
> +	seq_printf(m, "max %lu\n", mem_cgroup_read_events(memcg, MEMCG_MAX));
> +	seq_printf(m, "oom %lu\n", mem_cgroup_read_events(memcg, MEMCG_OOM));
> +
> +	return 0;
> +}
> +
> +static struct cftype memory_files[] = {
> +	{
> +		.name = "current",
> +		.read_u64 = memory_current_read,
> +	},
> +	{
> +		.name = "low",
> +		.flags = CFTYPE_NOT_ON_ROOT,
> +		.seq_show = memory_low_show,
> +		.write = memory_low_write,
> +	},
> +	{
> +		.name = "high",
> +		.flags = CFTYPE_NOT_ON_ROOT,
> +		.seq_show = memory_high_show,
> +		.write = memory_high_write,
> +	},
> +	{
> +		.name = "max",
> +		.flags = CFTYPE_NOT_ON_ROOT,
> +		.seq_show = memory_max_show,
> +		.write = memory_max_write,
> +	},
> +	{
> +		.name = "events",
> +		.flags = CFTYPE_NOT_ON_ROOT,
> +		.seq_show = memory_events_show,
> +	},
> +	{ }	/* terminate */
> +};
> +
>  struct cgroup_subsys memory_cgrp_subsys = {
>  	.css_alloc = mem_cgroup_css_alloc,
>  	.css_online = mem_cgroup_css_online,
> @@ -5306,7 +5462,8 @@ struct cgroup_subsys memory_cgrp_subsys = {
>  	.cancel_attach = mem_cgroup_cancel_attach,
>  	.attach = mem_cgroup_move_task,
>  	.bind = mem_cgroup_bind,
> -	.legacy_cftypes = mem_cgroup_files,
> +	.dfl_cftypes = memory_files,
> +	.legacy_cftypes = mem_cgroup_legacy_files,
>  	.early_init = 0,
>  };
>
> @@ -5341,6 +5498,56 @@ static void __init enable_swap_cgroup(void)
>  }
>  #endif
>
> +/**
> + * mem_cgroup_events - count memory events against a cgroup
> + * @memcg: the memory cgroup
> + * @idx: the event index
> + * @nr: the number of events to account for
> + */
> +void mem_cgroup_events(struct mem_cgroup *memcg,
> +		       enum mem_cgroup_events_index idx,
> +		       unsigned int nr)
> +{
> +	this_cpu_add(memcg->stat->events[idx], nr);
> +}
> +
> +/**
> + * mem_cgroup_low - check if memory consumption is below the normal range
> + * @root: the highest ancestor to consider
> + * @memcg: the memory cgroup to check
> + *
> + * Returns %true if memory consumption of @memcg, and that of all
> + * configurable ancestors up to @root, is below the normal range.
> + */
> +bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg)
> +{
> +	if (mem_cgroup_disabled())
> +		return false;
> +
> +	/*
> +	 * The toplevel group doesn't have a configurable range, so
> +	 * it's never low when looked at directly, and it is not
> +	 * considered an ancestor when assessing the hierarchy.
> + */ > + > + if (memcg == root_mem_cgroup) > + return false; > + > + if (page_counter_read(&memcg->memory) > memcg->low) > + return false; > + > + while (memcg != root) { > + memcg = parent_mem_cgroup(memcg); > + > + if (memcg == root_mem_cgroup) > + break; > + > + if (page_counter_read(&memcg->memory) > memcg->low) > + return false; > + } > + return true; > +} > + > #ifdef CONFIG_MEMCG_SWAP > /** > * mem_cgroup_swapout - transfer a memsw charge to swap > diff --git a/mm/vmscan.c b/mm/vmscan.c > index b89097185f46..f62ec654d4c5 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -91,6 +91,9 @@ struct scan_control { > /* Can pages be swapped as part of reclaim? */ > unsigned int may_swap:1; > > + /* Can cgroups be reclaimed below their normal consumption range? */ > + unsigned int may_thrash:1; > + > unsigned int hibernation_mode:1; > > /* One of the zones is ready for compaction */ > @@ -2333,6 +2336,12 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc, > struct lruvec *lruvec; > int swappiness; > > + if (mem_cgroup_low(root, memcg)) { > + if (!sc->may_thrash) > + continue; > + mem_cgroup_events(memcg, MEMCG_LOW, 1); > + } > + > lruvec = mem_cgroup_zone_lruvec(zone, memcg); > swappiness = mem_cgroup_swappiness(memcg); > scanned = sc->nr_scanned; > @@ -2360,8 +2369,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc, > mem_cgroup_iter_break(root, memcg); > break; > } > - memcg = mem_cgroup_iter(root, memcg, &reclaim); > - } while (memcg); > + } while ((memcg = mem_cgroup_iter(root, memcg, &reclaim))); > > /* > * Shrink the slab caches in the same proportion that > @@ -2559,10 +2567,11 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc) > static unsigned long do_try_to_free_pages(struct zonelist *zonelist, > struct scan_control *sc) > { > + int initial_priority = sc->priority; > unsigned long total_scanned = 0; > unsigned long writeback_threshold; > bool zones_reclaimable; > - > +retry: > delayacct_freepages_start(); > > if (global_reclaim(sc)) > @@ -2612,6 +2621,13 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist, > if (sc->compaction_ready) > return 1; > > + /* Untapped cgroup reserves? Don't OOM, retry. */ > + if (!sc->may_thrash) { > + sc->priority = initial_priority; > + sc->may_thrash = 1; > + goto retry; > + } > + > /* Any of the zones still reclaimable? Don't OOM. */ > if (zones_reclaimable) > return 1; > -- > 2.2.0 > -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>