On Wed, Oct 19, 2011 at 6:33 PM, Michal Hocko <mhocko@xxxxxxx> wrote:
> Hi all,
> this is a request for discussion (I hope we can touch on this during the
> memcg meeting at the upcoming KS). I brought this up earlier this year
> before LSF (http://thread.gmane.org/gmane.linux.kernel.mm/60464).
> The patch got much smaller since then thanks to Johannes' excellent memcg
> naturalization work (http://thread.gmane.org/gmane.linux.kernel.mm/68724),
> which this is based on.
> I realize that this will be controversial but I would like to hear
> whether this is a strict no-go or whether we can go in that direction
> (the implementation might differ of course).
>
> The patch is still half baked but I guess it should be sufficient to
> show what I am trying to achieve.
> The basic idea is that memcgs would get a new attribute (isolated) which
> would control whether the group should be considered during global
> reclaim.
> This means that we could achieve a certain memory isolation for
> processes in the group from the rest of the system activity, which has
> traditionally been done by mlocking the important parts of memory.
> That approach, however, has some disadvantages. First of all, it is an
> all or nothing type of approach. Either the memory is important and
> mlocked or you have no guarantee that it stays resident.
> Secondly, it is much more prone to OOM situations.
> Let's consider a case where memory is evictable in theory but you would
> pay quite a lot to get it back resident (pre-calculated data from a
> database - e.g. reports). The memory wouldn't be used very often, so it
> would be the number one candidate for eviction after some time.
> In such a case we would want something like a clever mlock which would
> evict that memory only if the cgroup itself gets under memory pressure
> (e.g. a peak workload). This is not hard to do if we are not
> over-committing the memory, but things get tricky otherwise.
> With isolated memcgs we get exactly such a guarantee because we would
> reclaim such memory only from the hard limit reclaim path or from the
> soft limit reclaim if it is set up.
>
> Any thoughts/comments?
>
> ---
> From: Michal Hocko <mhocko@xxxxxxx>
> Subject: Implement isolated cgroups
>
> This patch adds a new per-cgroup knob (isolated) which controls whether
> pages charged to the group should be considered for global reclaim or
> reclaimed only during soft limit reclaim and under per-cgroup memory
> pressure.
>
> The value can be modified via the GROUP/memory.isolated knob.
>
> The primary idea behind isolated cgroups is better isolation of a group
> from global system activity. At the moment, memory cgroups are mainly
> used to throttle processes in a group by placing a cap on their memory
> usage. However, memory cgroups don't protect their (charged) memory from
> being evicted by the global reclaim, because all groups are considered
> during global reclaim.
>
> The feature will provide an easy way to set up a mission critical
> workload in a memory isolated environment without the necessity of
> mlock. Due to per-cgroup reclaim we can even handle memory usage spikes
> much more gracefully, because a part of the working set can get
> reclaimed (rather than the workload getting OOM killed, as it would if
> mlock had been used). So we can look at the feature as an intelligent
> mlock (protect from external memory pressure and reclaim on internal
> pressure).
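For illustration, driving the proposed knob from userspace could look
roughly like the sketch below, assuming the patch is applied and the
v1 memory controller is mounted at /sys/fs/cgroup/memory. The mount
point, group name, 512M limit and pid 1234 are made-up example values,
not part of the patch:

/*
 * Sketch only: create a memcg, cap it, move a task into it and mark it
 * isolated.  All paths and values below are examples.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>

static void write_knob(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f || fputs(val, f) == EOF) {
                perror(path);
                exit(1);
        }
        fclose(f);
}

int main(void)
{
        /* create the group; fails harmlessly if it already exists */
        mkdir("/sys/fs/cgroup/memory/important", 0755);

        /* hard limit, parsed with the usual K/M/G suffixes */
        write_knob("/sys/fs/cgroup/memory/important/memory.limit_in_bytes",
                   "512M");

        /* move the workload (example pid) into the group */
        write_knob("/sys/fs/cgroup/memory/important/tasks", "1234");

        /* the new knob from this patch */
        write_knob("/sys/fs/cgroup/memory/important/memory.isolated", "true");
        return 0;
}

After that, memory charged to "important" would only be reclaimed by the
group's own hard (or soft) limit pressure, as described above.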
>
> The implementation ignores the isolated status during soft limit
> reclaim, which means that every isolated group can configure how much
> memory it is willing to sacrifice under global memory pressure. Groups
> without a soft limit are isolated from global memory pressure
> completely.
>
> Please note that the feature has to be used with caution because
> isolated groups will put more reclaim pressure on non-isolated cgroups.
>
> The implementation is really simple because we just have to hook into
> shrink_zone and exclude isolated groups when we are doing global
> reclaim.
>
> Signed-off-by: Michal Hocko <mhocko@xxxxxxx>
>
> TODO
> - consider hierarchies - I am not sure whether we want to allow an
>   inconsistent isolated status within a hierarchy - probably not
> - handle root cgroup
> - do we want some checks whether the current setting is safe?
> - is bool sufficient? Don't we rather want something like a priority
>   instead?
>
>
>  include/linux/memcontrol.h |    7 +++++++
>  mm/memcontrol.c            |   44 ++++++++++++++++++++++++++++++++++++++++++++
>  mm/vmscan.c                |    8 +++++++-
>  3 files changed, 58 insertions(+), 1 deletion(-)
>
> Index: linux-3.1-rc4-next-20110831-mmotm-isolated-memcg/mm/memcontrol.c
> ===================================================================
> --- linux-3.1-rc4-next-20110831-mmotm-isolated-memcg.orig/mm/memcontrol.c
> +++ linux-3.1-rc4-next-20110831-mmotm-isolated-memcg/mm/memcontrol.c
> @@ -258,6 +258,9 @@ struct mem_cgroup {
>  	/* set when res.limit == memsw.limit */
>  	bool memsw_is_minimum;
>
> +	/* is the group isolated from the global memory pressure? */
> +	bool isolated;
> +
>  	/* protect arrays of thresholds */
>  	struct mutex thresholds_lock;
>
> @@ -287,6 +290,11 @@ struct mem_cgroup {
>  	spinlock_t pcp_counter_lock;
>  };
>
> +bool mem_cgroup_isolated(struct mem_cgroup *mem)
> +{
> +	return mem->isolated;
> +}
> +
>  /* Stuffs for move charges at task migration. */
>  /*
>   * Types of charges to be moved. "move_charge_at_immitgrate" is treated as a
> @@ -4561,6 +4569,37 @@ static int mem_control_numa_stat_open(st
>  }
>  #endif /* CONFIG_NUMA */
>
> +static int mem_cgroup_isolated_write(struct cgroup *cgrp, struct cftype *cft,
> +				const char *buffer)
> +{
> +	int ret = -EINVAL;
> +	struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
> +
> +	if (mem_cgroup_is_root(mem))
> +		goto out;
> +
> +	if (!strcasecmp(buffer, "true"))
> +		mem->isolated = true;
> +	else if (!strcasecmp(buffer, "false"))
> +		mem->isolated = false;
> +	else
> +		goto out;
> +
> +	ret = 0;
> +out:
> +	return ret;
> +}
> +
> +static int mem_cgroup_isolated_read(struct cgroup *cgrp, struct cftype *cft,
> +				struct seq_file *seq)
> +{
> +	struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
> +
> +	seq_puts(seq, (mem->isolated) ? "true" : "false");
> +
> +	return 0;
> +}
> +
>  static struct cftype mem_cgroup_files[] = {
>  	{
>  		.name = "usage_in_bytes",
> @@ -4624,6 +4663,11 @@ static struct cftype mem_cgroup_files[]
>  		.unregister_event = mem_cgroup_oom_unregister_event,
>  		.private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL),
>  	},
> +	{
> +		.name = "isolated",
> +		.write_string = mem_cgroup_isolated_write,
> +		.read_seq_string = mem_cgroup_isolated_read,
> +	},
>  #ifdef CONFIG_NUMA
>  	{
>  		.name = "numa_stat",
> Index: linux-3.1-rc4-next-20110831-mmotm-isolated-memcg/include/linux/memcontrol.h
> ===================================================================
> --- linux-3.1-rc4-next-20110831-mmotm-isolated-memcg.orig/include/linux/memcontrol.h
> +++ linux-3.1-rc4-next-20110831-mmotm-isolated-memcg/include/linux/memcontrol.h
> @@ -165,6 +165,9 @@ void mem_cgroup_split_huge_fixup(struct
>  bool mem_cgroup_bad_page_check(struct page *page);
>  void mem_cgroup_print_bad_page(struct page *page);
>  #endif
> +
> +bool mem_cgroup_isolated(struct mem_cgroup *mem);
> +
>  #else /* CONFIG_CGROUP_MEM_RES_CTLR */
>  struct mem_cgroup;
>
> @@ -382,6 +385,10 @@ static inline
>  void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx)
>  {
>  }
> +static inline bool mem_cgroup_isolated(struct mem_cgroup *mem)
> +{
> +	return false;
> +}
>  #endif /* CONFIG_CGROUP_MEM_CONT */
>
>  #if !defined(CONFIG_CGROUP_MEM_RES_CTLR) || !defined(CONFIG_DEBUG_VM)
> Index: linux-3.1-rc4-next-20110831-mmotm-isolated-memcg/mm/vmscan.c
> ===================================================================
> --- linux-3.1-rc4-next-20110831-mmotm-isolated-memcg.orig/mm/vmscan.c
> +++ linux-3.1-rc4-next-20110831-mmotm-isolated-memcg/mm/vmscan.c
> @@ -2109,7 +2109,13 @@ static void shrink_zone(int priority, st
>  		.zone = zone,
>  	};
>
> -	shrink_mem_cgroup_zone(priority, &mz, sc);
> +	/*
> +	 * Do not reclaim from an isolated group if we are in
> +	 * the global reclaim.
> +	 */
> +	if (!(mem_cgroup_isolated(mem) && global_reclaim(sc)))
> +		shrink_mem_cgroup_zone(priority, &mz, sc);
> +
>  	/*
>  	 * Limit reclaim has historically picked one memcg and
>  	 * scanned it with decreasing priority levels until
> --
> Michal Hocko
> SUSE Labs
> SUSE LINUX s.r.o.
> Lihovarska 1060/12
> 190 00 Praha 9
> Czech Republic
>

Hi Michal:

I didn't read through the patch itself, only the description. If we want
to protect a memcg from being reclaimed under global memory pressure, I
think we can approach it by changing the soft_limit reclaim. I have a
soft_limit change built on top of Johannes's patchset which does
basically soft_limit aware reclaim under global memory pressure. The
implementation is simple, and I am looking forward to discussing it more
with you guys at the conference.
--Ying
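To make the soft_limit alternative concrete, the decision in
shrink_zone() could look roughly like the sketch below. This is only an
illustration of the idea, not the actual change;
mem_cgroup_over_soft_limit() is a made-up helper, while global_reclaim()
and DEF_PRIORITY are the existing vmscan symbols also used in the hunk
above:

/*
 * Illustration only: decide whether global reclaim should scan a memcg,
 * preferring groups that exceed their soft limit and exempting the rest
 * until reclaim priority gets desperate.  mem_cgroup_over_soft_limit()
 * is a hypothetical helper, not an existing function.
 */
static bool should_scan_mem_cgroup(struct mem_cgroup *mem,
                                   struct scan_control *sc, int priority)
{
        /* hard limit (memcg target) reclaim always scans its group */
        if (!global_reclaim(sc))
                return true;

        /* global pressure: groups above their soft limit go first */
        if (mem_cgroup_over_soft_limit(mem))
                return true;

        /* stop exempting anybody once we are getting close to OOM */
        return priority <= DEF_PRIORITY / 2;
}

shrink_zone() would then call shrink_mem_cgroup_zone() only when this
returns true, much like the mem_cgroup_isolated() check in the vmscan.c
hunk above, but driven by the soft limit instead of a new knob.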