Hi all,

this is a request for discussion (I hope we can touch on this during the memcg meeting at the upcoming KS). I have brought this up earlier this year before LSF (http://thread.gmane.org/gmane.linux.kernel.mm/60464). The patch got much smaller since then thanks to Johannes' excellent memcg naturalization work (http://thread.gmane.org/gmane.linux.kernel.mm/68724), which this is based on.

I realize that this will be controversial, but I would like to hear whether it is a strict no-go or whether we can go in that direction (the implementation might differ, of course). The patch is still half baked, but I guess it should be sufficient to show what I am trying to achieve.

The basic idea is that memcgs would get a new attribute (isolated) which would control whether the group should be considered during global reclaim. This means that we could achieve a certain memory isolation for the processes in the group from the rest of the system activity, which has traditionally been done by mlocking the important parts of memory. The mlock approach, however, has some disadvantages. First of all, it is an all-or-nothing approach: either the memory is important and mlocked, or you have no guarantee that it stays resident. Secondly, it is much more prone to OOM situations.

Let's consider a case where memory is evictable in theory, but you would pay quite a lot to get it back resident (pre-calculated data from a database, e.g. reports). The memory wouldn't be used very often, so it would be the number one candidate for eviction after some time. In such a case we would want something like a clever mlock, which would evict that memory only if the cgroup itself gets under memory pressure (e.g. during a peak workload). This is not hard to do if we are not overcommitting the memory, but things get tricky otherwise. With isolated memcgs we get exactly such a guarantee, because such memory would be reclaimed only from the hard limit reclaim path, or from the soft limit reclaim if it is set up.

Any thoughts/comments?
---
From: Michal Hocko <mhocko@xxxxxxx>
Subject: Implement isolated cgroups

This patch adds a new per-cgroup knob (isolated) which controls whether pages charged to the group should be considered for global reclaim, or whether they are reclaimed only during soft reclaim and under per-cgroup memory pressure. The value can be modified via the GROUP/memory.isolated knob.

The primary idea behind isolated cgroups is better isolation of a group from global system activity. At the moment, memory cgroups are mainly used to throttle the processes in a group by placing a cap on their memory usage. However, memory cgroups do not protect their (charged) memory from being evicted by the global reclaim, because all groups are considered during global reclaim. The feature provides an easy way to set up a mission-critical workload in a memory-isolated environment without the necessity of mlock. Thanks to per-cgroup reclaim we can even handle memory usage spikes much more gracefully, because a part of the working set can get reclaimed (rather than the workload getting OOM killed, as it could if mlock had been used). So we can look at the feature as an intelligent mlock (protect from external memory pressure, reclaim on internal pressure).

The implementation ignores the isolated status during the soft reclaim, which means that every isolated group can configure how much memory it is willing to sacrifice under global memory pressure. Soft-unlimited groups are isolated from the global memory pressure completely.
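
For illustration, a minimal usage sketch (assuming the memory controller is mounted at /sys/fs/cgroup/memory; the group name, the limit values, and $WORKLOAD_PID are made up for the example):

  # create a group for the mission critical workload
  mkdir /sys/fs/cgroup/memory/reports
  echo $WORKLOAD_PID > /sys/fs/cgroup/memory/reports/tasks
  # cap the group's own usage; the hard limit reclaim still works on
  # internal pressure
  echo 512M > /sys/fs/cgroup/memory/reports/memory.limit_in_bytes
  # exclude the group from the global reclaim
  echo true > /sys/fs/cgroup/memory/reports/memory.isolated
  # optionally let the group sacrifice memory above 256M to the global
  # memory pressure via the soft limit reclaim
  echo 256M > /sys/fs/cgroup/memory/reports/memory.soft_limit_in_bytes
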
Please note that the feature has to be used with caution, because isolated groups will put a bigger reclaim pressure on the non-isolated cgroups.

The implementation is really simple, because we just have to hook into shrink_zone and exclude isolated groups when we are doing the global reclaim.

Signed-off-by: Michal Hocko <mhocko@xxxxxxx>

TODO:
- consider hierarchies - I am not sure whether we want to have an inconsistent isolated status within a hierarchy - probably not
- handle the root cgroup
- do we want some checks whether the current setting is safe?
- is bool sufficient? Don't we rather want something like a priority instead?

 include/linux/memcontrol.h |    7 +++++++
 mm/memcontrol.c            |   44 ++++++++++++++++++++++++++++++++++++++++++++
 mm/vmscan.c                |    8 +++++++-
 3 files changed, 58 insertions(+), 1 deletion(-)

Index: linux-3.1-rc4-next-20110831-mmotm-isolated-memcg/mm/memcontrol.c
===================================================================
--- linux-3.1-rc4-next-20110831-mmotm-isolated-memcg.orig/mm/memcontrol.c
+++ linux-3.1-rc4-next-20110831-mmotm-isolated-memcg/mm/memcontrol.c
@@ -258,6 +258,9 @@ struct mem_cgroup {
 	/* set when res.limit == memsw.limit */
 	bool memsw_is_minimum;
 
+	/* is the group isolated from the global memory pressure? */
+	bool isolated;
+
 	/* protect arrays of thresholds */
 	struct mutex thresholds_lock;
 
@@ -287,6 +290,11 @@ struct mem_cgroup {
 	spinlock_t pcp_counter_lock;
 };
 
+bool mem_cgroup_isolated(struct mem_cgroup *mem)
+{
+	return mem->isolated;
+}
+
 /* Stuffs for move charges at task migration. */
 /*
  * Types of charges to be moved. "move_charge_at_immitgrate" is treated as a
@@ -4561,6 +4569,37 @@ static int mem_control_numa_stat_open(st
 }
 #endif /* CONFIG_NUMA */
 
+static int mem_cgroup_isolated_write(struct cgroup *cgrp, struct cftype *cft,
+				const char *buffer)
+{
+	int ret = -EINVAL;
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
+
+	if (mem_cgroup_is_root(mem))
+		goto out;
+
+	if (!strcasecmp(buffer, "true"))
+		mem->isolated = true;
+	else if (!strcasecmp(buffer, "false"))
+		mem->isolated = false;
+	else
+		goto out;
+
+	ret = 0;
out:
+	return ret;
+}
+
+static int mem_cgroup_isolated_read(struct cgroup *cgrp, struct cftype *cft,
+				struct seq_file *seq)
+{
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
+
+	seq_puts(seq, mem->isolated ? "true" : "false");
+
+	return 0;
+}
+
 static struct cftype mem_cgroup_files[] = {
 	{
 		.name = "usage_in_bytes",
@@ -4624,6 +4663,11 @@ static struct cftype mem_cgroup_files[]
 		.unregister_event = mem_cgroup_oom_unregister_event,
 		.private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL),
 	},
+	{
+		.name = "isolated",
+		.write_string = mem_cgroup_isolated_write,
+		.read_seq_string = mem_cgroup_isolated_read,
+	},
 #ifdef CONFIG_NUMA
 	{
 		.name = "numa_stat",
Index: linux-3.1-rc4-next-20110831-mmotm-isolated-memcg/include/linux/memcontrol.h
===================================================================
--- linux-3.1-rc4-next-20110831-mmotm-isolated-memcg.orig/include/linux/memcontrol.h
+++ linux-3.1-rc4-next-20110831-mmotm-isolated-memcg/include/linux/memcontrol.h
@@ -165,6 +165,9 @@ void mem_cgroup_split_huge_fixup(struct
 bool mem_cgroup_bad_page_check(struct page *page);
 void mem_cgroup_print_bad_page(struct page *page);
 #endif
+
+bool mem_cgroup_isolated(struct mem_cgroup *mem);
+
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct mem_cgroup;
 
@@ -382,6 +385,10 @@ static inline
 void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx)
 {
 }
+static inline bool mem_cgroup_isolated(struct mem_cgroup *mem)
+{
+	return false;
+}
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
 #if !defined(CONFIG_CGROUP_MEM_RES_CTLR) || !defined(CONFIG_DEBUG_VM)
Index: linux-3.1-rc4-next-20110831-mmotm-isolated-memcg/mm/vmscan.c
===================================================================
--- linux-3.1-rc4-next-20110831-mmotm-isolated-memcg.orig/mm/vmscan.c
+++ linux-3.1-rc4-next-20110831-mmotm-isolated-memcg/mm/vmscan.c
@@ -2109,7 +2109,13 @@ static void shrink_zone(int priority, st
 		.zone = zone,
 	};
 
-	shrink_mem_cgroup_zone(priority, &mz, sc);
+	/*
+	 * Do not reclaim from an isolated group if we are in
+	 * the global reclaim.
+	 */
+	if (!(mem_cgroup_isolated(mem) && global_reclaim(sc)))
+		shrink_mem_cgroup_zone(priority, &mz, sc);
+
 	/*
 	 * Limit reclaim has historically picked one memcg and
 	 * scanned it with decreasing priority levels until
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic