(2013/01/04 17:29), Anton Vorontsov wrote:
> This commit implements David Rientjes' idea of mempressure cgroup.
>
> The main characteristics are the same to what I've tried to add to vmevent
> API; internally, it uses Mel Gorman's idea of scanned/reclaimed ratio for
> pressure index calculation. But we don't expose the index to the userland.
> Instead, there are three levels of the pressure:
>
>  o low (just reclaiming, e.g. caches are draining);
>  o medium (allocation cost becomes high, e.g. swapping);
>  o oom (about to oom very soon).
>
> The rationale behind exposing levels and not the raw pressure index
> described here: http://lkml.org/lkml/2012/11/16/675
>
> For a task it is possible to be in both cpusets, memcg and mempressure
> cgroups, so by rearranging the tasks it is possible to watch a specific
> pressure (i.e. caused by cpuset and/or memcg).
>
> Note that while this adds the cgroups support, the code is well separated
> and eventually we might add a lightweight, non-cgroups API, i.e. vmevent.
> But this is another story.
>
> Signed-off-by: Anton Vorontsov <anton.vorontsov@xxxxxxxxxx>

I'm just curious..

> ---
>  Documentation/cgroups/mempressure.txt |  50 ++++++
>  include/linux/cgroup_subsys.h         |   6 +
>  include/linux/vmstat.h                |  11 ++
>  init/Kconfig                          |  12 ++
>  mm/Makefile                           |   1 +
>  mm/mempressure.c                      | 330 ++++++++++++++++++++++++++++++++++
>  mm/vmscan.c                           |   4 +
>  7 files changed, 414 insertions(+)
>  create mode 100644 Documentation/cgroups/mempressure.txt
>  create mode 100644 mm/mempressure.c
>
> diff --git a/Documentation/cgroups/mempressure.txt b/Documentation/cgroups/mempressure.txt
> new file mode 100644
> index 0000000..dbc0aca
> --- /dev/null
> +++ b/Documentation/cgroups/mempressure.txt
> @@ -0,0 +1,50 @@
> +  Memory pressure cgroup
> +~~~~~~~~~~~~~~~~~~~~~~~~~~
> +  Before using the mempressure cgroup, make sure you have it mounted:
> +
> +   # cd /sys/fs/cgroup/
> +   # mkdir mempressure
> +   # mount -t cgroup cgroup ./mempressure -o mempressure
> +
> +  It is possible to combine cgroups, for example you can mount memory
> +  (memcg) and mempressure cgroups together:
> +
> +   # mount -t cgroup cgroup ./mempressure -o memory,mempressure
> +
> +  That way the reported pressure will honour memory cgroup limits. The
> +  same goes for cpusets.
> +
> +  After the hierarchy is mounted, you can use the following API:
> +
> +  /sys/fs/cgroup/.../mempressure.level
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +  To maintain the interactivity/memory allocation cost, one can use the
> +  pressure level notifications, and the levels are defined like this:
> +
> +  The "low" level means that the system is reclaiming memory for new
> +  allocations. Monitoring reclaiming activity might be useful for
> +  maintaining overall system's cache level. Upon notification, the program
> +  (typically "Activity Manager") might analyze vmstat and act in advance
> +  (i.e. prematurely shutdown unimportant services).
> +
> +  The "medium" level means that the system is experiencing medium memory
> +  pressure, there is some mild swapping activity. Upon this event
> +  applications may decide to free any resources that can be easily
> +  reconstructed or re-read from a disk.
> +
> +  The "oom" level means that the system is actively thrashing, it is about
> +  to out of memory (OOM) or even the in-kernel OOM killer is on its way to
> +  trigger. Applications should do whatever they can to help the system.
> +
> +  Event control:
> +    Is used to setup an eventfd with a level threshold. The argument to
> +    the event control specifies the level threshold.
> +  Read:
> +    Reads mempory presure levels: low, medium or oom.
> +  Write:
> +    Not implemented.
> +  Test:
> +    To set up a notification:
> +
> +    # cgroup_event_listener ./mempressure.level low
> +    ("low", "medium", "oom" are permitted.)
> diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
> index f204a7a..b9802e2 100644
> --- a/include/linux/cgroup_subsys.h
> +++ b/include/linux/cgroup_subsys.h
> @@ -37,6 +37,12 @@ SUBSYS(mem_cgroup)
>
>  /* */
>
> +#if IS_SUBSYS_ENABLED(CONFIG_CGROUP_MEMPRESSURE)
> +SUBSYS(mpc_cgroup)
> +#endif
> +
> +/* */
> +
>  #if IS_SUBSYS_ENABLED(CONFIG_CGROUP_DEVICE)
>  SUBSYS(devices)
>  #endif
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index a13291f..c1a66c7 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -10,6 +10,17 @@
>
>  extern int sysctl_stat_interval;
>
> +struct mem_cgroup;
> +#ifdef CONFIG_CGROUP_MEMPRESSURE
> +extern void vmpressure(struct mem_cgroup *memcg,
> +		       ulong scanned, ulong reclaimed);
> +extern void vmpressure_prio(struct mem_cgroup *memcg, int prio);
> +#else
> +static inline void vmpressure(struct mem_cgroup *memcg,
> +			      ulong scanned, ulong reclaimed) {}
> +static inline void vmpressure_prio(struct mem_cgroup *memcg, int prio) {}
> +#endif
> +
>  #ifdef CONFIG_VM_EVENT_COUNTERS
>  /*
>   * Light weight per cpu counter implementation.
> diff --git a/init/Kconfig b/init/Kconfig
> index 7d30240..d526249 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -891,6 +891,18 @@ config MEMCG_KMEM
>  	  the kmem extension can use it to guarantee that no group of processes
>  	  will ever exhaust kernel resources alone.
>
> +config CGROUP_MEMPRESSURE
> +	bool "Memory pressure monitor for Control Groups"
> +	help
> +	  The memory pressure monitor cgroup provides a facility for
> +	  userland programs so that they could easily assist the kernel
> +	  with the memory management. So far the API provides simple,
> +	  levels-based memory pressure notifications.
> +
> +	  For more information see Documentation/cgroups/mempressure.txt
> +
> +	  If unsure, say N.
> +
>  config CGROUP_HUGETLB
>  	bool "HugeTLB Resource Controller for Control Groups"
>  	depends on RESOURCE_COUNTERS && HUGETLB_PAGE && EXPERIMENTAL
> diff --git a/mm/Makefile b/mm/Makefile
> index 3a46287..e69bbda 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -51,6 +51,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
>  obj-$(CONFIG_QUICKLIST) += quicklist.o
>  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
>  obj-$(CONFIG_MEMCG) += memcontrol.o page_cgroup.o
> +obj-$(CONFIG_CGROUP_MEMPRESSURE) += mempressure.o
>  obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
>  obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
>  obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
> diff --git a/mm/mempressure.c b/mm/mempressure.c
> new file mode 100644
> index 0000000..ea312bb
> --- /dev/null
> +++ b/mm/mempressure.c
> @@ -0,0 +1,330 @@
> +/*
> + * Linux VM pressure
> + *
> + * Copyright 2012 Linaro Ltd.
> + *		  Anton Vorontsov <anton.vorontsov@xxxxxxxxxx>
> + *
> + * Based on ideas from Andrew Morton, David Rientjes, KOSAKI Motohiro,
> + * Leonid Moiseichuk, Mel Gorman, Minchan Kim and Pekka Enberg.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published
> + * by the Free Software Foundation.
> + */
> +
> +#include <linux/cgroup.h>
> +#include <linux/fs.h>
> +#include <linux/sched.h>
> +#include <linux/mm.h>
> +#include <linux/vmstat.h>
> +#include <linux/eventfd.h>
> +#include <linux/swap.h>
> +#include <linux/printk.h>
> +
> +static void mpc_vmpressure(struct mem_cgroup *memcg, ulong s, ulong r);
> +
> +/*
> + * Generic VM Pressure routines (no cgroups or any other API details)
> + */
> +
> +/*
> + * The window size is the number of scanned pages before we try to analyze
> + * the scanned/reclaimed ratio (or difference).
> + *
> + * It is used as a rate-limit tunable for the "low" level notification,
> + * and for averaging medium/oom levels. Using small window sizes can cause
> + * lot of false positives, but too big window size will delay the
> + * notifications.
> + */
> +static const uint vmpressure_win = SWAP_CLUSTER_MAX * 16;
> +static const uint vmpressure_level_med = 60;
> +static const uint vmpressure_level_oom = 99;
> +static const uint vmpressure_level_oom_prio = 4;
> +

Hmm... isn't this window size too small ?

If vmscan cannot find a reclaimable page while scanning 2M of pages in a zone,
oom notify will be returned. Right ?

Thanks,
-Kame
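
For reference, the 2M figure above follows from the quoted constants. The sketch below is only an illustration under stated assumptions, not code from the patch: it assumes SWAP_CLUSTER_MAX is 32 and that pages are 4096 bytes, and it assumes the pressure index is the percentage of scanned pages that were not reclaimed, compared against the quoted thresholds with ">=". The helper names window_bytes(), pressure_index() and classify() are hypothetical; the part of mm/mempressure.c that actually computes the level is not quoted in this message.

```c
#include <assert.h>

/* Assumption: SWAP_CLUSTER_MAX is 32, as in include/linux/swap.h. */
#define SWAP_CLUSTER_MAX 32UL

/* Constants as quoted from the patch. */
const unsigned long vmpressure_win = SWAP_CLUSTER_MAX * 16; /* 512 pages */
const unsigned int vmpressure_level_med = 60;
const unsigned int vmpressure_level_oom = 99;

enum level { LEVEL_LOW, LEVEL_MEDIUM, LEVEL_OOM };

/* Window size in bytes for a given page size (hypothetical helper). */
unsigned long window_bytes(unsigned long page_size)
{
	assert(vmpressure_win == 512);
	return vmpressure_win * page_size;
}

/*
 * Assumed pressure index: percentage of scanned pages that were NOT
 * reclaimed. This is a guess at the "scanned/reclaimed ratio" idea,
 * not necessarily the formula used in the patch.
 */
unsigned int pressure_index(unsigned long scanned, unsigned long reclaimed)
{
	if (scanned == 0)
		return 0;
	if (reclaimed > scanned)
		reclaimed = scanned;
	return (unsigned int)(100 - reclaimed * 100 / scanned);
}

/* Map the assumed index onto the three levels (">=" is an assumption). */
enum level classify(unsigned long scanned, unsigned long reclaimed)
{
	unsigned int p = pressure_index(scanned, reclaimed);

	if (p >= vmpressure_level_oom)
		return LEVEL_OOM;
	if (p >= vmpressure_level_med)
		return LEVEL_MEDIUM;
	return LEVEL_LOW;
}
```

Under these assumptions one window is 512 pages, i.e. 2 MiB with 4 KiB pages, and classify(vmpressure_win, 0) reports the oom level after a single window in which nothing was reclaimed.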