On Fri, 4 Jan 2013 00:29:11 -0800 Anton Vorontsov <anton.vorontsov@xxxxxxxxxx> wrote:

> This commit implements David Rientjes' idea of mempressure cgroup.
>
> The main characteristics are the same to what I've tried to add to vmevent
> API; internally, it uses Mel Gorman's idea of scanned/reclaimed ratio for
> pressure index calculation. But we don't expose the index to the userland.
> Instead, there are three levels of the pressure:
>
>  o low (just reclaiming, e.g. caches are draining);
>  o medium (allocation cost becomes high, e.g. swapping);
>  o oom (about to oom very soon).
>
> The rationale behind exposing levels and not the raw pressure index
> described here: http://lkml.org/lkml/2012/11/16/675
>
> For a task it is possible to be in both cpusets, memcg and mempressure
> cgroups, so by rearranging the tasks it is possible to watch a specific
> pressure (i.e. caused by cpuset and/or memcg).
>
> Note that while this adds the cgroups support, the code is well separated
> and eventually we might add a lightweight, non-cgroups API, i.e. vmevent.
> But this is another story.
>

I'd have thought that it's pretty important to offer this feature to
non-cgroups setups.  Restricting it to cgroups-only seems a large
limitation.

> diff --git a/mm/mempressure.c b/mm/mempressure.c
> new file mode 100644
> index 0000000..ea312bb
> --- /dev/null
> +++ b/mm/mempressure.c
> @@ -0,0 +1,330 @@
> +/*
> + * Linux VM pressure
> + *
> + * Copyright 2012 Linaro Ltd.
> + *		  Anton Vorontsov <anton.vorontsov@xxxxxxxxxx>
> + *
> + * Based on ideas from Andrew Morton, David Rientjes, KOSAKI Motohiro,
> + * Leonid Moiseichuk, Mel Gorman, Minchan Kim and Pekka Enberg.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published
> + * by the Free Software Foundation.
> + */
> +
> +#include <linux/cgroup.h>
> +#include <linux/fs.h>
> +#include <linux/sched.h>
> +#include <linux/mm.h>
> +#include <linux/vmstat.h>
> +#include <linux/eventfd.h>
> +#include <linux/swap.h>
> +#include <linux/printk.h>
> +
> +static void mpc_vmpressure(struct mem_cgroup *memcg, ulong s, ulong r);

mm/ doesn't use uint or ulong.  In fact I can find zero uses of either in
all of mm/.

I don't have a problem with them personally - they're short and clear.
But we just ... don't do that.  Perhaps we should start using them.

> +
> +/*
> + * Generic VM Pressure routines (no cgroups or any other API details)
> + */
> +
> +/*
> + * The window size is the number of scanned pages before we try to analyze
> + * the scanned/reclaimed ratio (or difference).
> + *
> + * It is used as a rate-limit tunable for the "low" level notification,
> + * and for averaging medium/oom levels. Using small window sizes can cause
> + * lot of false positives, but too big window size will delay the
> + * notifications.
> + */
> +static const uint vmpressure_win = SWAP_CLUSTER_MAX * 16;
> +static const uint vmpressure_level_med = 60;
> +static const uint vmpressure_level_oom = 99;
> +static const uint vmpressure_level_oom_prio = 4;
> +
> +enum vmpressure_levels {
> +	VMPRESSURE_LOW = 0,
> +	VMPRESSURE_MEDIUM,
> +	VMPRESSURE_OOM,

VMPRESSURE_OOM seems an odd-man-out.  VMPRESSURE_HIGH would be pleasing.

> +	VMPRESSURE_NUM_LEVELS,
> +};
> +
>
> ...
>
> +static void mpc_vmpressure(struct mem_cgroup *memcg, ulong s, ulong r)
> +{
> +	/*
> +	 * There are two options for implementing cgroup pressure
> +	 * notifications:
> +	 *
> +	 * - Store pressure counter atomically in the task struct. Upon
> +	 *   hitting 'window' wake up a workqueue that will walk every
> +	 *   task and sum per-thread pressure into cgroup pressure (to
> +	 *   which the task belongs).
> +	 *   The cons are obvious: bloats task
> +	 *   struct, have to walk all processes and makes pressue less
> +	 *   accurate (the window becomes per-thread);
> +	 *
> +	 * - Store pressure counters in per-cgroup state. This is easy and
> +	 *   straightforward, and that's how we do things here. But this
> +	 *   requires us to not put the vmpressure hooks into hotpath,
> +	 *   since we have to grab some locks.
> +	 */
> +
> +#ifdef CONFIG_MEMCG
> +	if (memcg) {
> +		struct cgroup_subsys_state *css = mem_cgroup_css(memcg);
> +		struct cgroup *cg = css->cgroup;
> +		struct mpc_state *mpc = cg2mpc(cg);
> +
> +		if (mpc)
> +			__mpc_vmpressure(mpc, s, r);
> +		return;
> +	}
> +#endif
> +	task_lock(current);
> +	__mpc_vmpressure(tsk2mpc(current), s, r);
> +	task_unlock(current);
> +}

The task_lock() is mysterious.  What's it protecting?  That's unobvious
and afaict undocumented.

Also it is buggy: __mpc_vmpressure() does mutex_lock().
Documentation/SubmitChecklist section 12 has handy hints!

>
> ...
>
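For readers following the thread: the quoted changelog says the levels are derived from a scanned/reclaimed ratio, but the calculation itself sits in the elided ("...") part of the patch. Below is a minimal user-space sketch of how such a mapping might look. The thresholds (60 and 99) and the enum names are copied from the quoted code; the formula and the helper names pressure_index() and pressure_level() are assumptions made for illustration, not the patch's implementation. The sketch also ignores vmpressure_level_oom_prio.

```c
/* Thresholds as quoted from the patch (with plain C types). */
static const unsigned int vmpressure_level_med = 60;
static const unsigned int vmpressure_level_oom = 99;

enum vmpressure_levels {
	VMPRESSURE_LOW = 0,
	VMPRESSURE_MEDIUM,
	VMPRESSURE_OOM,
	VMPRESSURE_NUM_LEVELS,
};

/*
 * ASSUMPTION: the index is the percentage of scanned pages that were
 * NOT reclaimed within one window.  The real calculation is elided in
 * the quoted patch; this is only one plausible reading of
 * "scanned/reclaimed ratio".
 */
static unsigned int pressure_index(unsigned long scanned,
				   unsigned long reclaimed)
{
	/* Nothing scanned, or everything reclaimed: no pressure. */
	if (!scanned || reclaimed >= scanned)
		return 0;

	/* Widen before multiplying so large counts cannot overflow. */
	return (unsigned int)((unsigned long long)(scanned - reclaimed) *
			      100 / scanned);
}

/* Map a 0..100 index onto the three levels exposed to userland. */
static enum vmpressure_levels pressure_level(unsigned int index)
{
	if (index >= vmpressure_level_oom)
		return VMPRESSURE_OOM;
	if (index >= vmpressure_level_med)
		return VMPRESSURE_MEDIUM;
	return VMPRESSURE_LOW;
}
```

Under these assumptions, with a window of 512 scanned pages (SWAP_CLUSTER_MAX * 16, if SWAP_CLUSTER_MAX is 32), reclaiming 256 pages gives an index of 50 (low), reclaiming 128 gives 75 (medium), and reclaiming 5 or fewer gives 99 or more (oom).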