On Thu, Feb 2, 2012 at 5:26 PM, Wu Fengguang <fengguang.wu@xxxxxxxxx> wrote:
> On Thu, Feb 02, 2012 at 04:42:09PM +0100, Jan Kara wrote:
>> On Thu 02-02-12 19:04:34, Wu Fengguang wrote:
>> > If memcg A's dirty rate is throttled, its dirty pages will naturally
>> > shrink. The flusher will automatically work less on A's dirty pages.
>>   I'm not sure about details of requirements Google guys have. So this may
>> or may not be good enough for them. I'd suspect they still wouldn't want
>> one cgroup to fill up available page cache with dirty pages so just
>> limiting bandwidth won't be enough for them. Also limiting dirty
>> bandwidth has a problem that it's not coupled with how much reading the
>> particular cgroup does. Anyway, until we are sure about their exact
>> requirements, this is mostly philosophical talking ;).
>
> Yeah, I'm not sure what exactly Google needs and how big a problem the
> partition will be for them. Basically,
>
> - when there are N memcg each dirtying 1 file, each file will be
>   flushed every (N * 0.5) seconds, where 0.5s is the typical time
>
> - if (memcg_dirty_limit > 10 * bdi_bandwidth), the dd tasks should be
>   able to progress reasonably smoothly
>
> Thanks,
> Fengguang

I am looking for a solution that partitions memory and, ideally, disk
bandwidth. This is a large undertaking, and I am willing to start small
and grow into a more sophisticated solution if needed.

One important goal is to enforce per-container memory limits, covering
both dirty and clean page cache. Moving memcg dirty pages to the root
cgroup is probably not going to work, because it would not allow control
of a job's memory usage. My hunch is that we will therefore need
per-memcg dirty counters, per-memcg dirty limits, and some writeback
changes. Perhaps the initial writeback changes could be small: enough to
ensure that writeback keeps writing until it has serviced any over-limit
cgroups. This is complicated by the fact that a memcg can have dirty
memory spread across different bdis.
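To make the direction concrete, here is a rough sketch of what a
per-memcg dirty limit interface might look like. Note that
memory.dirty_limit_in_bytes is hypothetical: it does not exist in
mainline kernels, and the name and 20M value are only illustrative of
the proposal above.

```shell
# Hypothetical sketch only: memory.dirty_limit_in_bytes is a made-up
# control file illustrating a proposed per-memcg dirty limit; it does
# not exist in mainline kernels. Paths assume cgroupfs at /dev/cgroup.
mkdir /dev/cgroup/memory/job1
echo 100M > /dev/cgroup/memory/job1/memory.limit_in_bytes
# Cap the job's dirty page cache well below its memory limit, so that
# direct reclaim inside the memcg can always find clean pages to drop
# instead of hitting a wall of dirty pages and OOMing.
echo 20M > /dev/cgroup/memory/job1/memory.dirty_limit_in_bytes
echo $$ > /dev/cgroup/memory/job1/tasks
```

The point of such a knob would be to bound how much of the memcg's
charge can be dirty at once, independent of the global dirty limits.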
If blk bandwidth throttling is sufficient here, then let me know,
because it sounds easier ;)

Here is an example of a memcg OOM seen on a 3.3 kernel:

  # mkdir /dev/cgroup/memory/x
  # echo 100M > /dev/cgroup/memory/x/memory.limit_in_bytes
  # echo $$ > /dev/cgroup/memory/x/tasks
  # dd if=/dev/zero of=/data/f1 bs=1k count=1M &
  # dd if=/dev/zero of=/data/f2 bs=1k count=1M &
  # wait
  [1]-  Killed    dd if=/dev/zero of=/data/f1 bs=1k count=1M
  [2]+  Killed    dd if=/dev/zero of=/data/f2 bs=1k count=1M

This is caused by direct reclaim not being able to reliably reclaim
(write) dirty page cache pages.
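For comparison, the blk bandwidth throttling mentioned above is
configured through the blkio controller (blk-throttle,
CONFIG_BLK_DEV_THROTTLING). The device number and rate below are
illustrative; note that in kernels of this era buffered writeback is
issued by the flusher threads rather than the dirtying task, so it may
escape this throttle, which is part of why it may not be sufficient on
its own.

```shell
# blkio write-bandwidth throttle sketch; "8:0" (sda) and the 10 MB/s
# rate are illustrative values. Paths assume cgroupfs at /dev/cgroup.
mkdir /dev/cgroup/blkio/x
# Format is "major:minor bytes_per_second".
echo "8:0 10485760" > /dev/cgroup/blkio/x/blkio.throttle.write_bps_device
echo $$ > /dev/cgroup/blkio/x/tasks
```

This limits the rate at which I/O charged to the cgroup is dispatched
to the device, but does not by itself bound how much dirty page cache
the cgroup can accumulate.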