Greg,

On Thu, Feb 02, 2012 at 10:21:53PM -0800, Greg Thelen wrote:
> I am looking for a solution that partitions memory and ideally disk
> bandwidth.  This is a large undertaking and I am willing to start
> small and grow into a more sophisticated solution (if needed).  One
> important goal is to enforce per-container memory limits - this
> includes dirty and clean page cache.  Moving memcg dirty pages to root
> is probably not going to work because it would not allow for control
> of job memory usage.

Reserving 20% of global memory for dirty/writeback pages moved out of
the memcg allocations would do the trick: each job will use at most its
memcg limit, plus some share of the 20% dirty limit (20% being the
default vm.dirty_ratio).

Since the moved pages are marked PG_reclaim and hence will be freed
quickly after becoming clean, it's guaranteed that the dirty pages
moved out of the memcgs won't exceed the 20% global dirty limit at any
time.

So it would be some kind of per-job memcg container plus a globally
shared 20% dirty page container.  Job pages then cannot leak beyond
control.

But if this does not fit nicely into Google's usage model, I'm fine
with adding per-memcg dirty limits, bearing in mind that per-memcg
dirty limits won't behave smoothly if set too small.  We can do some
experiments on that once we get the minimal patch ready.

> My hunch is that we will thus need per-memcg
> dirty counters, limits, and some writeback changes.  Perhaps the
> initial writeback changes would be small: enough to ensure that
> writeback continues writing until it services any over-limit cgroups.

Yeah, that's a good plan.

> This is complicated by the fact that a memcg can have dirty memory
> spread on different bdi.

That sure sounds complicated.  The other problem is that the pos_ratio
feedback factor will no longer be roughly equal for all tasks writing
to the same bdi, making the bdi dirty_ratelimit estimation less stable.
Again, we can experiment with how well the control system behaves.

> If blk bandwidth throttling is sufficient
> here, then let me know because it sounds easier ;)

I'd love to say so; however, bandwidth throttling is obviously not the
right solution for the example below ;)

> Here is an example of a memcg OOM seen on a 3.3 kernel:
> # mkdir /dev/cgroup/memory/x
> # echo 100M > /dev/cgroup/memory/x/memory.limit_in_bytes
> # echo $$ > /dev/cgroup/memory/x/tasks
> # dd if=/dev/zero of=/data/f1 bs=1k count=1M &
> # dd if=/dev/zero of=/data/f2 bs=1k count=1M &
> # wait
> [1]- Killed      dd if=/dev/zero of=/data/f1 bs=1k count=1M
> [2]+ Killed      dd if=/dev/zero of=/data/f2 bs=1k count=1M
>
> This is caused by direct reclaim not being able to reliably reclaim
> (write) dirty page cache pages.

If dirty pages are moved out of the memcg to the 20% global dirty page
pool at page reclaim time, the above OOM can be avoided.  This does
change the meaning of memory.limit_in_bytes in that the memcg tasks can
now actually consume more pages (up to the shared global 20% dirty
limit).

Thanks,
Fengguang
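
PS: to make the page hand-off idea above a bit more concrete, here is a
rough and completely untested sketch of the reclaim side.
move_dirty_page_to_root() is a made-up placeholder for the real memcg
uncharge/recharge machinery, not an existing function:

	#include <linux/mm.h>
	#include <linux/page-flags.h>

	/* Made-up placeholder: recharge @page from its memcg to root. */
	void move_dirty_page_to_root(struct page *page);

	/*
	 * When memcg reclaim runs into a dirty page, hand it over to the
	 * globally shared dirty pool instead of stalling (or OOMing) on
	 * writeback.
	 */
	static void memcg_hand_off_dirty_page(struct page *page)
	{
		if (!PageDirty(page))
			return;

		move_dirty_page_to_root(page);	/* made-up helper */

		/*
		 * PG_reclaim makes end_page_writeback() rotate the page to
		 * the tail of the inactive LRU, so it gets freed soon after
		 * becoming clean -- which is what keeps the shared pool
		 * bounded by the 20% global dirty limit.
		 */
		SetPageReclaim(page);
	}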
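
Likewise, the writeback change in your plan might start out as small as
the shape below.  Both helpers are invented names for illustration: the
first stands in for the existing global background-threshold check in
fs/fs-writeback.c, the second doesn't exist at all yet:

	#include <linux/backing-dev.h>

	bool global_over_background_thresh(struct backing_dev_info *bdi);
	bool some_memcg_over_dirty_limit(struct backing_dev_info *bdi);

	/*
	 * Background writeback normally stops at the global background
	 * threshold; the new bit is to keep writing while any memcg with
	 * dirty pages on this bdi is still over its own dirty limit.
	 */
	static bool wb_over_some_limit(struct backing_dev_info *bdi)
	{
		if (global_over_background_thresh(bdi))
			return true;

		return some_memcg_over_dirty_limit(bdi);
	}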
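
And for anyone not deep in the writeback code: pos_ratio is the
position feedback factor computed in mm/page-writeback.c, whose global
part is roughly

	                         setpoint - dirty  3
	pos_ratio(dirty) ~= 1 + (----------------)
	                         limit - setpoint

with per-bdi corrections applied on top.  With per-memcg dirty limits,
tasks in different memcgs writing to the same bdi would sit at
different points on their own curves, so their pos_ratios diverge, and
the bdi-wide dirty_ratelimit estimation (which assumes they are roughly
equal) gets noisier.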