On Thu, Feb 2, 2012 at 5:26 PM, Wu Fengguang <fengguang.wu@xxxxxxxxx> wrote:
> On Thu, Feb 02, 2012 at 04:42:09PM +0100, Jan Kara wrote:
>> On Thu 02-02-12 19:04:34, Wu Fengguang wrote:
>> > If memcg A's dirty rate is throttled, its dirty pages will naturally
>> > shrink. The flusher will automatically work less on A's dirty pages.
>>   I'm not sure about details of requirements Google guys have. So this may
>> or may not be good enough for them. I'd suspect they still wouldn't want
>> one cgroup to fill up available page cache with dirty pages so just
>> limiting bandwidth won't be enough for them. Also limiting dirty
>> bandwidth has a problem that it's not coupled with how much reading the
>> particular cgroup does. Anyway, until we are sure about their exact
>> requirements, this is mostly philosophical talking ;).
>
> Yeah, I'm not sure what exactly Google needs and how big a problem the
> partition will be for them. Basically,
>
> - when there are N memcg each dirtying 1 file, each file will be
>   flushed every (N * 0.5) seconds, where 0.5s is the typical time
>
> - if (memcg_dirty_limit > 10 * bdi_bandwidth), the dd tasks should be
>   able to progress reasonably smoothly
>
> Thanks,
> Fengguang

I am looking for a solution that partitions memory and, ideally, disk
bandwidth. This is a large undertaking, and I am willing to start small
and grow into a more sophisticated solution if needed.

One important goal is to enforce per-container memory limits, covering
both dirty and clean page cache. Moving memcg dirty pages to the root
cgroup is probably not going to work, because it would not allow control
of a job's memory usage. My hunch is that we will therefore need
per-memcg dirty counters, per-memcg dirty limits, and some writeback
changes. Perhaps the initial writeback changes could be small: enough to
ensure that writeback keeps writing until it has serviced any over-limit
cgroups. This is complicated by the fact that a memcg can have dirty
memory spread across different bdis.
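To make the direction concrete, here is a rough sketch of what a
per-memcg dirty limit interface might look like. Note that
memory.dirty_limit_in_bytes is hypothetical: it does not exist in
mainline kernels, and the name and 20M value are only illustrative of
the proposal above.

```shell
# Hypothetical sketch only: memory.dirty_limit_in_bytes is a made-up
# control file illustrating a proposed per-memcg dirty limit; it does
# not exist in mainline kernels. Paths assume cgroupfs at /dev/cgroup.
mkdir /dev/cgroup/memory/job1
echo 100M > /dev/cgroup/memory/job1/memory.limit_in_bytes
# Cap the job's dirty page cache well below its memory limit, so that
# direct reclaim inside the memcg can always find clean pages to drop
# instead of hitting a wall of dirty pages and OOMing.
echo 20M > /dev/cgroup/memory/job1/memory.dirty_limit_in_bytes
echo $$ > /dev/cgroup/memory/job1/tasks
```

The point of such a knob would be to bound how much of the memcg's
charge can be dirty at once, independent of the global dirty limits.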
If blk bandwidth throttling is sufficient here, then let me know,
because it sounds easier ;)

Here is an example of a memcg OOM seen on a 3.3 kernel:

  # mkdir /dev/cgroup/memory/x
  # echo 100M > /dev/cgroup/memory/x/memory.limit_in_bytes
  # echo $$ > /dev/cgroup/memory/x/tasks
  # dd if=/dev/zero of=/data/f1 bs=1k count=1M &
  # dd if=/dev/zero of=/data/f2 bs=1k count=1M &
  # wait
  [1]-  Killed    dd if=/dev/zero of=/data/f1 bs=1k count=1M
  [2]+  Killed    dd if=/dev/zero of=/data/f2 bs=1k count=1M

This is caused by direct reclaim not being able to reliably reclaim
(write) dirty page cache pages.
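For comparison, the blk bandwidth throttling mentioned above is
configured through the blkio controller (blk-throttle,
CONFIG_BLK_DEV_THROTTLING). The device number and rate below are
illustrative; note that in kernels of this era buffered writeback is
issued by the flusher threads rather than the dirtying task, so it may
escape this throttle, which is part of why it may not be sufficient on
its own.

```shell
# blkio write-bandwidth throttle sketch; "8:0" (sda) and the 10 MB/s
# rate are illustrative values. Paths assume cgroupfs at /dev/cgroup.
mkdir /dev/cgroup/blkio/x
# Format is "major:minor bytes_per_second".
echo "8:0 10485760" > /dev/cgroup/blkio/x/blkio.throttle.write_bps_device
echo $$ > /dev/cgroup/blkio/x/tasks
```

This limits the rate at which I/O charged to the cgroup is dispatched
to the device, but does not by itself bound how much dirty page cache
the cgroup can accumulate.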