Re: [PATCH V3 0/2] memcg softlimit reclaim rework

Johannes Weiner <hannes@xxxxxxxxxxx> · Fri, 20 Apr 2012 00:33:18 +0200

On Thu, Apr 19, 2012 at 10:47:27AM -0700, Ying Han wrote:
> On Thu, Apr 19, 2012 at 10:04 AM, Michal Hocko <mhocko@xxxxxxx> wrote:
> > On Wed 18-04-12 11:00:40, Ying Han wrote:
> >> On Wed, Apr 18, 2012 at 5:24 AM, Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> >> > On Tue, Apr 17, 2012 at 09:37:46AM -0700, Ying Han wrote:
> >> >> The "soft_limit" was introduced in memcg to support over-committing the
> >> >> memory resource on the host. Each cgroup configures its "hard_limit" where
> >> >> it will be throttled or OOM killed by going over the limit. However, the
> >> >> cgroup can go above the "soft_limit" as long as there is no system-wide
> >> >> memory contention. So, the "soft_limit" is the kernel mechanism for
> >> >> re-distributing system spare memory among cgroups.
> >> >>
> >> >> This patch reworks the softlimit reclaim by hooking it into the new global
> >> >> reclaim scheme. So the global reclaim path including direct reclaim and
> >> >> background reclaim will respect the memcg softlimit.
> >> >>
> >> >> v3..v2:
> >> >> 1. rebase the patch on 3.4-rc3
> >> >> 2. squash the commits of replacing the old implementation with new
> >> >> implementation into one commit. This is to make sure to leave the tree
> >> >> in stable state between each commit.
> >> >> 3. removed the commit which changes the nr_to_reclaim for global reclaim
> >> >> case. The need of that patch is not obvious now.
> >> >>
> >> >> Note:
> >> >> 1. The new implementation of softlimit reclaim is rather simple and a first
> >> >> step toward further optimizations. There is no memory pressure balancing between
> >> >> memcgs for each zone; that is something we would like to add as a follow-up.
> >> >>
> >> >> 2. this patch is slightly different from the last one posted from Johannes
> >> >> http://comments.gmane.org/gmane.linux.kernel.mm/72382
> >> >> where his patch is closer to the reverted implementation by doing hierarchical
> >> >> reclaim for each selected memcg. However, that is not the expected behavior from
> >> >> the user's perspective. Consider the following example:
> >> >>
> >> >> root (32G capacity)
> >> >> --> A (hard limit 20G, soft limit 15G, usage 16G)
> >> >>    --> A1 (soft limit 5G, usage 4G)
> >> >>    --> A2 (soft limit 10G, usage 12G)
> >> >> --> B (hard limit 20G, soft limit 10G, usage 16G)
> >> >>
> >> >> Under global reclaim, we shouldn't add pressure on A1 although its parent (A)
> >> >> exceeds its softlimit. This is what the admin expects when setting the softlimit
> >> >> to the actual working set size: pages under the softlimit are reclaimed only if
> >> >> the system has trouble reclaiming.
> >> >
> >> > Actually, this is exactly what the admin expects when creating a
> >> > hierarchy, because she defines that A1 is a child of A and is
> >> > responsible for the memory situation in its parent.
> >
> > Hmm, I guess that both approaches have cons and pros.
> > * Hierarchical soft limit reclaim - reclaim the whole subtree of the over
> >  soft limit memcg
> >  + it is consistent with the hard limit reclaim
> Not sure why we want them to be consistent. Soft_limit serves a
> different purpose, and one of its main purposes is to preserve the
> working set of the cgroup.

I'd argue, given the history of cgroups, one of the main purposes is
having a machine of containers where you overcommit their hard limit
and set the soft limit accordingly to provide fairness.

Yes, we don't want to reclaim hierarchies that are below their soft
limit as long as there are some in excess, of course.  This is a flaw
and needs fixing.  But that is something completely different from
changing how the soft limit is defined and suddenly allowing child
groups, which you may not trust, to override rules defined by parent
groups.
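
To make the two definitions under discussion concrete, here is a toy
model of the example hierarchy quoted above.  It is only a sketch, not
kernel code: `Group`, `eligible_local` and `eligible_hierarchical` are
names made up for illustration, and root's usage is assumed to be the
sum of A and B.

```python
from dataclasses import dataclass
from typing import Optional

GIB = 1 << 30


@dataclass
class Group:
    """Hypothetical stand-in for a memcg; not a kernel structure."""
    name: str
    usage: int
    soft_limit: Optional[int] = None  # None: no soft limit configured
    parent: Optional["Group"] = None

    def over_soft_limit(self) -> bool:
        return self.soft_limit is not None and self.usage > self.soft_limit


def eligible_local(group: Group) -> bool:
    """Per-group definition: only the group's own soft limit counts."""
    return group.over_soft_limit()


def eligible_hierarchical(group: Group) -> bool:
    """Hierarchical definition: the group or any ancestor is in excess."""
    node = group
    while node is not None:
        if node.over_soft_limit():
            return True
        node = node.parent
    return False


# Numbers taken from the quoted example; root usage is an assumption.
root = Group("root", usage=32 * GIB)
a = Group("A", 16 * GIB, 15 * GIB, root)
a1 = Group("A1", 4 * GIB, 5 * GIB, a)
a2 = Group("A2", 12 * GIB, 10 * GIB, a)
b = Group("B", 16 * GIB, 10 * GIB, root)
groups = [a, a1, a2, b]

local = [g.name for g in groups if eligible_local(g)]
hierarchical = [g.name for g in groups if eligible_hierarchical(g)]
print("local:", local)
print("hierarchical:", hierarchical)
```

The two lists differ only in A1: it is below its own soft limit, but
it sits under A, which is not.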

It bothers me that we should add something that will almost certainly
bite us in the future while we are discussing on the cgroups list what
would stand in the way of getting sane hierarchy semantics across
controllers to provide consistency, nesting, etc.

And all of that to support a single use case, which I feel we still
have not discussed nearly enough to justify this change.

For example, I get that you want 'meta-groups' that group together
subgroups for common accounting and hard limiting.  But I don't see
why such meta-groups have their own processes.  Conceptually, I mean,
how does a process fit into A?  Is it superior to the tasks in A1 and
A2?  Why can't it live in A3?

So here is a proposal:

Would it make sense to try to keep those meta groups always free of
their own memory so that they don't /need/ soft limits with weird
semantics?  E.g. immediately free the unused memory on rmdir, OR add
mechanisms to migrate the memory to a dedicated group:

     A
       A1 (soft-limited)
       A2 (soft-limited)
     B
     unused (soft-limited)

Move all leftover memory from finished jobs to this 'unused' group.
You could set its soft limit to 0 so that it sticks around only until
you actually need the memory for something else.

Then you would get the benefits of accounting and limiting A1 and A2
under a single umbrella without the need for a soft limit in A.  We
could keep the consistent semantics for soft limits, because you would
only have to set it on leaf nodes.
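
A rough sketch of how that could play out, under stated assumptions:
`Leaf`, `excess` and `retire` are hypothetical names, the "migrate
leftover memory" step is modeled as a plain transfer of the charge,
and A2's usage is a made-up number.  Nothing here is an existing
kernel interface.

```python
from dataclasses import dataclass

GIB = 1 << 30


@dataclass
class Leaf:
    """Hypothetical leaf group; soft limits are set on leaves only."""
    name: str
    usage: int
    soft_limit: int

    def excess(self) -> int:
        # Memory above the soft limit, i.e. what soft limit reclaim targets.
        return max(0, self.usage - self.soft_limit)


def retire(job: Leaf, unused: Leaf) -> None:
    """Assumed migration step: move a finished job's memory to 'unused'."""
    unused.usage += job.usage
    job.usage = 0


a1 = Leaf("A1", usage=4 * GIB, soft_limit=5 * GIB)
a2 = Leaf("A2", usage=3 * GIB, soft_limit=10 * GIB)  # made-up usage
unused = Leaf("unused", usage=0, soft_limit=0)

retire(a2, unused)  # the job in A2 has finished

# With a soft limit of 0, everything held by 'unused' counts as excess.
candidates = [g.name for g in (a1, a2, unused) if g.excess() > 0]
print(candidates)
```

In this model the leftover memory is the only soft limit reclaim
candidate, while A1 stays untouched below its limit and A itself never
needs a soft limit.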

Wouldn't this work for you?
