Re: [PATCH V3 0/2] memcg softlimit reclaim rework

Ying Han <yinghan@xxxxxxxxxx> · Fri, 20 Apr 2012 10:22:11 -0700

On Fri, Apr 20, 2012 at 1:11 AM, Michal Hocko <mhocko@xxxxxxx> wrote:
> On Thu 19-04-12 10:47:27, Ying Han wrote:
>> On Thu, Apr 19, 2012 at 10:04 AM, Michal Hocko <mhocko@xxxxxxx> wrote:
>> > On Wed 18-04-12 11:00:40, Ying Han wrote:
>> >> On Wed, Apr 18, 2012 at 5:24 AM, Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>> >> > On Tue, Apr 17, 2012 at 09:37:46AM -0700, Ying Han wrote:
>> >> >> The "soft_limit" was introduced in memcg to support over-committing the
>> >> >> memory resource on the host. Each cgroup configures its "hard_limit" where
>> >> >> it will be throttled or OOM killed by going over the limit. However, the
>> >> >> cgroup can go above the "soft_limit" as long as there is no system-wide
>> >> >> memory contention. So, the "soft_limit" is the kernel mechanism for
>> >> >> re-distributing system spare memory among cgroups.
>> >> >>
>> >> >> This patch reworks the softlimit reclaim by hooking it into the new global
>> >> >> reclaim scheme. So the global reclaim path including direct reclaim and
>> >> >> background reclaim will respect the memcg softlimit.
>> >> >>
>> >> >> v3..v2:
>> >> >> 1. rebase the patch on 3.4-rc3
>> >> >> 2. squash the commits of replacing the old implementation with new
>> >> >> implementation into one commit. This is to make sure to leave the tree
>> >> >> in stable state between each commit.
>> >> >> 3. removed the commit which changes the nr_to_reclaim for global reclaim
>> >> >> case. The need of that patch is not obvious now.
>> >> >>
>> >> >> Note:
>> >> >> 1. the new implementation of softlimit reclaim is rather simple and first
>> >> >> step for further optimizations. there is no memory pressure balancing between
>> >> >> memcgs for each zone, and that is something we would like to add as follow-ups.
>> >> >>
>> >> >> 2. this patch is slightly different from the last one posted from Johannes
>> >> >> http://comments.gmane.org/gmane.linux.kernel.mm/72382
>> >> >> where his patch is closer to the reverted implementation by doing hierarchical
>> >> >> reclaim for each selected memcg. However, that is not expected behavior from
>> >> >> user perspective. Considering the following example:
>> >> >>
>> >> >> root (32G capacity)
>> >> >> --> A (hard limit 20G, soft limit 15G, usage 16G)
>> >> >>    --> A1 (soft limit 5G, usage 4G)
>> >> >>    --> A2 (soft limit 10G, usage 12G)
>> >> >> --> B (hard limit 20G, soft limit 10G, usage 16G)
>> >> >>
>> >> >> Under global reclaim, we shouldn't add pressure on A1 although its parent(A)
>> >> >> exceeds softlimit. This is what admin expects by setting softlimit to the
>> >> >> actual working set size and only reclaim pages under softlimit if system has
>> >> >> trouble to reclaim.
>> >> >
>> >> > Actually, this is exactly what the admin expects when creating a
>> >> > hierarchy, because she defines that A1 is a child of A and is
>> >> > responsible for the memory situation in its parent.
>> >
>> > Hmm, I guess that both approaches have cons and pros.
>> > * Hierarchical soft limit reclaim - reclaim the whole subtree of the over
>> >  soft limit memcg
>> >  + it is consistent with the hard limit reclaim
>> Not sure why we want them to be consistent. Soft_limit serves a
>> different purpose, and one of its main purposes is to preserve the
>> working set of the cgroup.
>
> Well, the cgroups subsystem is moving towards unification, so all the
> controllers should live in one hierarchy, and it would be nice if we had
> a common view on what hard and soft limits mean wrt. hierarchies. It is
> true that memcg is the only user of the soft limit at the moment, but it
> would be better if we were prepared for future users as well and
> didn't come up with one-shot solutions.
>
>> >  + easier for top to bottom configuration - especially when you allow
>> >    subgroups to create deeper hierarchies. Does anybody do that?
>>
>> As far as I have heard, most (if not all) are using a flat configuration
>> where everything is running under root.
>
> Might be true for memcg but what about other controllers?

>
> [...]
>> > Both approaches don't play very well with the default 0 limit because we
>> > either reclaim unless we set up the whole hierarchy properly or we just
>> > burn cycles by trying to reclaim groups with no or only few pages.
>>
>> Setting the default to 0 is a good optimization which makes everybody
>> eligible for reclaim if the admin doesn't do anything.
>>
>> In reality, if the admin wants to preserve the working set of a cgroup,
>> he/she has to set the softlimit. By doing that, it is easier to focus
>> only on the cgroup itself without looking up its ancestors.
>
> I guess it is not that clear who should be responsible for setting the
> limit. Should it be admin or rather a workload owner? Because this
> changes a lot.

Today the model we have is to let the admin set it by monitoring each
cgroup's working set size. But I think letting the workload set it
itself would also be a valid use case. Something like self-ballooning.

>
>>
>> > The second approach leads to more expected results though because we do
>> > not touch "leaf" groups unless they are over limit.
>> > I have to think about that some more but it seems that the second approach
>> > is much easier to implement and matches the "guarantee" expectations
>> > more.
>>
>> Agree.
>>
>> > I guess we could converge both approaches if we could reclaim from the
>> > leaf groups upwards to the root but I didn't think about this very much.
>>
>> That is what the current patch does, which only considers softlimit
>> under global pressure :)
>
> Not really, because your patch iterates sequentially from top to bottom.
> I was thinking about iterating from the leaves and doing the hierarchical
> reclaim from the first one which is over the limit. This would uncharge
> from the parent as well, so it could get down under its limit, and if not
> then we can hammer on siblings. But, as I said, I didn't give this much
> thought; it surely comes with its own set of issues (including
> inconsistency with the hard limit reclaim ;))

I feel like we mixed two things together here:

1. per-memcg reclaim: This is triggered when A reaches its hard_limit,
and then we do hierarchical reclaim including A1 and A2

2. global reclaim: This is triggered when root reaches its limit (root
doesn't have a limit, but we can think of it that way), and then we do
hierarchical reclaim including all the cgroups on the system.

The soft_reclaim I have here (so far) only triggers under global
reclaim, so we follow the same rules as the existing global reclaim,
except that memcgs under their soft_limit are filtered out to a certain
degree.

If we are talking about adding soft_limit reclaim to per-memcg
reclaim, that is the case where we care about the limit of A when
reclaiming from A1. If we decide to do that, it should be added to the
per-memcg reclaim logic (a small change: removing the global reclaim check).
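
To make the difference between the two eligibility rules concrete, here
is a rough Python model (not kernel code) run against the example
hierarchy from the cover letter. All names in it (Memcg,
eligible_own_limit, eligible_hierarchical, ignore_soft_limit) are made
up for illustration, and the ignore_soft_limit flag is only an assumed
stand-in for "the system has trouble reclaiming"; the real condition in
the patch is not modeled here.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List, Optional

GIB = 1 << 30


@dataclass(eq=False)
class Memcg:
    """Toy stand-in for a memory cgroup (hypothetical, not a kernel struct)."""
    name: str
    usage: int                  # hierarchical usage in bytes
    soft_limit: int = 0         # default 0: eligible as soon as it has usage
    parent: Optional["Memcg"] = field(default=None, repr=False)
    children: List["Memcg"] = field(default_factory=list, repr=False)

    def add(self, child: "Memcg") -> "Memcg":
        child.parent = self
        self.children.append(child)
        return child


def walk(memcg: Memcg) -> Iterator[Memcg]:
    """Top-to-bottom (pre-order) walk of a subtree."""
    yield memcg
    for child in memcg.children:
        yield from walk(child)


def eligible_own_limit(memcg: Memcg, ignore_soft_limit: bool = False) -> bool:
    """Rule discussed in this patch: look only at the memcg's own soft limit.

    ignore_soft_limit is an assumed flag meaning that reclaim is struggling,
    so memcgs under their soft limit are no longer filtered out.
    """
    return ignore_soft_limit or memcg.usage > memcg.soft_limit


def eligible_hierarchical(memcg: Memcg) -> bool:
    """Alternative rule: eligible if the memcg or any ancestor is over."""
    node: Optional[Memcg] = memcg
    while node is not None:
        if node.usage > node.soft_limit:
            return True
        node = node.parent
    return False


def select(top_level: List[Memcg], eligible: Callable[[Memcg], bool]) -> List[str]:
    """Names of the memcgs that global reclaim would scan under a rule."""
    return [m.name for top in top_level for m in walk(top) if eligible(m)]


# Example hierarchy from the cover letter (root itself is not modeled).
A = Memcg("A", usage=16 * GIB, soft_limit=15 * GIB)
A.add(Memcg("A1", usage=4 * GIB, soft_limit=5 * GIB))
A.add(Memcg("A2", usage=12 * GIB, soft_limit=10 * GIB))
B = Memcg("B", usage=16 * GIB, soft_limit=10 * GIB)
top_level = [A, B]

own_limit = select(top_level, eligible_own_limit)
hierarchical = select(top_level, eligible_hierarchical)
fallback = select(top_level, lambda m: eligible_own_limit(m, ignore_soft_limit=True))

print("own limit:   ", own_limit)
print("hierarchical:", hierarchical)
print("fallback:    ", fallback)
```

With the own-limit rule, A1 is skipped because it is under its own
soft_limit even though A is over; with the hierarchical rule, A1 is
picked up through A. Once the filter is dropped, all four groups are
scanned again.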

--Ying

>
> --
> Michal Hocko
> SUSE Labs
> SUSE LINUX s.r.o.
> Lihovarska 1060/12
> 190 00 Praha 9
> Czech Republic
