Re: [PATCH V6 00/10] memcg: per cgroup background reclaim

On Wed, Apr 20, 2011 at 10:08 PM, Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
On Thu, Apr 21, 2011 at 01:00:16PM +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 21 Apr 2011 04:51:07 +0200
> Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>
> > > If the cgroup is configured to use per cgroup background reclaim, a kswapd
> > > thread is created which only scans the per-memcg LRU list.
> >
> > We already have direct reclaim, direct reclaim on behalf of a memcg,
> > and global kswapd-reclaim.  Please don't add yet another reclaim path
> > that does its own thing and interacts unpredictably with the rest of
> > them.
> >
> > As discussed on LSF, we want to get rid of the global LRU.  So the
> > goal is to have each reclaim entry end up at the same core part of
> > reclaim that round-robin scans a subset of zones from a subset of
> > memory control groups.
>
> It's not related to this set. And I think even if we remove global LRU,
> global-kswapd and memcg-kswapd need to do independent work.
>
> global-kswapd : works for zone/node balancing and making free pages,
>                 and compaction. select a memcg victim and ask it
>                 to reduce memory with regard to gfp_mask. Starts its work
>                 when zone/node is unbalanced.

For soft limit reclaim (which is triggered by global memory pressure),
we want to scan a group of memory cgroups equally in round robin
fashion.  I think at LSF we established that it is not fair to find
the one that exceeds its limit the most and hammer it until memory
pressure is resolved or there is another group with more excess.

So even for global kswapd, sooner or later we need a mechanism to
apply equal pressure to a set of memcgs.

With the removal of the global LRU, we ALWAYS operate on a set of
memcgs in a round-robin fashion, not just for soft limit reclaim.

So yes, these are two different things, but they have the same
requirements.

Hmm. I don't see any disagreement on global-kswapd. The plan now is to do the
round-robin scan based on the memcgs' soft_limit. (Note: this is not how it is
implemented today; I am working on that patch now.)
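
To make that concrete, here is a minimal userspace sketch of the idea
(illustration only, not code from the patchset; struct memcg_sim,
RECLAIM_CHUNK and soft_limit_pass() are invented names): every pass visits
each memcg that exceeds its soft_limit and applies the same small amount of
pressure, rather than hammering the biggest offender until it stops being
the biggest.

/* Hypothetical illustration only -- not code from the patchset.  It
 * simulates round-robin soft_limit reclaim: every group above its
 * soft_limit gets the same small chunk of pressure per pass. */
#include <stdio.h>

struct memcg_sim {
	const char *name;
	long usage;		/* pages charged */
	long soft_limit;	/* pages */
};

#define RECLAIM_CHUNK 32	/* equal pressure applied per visit */

/* One round-robin pass; returns how many pages this pass "reclaimed". */
static long soft_limit_pass(struct memcg_sim *groups, int n)
{
	long freed = 0;

	for (int i = 0; i < n; i++) {
		struct memcg_sim *m = &groups[i];
		long excess = m->usage - m->soft_limit;
		long take;

		if (excess <= 0)
			continue;	/* below its soft_limit: leave it alone */

		take = excess < RECLAIM_CHUNK ? excess : RECLAIM_CHUNK;
		m->usage -= take;
		freed += take;
	}
	return freed;
}

int main(void)
{
	struct memcg_sim groups[] = {
		{ "A", 1000,  800 },
		{ "B", 5000, 1000 },	/* biggest offender, but not singled out */
		{ "C",  900,  900 },
	};
	long need = 500;		/* global shortfall to make up */

	while (need > 0) {
		long freed = soft_limit_pass(groups, 3);

		if (!freed)
			break;		/* nobody is above its soft_limit anymore */
		need -= freed;
	}

	for (int i = 0; i < 3; i++)
		printf("%s: usage=%ld soft_limit=%ld\n",
		       groups[i].name, groups[i].usage, groups[i].soft_limit);
	return 0;
}

A kernel-side version would of course feed each visit into the regular
per-memcg scanning code rather than just decrementing a counter.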

> memcg-kswapd  : works for reducing usage of memory, no interest in
>                 zones/nodes. Starts when the high/low watermarks are hit.

When the watermark is hit in the charge path, we want to wake up the
daemon to reclaim from a specific memcg.

When multiple memcgs exceed their watermarks in parallel (after all,
we DO allow concurrency), we again have a group of memcgs we want to
reclaim from in a fair fashion until their watermarks are met again.

And memcg reclaim is not oblivious to nodes and zones; right now, we
also do mind the current node and respect the zone balancing when we
do direct reclaim on behalf of a memcg.

So, to be honest, I really don't see how both cases should be
independent from each other.  On the contrary, I see very little
difference between them.  The entry path differs slightly as well as
the predicate for the set of memcgs to scan.  But most of the worker
code is exactly the same, no?

They are triggered at different points and have different targets. One is
triggered under global pressure, and the calculation of which memcg to
reclaim from, and how much, is based on the soft_limit; its target is to
bring the zone back to its watermark as well as to keep the zones balanced.
The other is triggered per-memcg on its watermarks, and its target is to
bring the memcg's usage back below its watermark.
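
A rough way to picture the two triggers (again just a sketch with invented
field and function names, and assuming usage-based marks where high_wmark <
low_wmark < hard limit; the patch may define the pair differently): global
kswapd keys off zone free pages, while the per-memcg daemon keys off that
memcg's own usage.

/* Illustrative predicates only; names are invented, not from the patchset. */
struct zone_sim { long free_pages, zone_low_wmark; };
struct memcg_wm { long usage, high_wmark, low_wmark; };

/* Global kswapd: woken by zone pressure; which memcg it then scans is
 * decided by soft_limit excess (see the previous sketch). */
static int zone_needs_reclaim(const struct zone_sim *z)
{
	return z->free_pages < z->zone_low_wmark;
}

/* Per-memcg kswapd: woken when this group's usage crosses its low
 * watermark, done once usage is back under the high watermark. */
static int memcg_needs_reclaim(const struct memcg_wm *m)
{
	return m->usage > m->low_wmark;
}

static int memcg_reclaim_done(const struct memcg_wm *m)
{
	return m->usage <= m->high_wmark;
}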

> > > Two watermarks ("high_wmark", "low_wmark") are added to trigger the
> > > background reclaim and stop it. The watermarks are calculated based
> > > on the cgroup's limit_in_bytes.
> >
> > Which brings me to the next issue: making the watermarks configurable.
> >
> > You argued that having them adjustable from userspace is required for
> > overcommitting the hardlimits and per-memcg kswapd reclaim not kicking
> > in in case of global memory pressure.  But that is only a problem
> > because global kswapd reclaim is (apart from soft limit reclaim)
> > unaware of memory control groups.
> >
> > I think the much better solution is to make global kswapd memcg aware
> > (with the above mentioned round-robin reclaim scheduler), compared to
> > adding new (and final!) kernel ABI to avoid an internal shortcoming.
>
> I don't think its a good idea to kick kswapd even when free memory is enough.

This depends on what kswapd is supposed to be doing.  I don't say we
should reclaim from all memcgs (i.e. globally) just because one memcg
hits its watermark, of course.

But the argument was that we need the watermarks configurable to force
per-memcg reclaim even when the hard limits are overcommitted, because
global reclaim does not do a fair job to balance memcgs.
 
There seems to be some confusion here. The watermark we defined is per-memcg,
and it is calculated based on the hard_limit. We need the per-memcg wmark for
the same reason we have the per-zone wmark: it triggers background reclaim
before direct reclaim kicks in.

There is a patch in my patchset which adds a tunable for both the high and low watermarks, which gives the admin more flexibility in configuring the host. In an over-committed environment, we might never hit the wmarks if they are all set internally.
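
Schematically, the setup could look like the fragment below (the names, the
5% default, and the direction of the marks are all assumptions on my part,
not the patch's actual values): both marks are derived from limit_in_bytes,
and the tunable just widens or narrows their distance from the hard limit.

/* Schematic only -- the real computation is in the patchset and may use
 * different constants or define the marks the other way around.
 * wmark_ratio_pct stands in for the proposed per-memcg tunable. */
#include <stdio.h>
#include <stdint.h>

#define DEFAULT_WMARK_PCT 5	/* assumed default distance, not from the patch */

struct memcg_wmarks {
	uint64_t high;	/* background reclaim stops once usage falls below this */
	uint64_t low;	/* background reclaim starts once usage rises above this */
};

static struct memcg_wmarks setup_wmarks(uint64_t limit_in_bytes,
					unsigned int wmark_ratio_pct)
{
	unsigned int pct = wmark_ratio_pct ? wmark_ratio_pct : DEFAULT_WMARK_PCT;
	struct memcg_wmarks w;

	/* Both marks sit below the hard limit; the gap between them is the
	 * hysteresis that keeps the per-memcg kswapd from flip-flopping. */
	w.low  = limit_in_bytes - limit_in_bytes * pct / 100;
	w.high = limit_in_bytes - limit_in_bytes * (2 * pct) / 100;
	return w;
}

int main(void)
{
	/* 512M hard limit, default ratio */
	struct memcg_wmarks w = setup_wmarks(512ULL << 20, 0);

	printf("high=%llu low=%llu\n",
	       (unsigned long long)w.high, (unsigned long long)w.low);
	return 0;
}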

My counter proposal is to fix global reclaim instead and apply equal pressure on memcgs, such that we never have to tweak per-memcg watermarks to achieve the same thing.

We still need this, and that is the soft_limit reclaim done as part of global background reclaim.

--Ying
