On Thu 08-01-15 11:33:53, Vladimir Davydov wrote:
> On Wed, Jan 07, 2015 at 03:38:58PM +0100, Michal Hocko wrote:
> > On Wed 07-01-15 11:58:28, Vladimir Davydov wrote:
> > > On Tue, Jan 06, 2015 at 05:14:35PM +0100, Michal Hocko wrote:
> > > [...]
> > > > And as a memcg co-maintainer I would like to also discuss the
> > > > following topics.
> > > > - We should finally settle down with a set of core knobs exported
> > > >   with the new unified hierarchy cgroups API. I have proposed this
> > > >   already http://marc.info/?l=linux-mm&m=140552160325228&w=2 but
> > > >   there is no clear consensus and the discussion has died later on.
> > > >   I feel it would be more productive to sit together and come up
> > > >   with a reasonable compromise between "let's start from the
> > > >   beginning" and "keep useful and reasonable features".
> > > >
> > > > - kmem accounting is seeing a lot of activity, mainly thanks to
> > > >   Vladimir. He is basically the only active developer in this area.
> > > >   I would be happy if he could attend as well and discuss his future
> > > >   plans in the area. The work overlaps with slab allocators and slab
> > > >   shrinkers, so having people familiar with these areas would be
> > > >   more than welcome.
> > >
> > > One more memcg related topic that is worth discussing IMO:
> > >
> > > - On global memory pressure we walk over all memory cgroups and scan
> > >   pages from each of them. Since there can be hundreds or even
> > >   thousands of memory cgroups, such a walk can be quite expensive,
> > >   especially if the cgroups are small, so that we have to descend to
> > >   a lower scan priority to reclaim anything from them.
> >
> > We do not get to lower priorities just to scan small cgroups. They
> > will simply get ignored unless we are force scanning them.
>
> That means that small cgroups (< 16 M) may not be scanned at all if
> there are enough reclaimable pages in bigger cgroups. I'm not sure if
> anyone will mix small and big cgroups on the same host though. However,
> currently this may leave offline memory cgroups hanging around forever
> if they have some memory on destruction, because they will become small
> due to global reclaim sooner or later. OTOH, we could always forcefully
> scan lruvecs that belong to dead cgroups, or limit the maximum number
> of dead cgroups, w/o reworking the reclaimer.

Makes sense! Now that we do not reparent on offline, this might indeed
be a problem. Care to send a patch? I will cook up something if you do
not have time for that. Something along these lines should work, but I
haven't thought about it very much to be honest:

diff --git a/mm/vmscan.c b/mm/vmscan.c
index e29f411b38ac..277585176a9e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1935,7 +1935,7 @@ static void get_scan_count(struct lruvec *lruvec, int swappiness,
 	 * latencies, so it's better to scan a minimum amount there as
 	 * well.
 	 */
-	if (current_is_kswapd() && !zone_reclaimable(zone))
+	if (current_is_kswapd() && (!zone_reclaimable(zone) || mem_cgroup_need_force_scan(sc->target_mem_cgroup)))
 		force_scan = true;
 	if (!global_reclaim(sc))
 		force_scan = true;
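[The mem_cgroup_need_force_scan() helper referenced in the sketch above
does not exist in the tree; what follows is only a rough illustration of
what it could look like, assuming an "is this memcg offline?" check built
on css_tryget_online(). Note also that sc->target_mem_cgroup is NULL for
global reclaim, so a real patch would presumably have to check the memcg
owning the lruvec being scanned instead.]

static bool mem_cgroup_need_force_scan(struct mem_cgroup *memcg)
{
	/* No memcg (or memcg disabled): nothing special to force. */
	if (mem_cgroup_disabled() || !memcg)
		return false;

	/*
	 * css_tryget_online() fails once the cgroup has been killed
	 * (rmdir'ed), i.e. the memcg is only kept alive by leftover
	 * references such as pages on its LRU lists. Such a dead group
	 * cannot grow any more, so scanning it aggressively is cheap and
	 * helps it go away sooner.
	 */
	if (!css_tryget_online(&memcg->css))
		return true;

	css_put(&memcg->css);
	return false;
}

Since struct mem_cgroup is still private to mm/memcontrol.c at this
point, such a helper would live there, with a stub returning false for
!CONFIG_MEMCG builds.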
> > > The problem is augmented by offline memory cgroups, which can now
> > > be dangling for an indefinitely long time.
> >
> > OK, but shrink_lruvec shouldn't do too much work on a memcg which
> > doesn't have any pages to scan for the given priority. Or have you
> > seen this in some profiles?
>
> In real life, no.
>
> > > That's why I think we should work out a better algorithm for the
> > > memory reclaimer. Maybe we could rank memory cgroups somehow (by
> > > their age, memory consumption?) and try to scan only the top-ranked
> > > cgroup during a reclaimer run.
> >
> > We still have to keep some fairness and reclaim all groups
> > proportionally, and balancing this would be quite non-trivial. I am
> > not saying we couldn't implement our iterators in a more intelligent
> > way, but this code is quite complex already and I haven't seen this
> > as a big problem yet. Some overhead is to be expected when thousands
> > of groups are configured, right?
>
> Right, sounds convincing. Let's cross out this topic then until we see
> complaints from real users. No need to spend time on it right now.
>
> Sorry for the noise.

No noise at all! Thanks!
-- 
Michal Hocko
SUSE Labs
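[For reference, the walk being discussed is the per-memcg loop in
shrink_zone(). Below is a heavily simplified paraphrase of that loop,
roughly as of the kernel this thread is against; soft limit handling,
scan bookkeeping and the low-memory safeguards are omitted. It is not
the actual mm/vmscan.c code, only an illustration of why every group in
the hierarchy, however small or long dead, gets visited on each reclaim
pass.]

/* Simplified paraphrase of the shrink_zone() memcg walk in mm/vmscan.c. */
static void shrink_zone_memcg_walk(struct zone *zone, struct scan_control *sc)
{
	/* NULL for global reclaim, i.e. the whole hierarchy is walked. */
	struct mem_cgroup *root = sc->target_mem_cgroup;
	struct mem_cgroup_reclaim_cookie reclaim = {
		.zone = zone,
		.priority = sc->priority,
	};
	struct mem_cgroup *memcg;
	unsigned long lru_pages = 0;

	memcg = mem_cgroup_iter(root, NULL, &reclaim);
	do {
		struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
		int swappiness = mem_cgroup_swappiness(memcg);

		/*
		 * get_scan_count(), called from shrink_lruvec(), decides how
		 * much of this lruvec to scan at the current priority; a tiny
		 * group may get nothing unless force_scan is set.
		 */
		shrink_lruvec(lruvec, swappiness, sc, &lru_pages);

		memcg = mem_cgroup_iter(root, memcg, &reclaim);
	} while (memcg);
}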