Re: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining

Roman Gushchin <roman.gushchin@xxxxxxxxx> · Wed, 25 Jan 2023 15:14:48 -0800

On Wed, Jan 25, 2023 at 03:22:00PM -0300, Marcelo Tosatti wrote:
> On Wed, Jan 25, 2023 at 08:06:46AM -0300, Leonardo Brás wrote:
> > On Wed, 2023-01-25 at 09:33 +0100, Michal Hocko wrote:
> > > On Wed 25-01-23 04:34:57, Leonardo Bras wrote:
> > > > Disclaimer:
> > > > a - The cover letter got bigger than expected, so I had to split it in
> > > >     sections to better organize myself. I am not very confortable with it.
> > > > b - Performance numbers below did not include patch 5/5 (Remove flags
> > > >     from memcg_stock_pcp), which could further improve performance for
> > > >     drain_all_stock(), but I could only notice the optimization at the
> > > >     last minute.
> > > > 
> > > > 
> > > > 0 - Motivation:
> > > > On current codebase, when drain_all_stock() is ran, it will schedule a
> > > > drain_local_stock() for each cpu that has a percpu stock associated with a
> > > > descendant of a given root_memcg.

Do you know what caused those drain_all_stock() calls? I wonder if we should look
into why we have many of them and whether we really need them?

It's either some user action (e.g. reducing memory.max) or some memcg
entering pre-OOM conditions. In the latter case a lot of drain calls can be
scheduled without a good reason (assuming the cgroup contains multiple tasks running
on multiple cpus). Essentially each cpu will try to grab the remains of the memory quota
and move it locally. I wonder whether in such circumstances we need to disable the pcp-caching
on a per-cgroup basis.

Generally speaking, draining of pcpu stocks is useful only if an idle cpu is holding some
charges/memcg references (it might not be completely idle, but running some very special
workload which is not doing any kernel allocations, or a process belonging to the root memcg).
In all other cases the pcpu stock will either be drained naturally by an allocation from another
memcg, or an allocation from the same memcg will "restore" it, making draining useless.
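
To make the two cases above concrete, here is a minimal userspace sketch. It is a toy
model, not kernel code: toy_memcg, toy_stock, toy_charge(), toy_drain() and TOY_BATCH are
hypothetical names chosen for illustration, and locking, per-cpu access and limits are all
left out. It only shows that a charge from another memcg drains the cached stock as a side
effect, while a charge from the same memcg just consumes or refills it.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: these names are hypothetical, not kernel APIs. */
#define TOY_BATCH 64U

struct toy_memcg {
	long charged;			/* pages charged to this cgroup */
};

struct toy_stock {
	struct toy_memcg *cached;	/* memcg owning the cached charges */
	unsigned int nr_pages;		/* pre-charged pages not yet used */
};

/* Give the unused pre-charged pages back to the owning memcg. */
static void toy_drain(struct toy_stock *s)
{
	if (s->cached) {
		s->cached->charged -= (long)s->nr_pages;
		s->nr_pages = 0;
		s->cached = NULL;
	}
}

/* Charge nr_pages to memcg on the "cpu" that owns stock s. */
static void toy_charge(struct toy_stock *s, struct toy_memcg *memcg,
		       unsigned int nr_pages)
{
	assert(nr_pages > 0 && nr_pages <= TOY_BATCH);

	/* Fast path: same memcg and enough cached pages. */
	if (s->cached == memcg && s->nr_pages >= nr_pages) {
		s->nr_pages -= nr_pages;
		return;
	}

	/* Another memcg allocates here: the old stock is drained "naturally". */
	if (s->cached != memcg)
		toy_drain(s);

	/* Slow path: charge a whole batch and cache the surplus locally. */
	memcg->charged += TOY_BATCH;
	s->cached = memcg;
	s->nr_pages += TOY_BATCH - nr_pages;
}
```

In this model an explicit toy_drain() only changes anything when nothing else is going to
touch the stock, which corresponds to the idle-cpu case.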

We can also get into drain_all_pages() opportunistically, without waiting for the result.
On a busy system it's most likely useless: we might OOM before the scheduled work items are executed.

I admit I planned to do some work around this and even started, but then never had enough time to
finish it.

Overall I'm somewhat resistant to the idea of making the generic allocation & free paths slower
for the sake of an improvement in stock draining. It's not a strong objection, but IMO we should
avoid doing this without a really strong reason.

Thanks!