Re: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining

Leonardo Brás <leobras@xxxxxxxxxx> · Fri, 27 Jan 2023 03:55:39 -0300

On Thu, 2023-01-26 at 20:13 +0100, Michal Hocko wrote:
> On Thu 26-01-23 15:14:25, Marcelo Tosatti wrote:
> > On Thu, Jan 26, 2023 at 08:45:36AM +0100, Michal Hocko wrote:
> > > On Wed 25-01-23 15:22:00, Marcelo Tosatti wrote:
> > > [...]
> > > > Remote draining reduces interruptions whether CPU 
> > > > is marked as isolated or not:
> > > > 
> > > > - Allows isolated CPUs from benefiting of pcp caching.
> > > > - Removes the interruption to non isolated CPUs. See for example 
> > > > 
> > > > https://lkml.org/lkml/2022/6/13/2769
> > > 
> > > This is talking about page allocato per cpu caches, right? In this patch
> > > we are talking about memcg pcp caches. Are you sure the same applies
> > > here?
> > 
> > Both can stall the users of the drain operation.
> 
> Yes. But it is important to consider who those users are. We are
> draining when
> 	- we are charging and the limit is hit so that memory reclaim
> 	  has to be triggered.
> 	- hard, high limits are set and require memory reclaim.
> 	- force_empty - full memory reclaim for a memcg
> 	- memcg offlining - cgroup removel - quite a heavy operation as
> 	  well.
> all those could be really costly kernel operations and they affect
> isolated cpu only if the same memcg is used by both isolated and non-isolated
> cpus. In other words those costly operations would have to be triggered
> from non-isolated cpus and those are to be expected to be stalled. It is
> the side effect of the local cpu draining that is scheduled that affects
> the isolated cpu as well.
> 
> Is that more clear?

I think so, please help me check:

IIUC, we can approach this by dividing the problem in two working modes:
1 - Normal, meaning no drain_all_stock() running.
2 - Draining, grouping together pre-OOM and userspace 'config' : changing,
destroying, reconfiguring a memcg.

For (1), we will have (ideally) only local cpu working on the percpu struct.
This mode will not have any kind of contention, because each CPU will hold it's
own spinlock only. 

For (2), we will have a lot of drain_all_stock() running. This will mean a lot
of schedule_work_on() running (on upstream) or possibly causing contention, i.e.
local cpus having to wait for a lock to get their cache, on the patch proposal.

Ok, given the above is correct:

# Some arguments point that (1) becomes slower with this patch.

This is partially true: while test 2.2 pointed that local cpu functions running
time had became slower by a few cycles, test 2.4 points that the userspace
perception of it was that the syscalls and pagefaulting actually became faster:

During some debugging tests before getting the performance on test 2.4, I
noticed that the 'syscall + write' test would call all those functions that
became slower on test 2.2. Those functions were called multiple millions of
times during a single test, and still the patched version performance test
returned faster for test 2.4 than upstream version. Maybe the functions became
slower, but overall the usage of them in the usual context became faster.

Is not that a small improvement?

# Regarding (2), I notice that we fear contention 

While this seems to be the harder part of the discussion, I think we have enough
data to deal with it. 

In which case contention would be a big problem here? 
IIUC it would be when a lot of drain_all_stock() get running because the memory
limit is getting near. I mean, having the user to create / modify a memcg
multiple times a second for a while is not something that is expected, IMHO.

Now, if I assumed correctly and the case where contention could be a problem is
on a memcg with high memory pressure, then we have the argument that Marcelo
Tosatti brought to the discussion[P1]: using spinlocks on percpu caches for page
allocation brought better results than local_locks + schedule_work_on().

I mean, while contention would cause the cpu to wait for a while before getting
the lock for allocating a page from cache, something similar would happen with
schedule_work_on(), which would force the current task to wait while the
draining happens locally. 

What I am able to see is that, for each drain_all_stock(), for each cpu getting
drained we have the option to (a) (sometimes) wait for a lock to be freed, or
(b) wait for a whole context switch to happen.
And IIUC, (b) is much slower than (a) on average, and this is what causes the
improved performance seen in [P1].

(I mean, waiting while drain_local_stock() runs in the local CPU vs waiting for
it to run on the remote CPU may not be that different, since the cacheline is
already writen to by the remote cpu on Upstream)

Also according to test 2.2, for the patched version, drain_local_stock() have
gotten faster (much faster for 128 cpus), even though it does all the draining
instead of just scheduling it on the other cpus. 
I mean, summing that to the brief nature of local cpu functions, we may not hit
contention as much as we are expected.

##

Sorry for the long text.
I may be missing some point, please let me know if that's the case.

Thanks a lot for reviewing!
Leo

[P1]: https://lkml.org/lkml/2022/6/13/2769