On Tue, 2023-01-31 at 08:35 -0300, Marcelo Tosatti wrote:
> On Fri, Jan 27, 2023 at 03:55:39AM -0300, Leonardo Brás wrote:
> > On Thu, 2023-01-26 at 20:13 +0100, Michal Hocko wrote:
> > > On Thu 26-01-23 15:14:25, Marcelo Tosatti wrote:
> > > > On Thu, Jan 26, 2023 at 08:45:36AM +0100, Michal Hocko wrote:
> > > > > On Wed 25-01-23 15:22:00, Marcelo Tosatti wrote:
> > > > > [...]
> > > > > > Remote draining reduces interruptions whether the CPU
> > > > > > is marked as isolated or not:
> > > > > >
> > > > > > - Allows isolated CPUs to benefit from pcp caching.
> > > > > > - Removes the interruption to non-isolated CPUs. See for example
> > > > > >
> > > > > >   https://lkml.org/lkml/2022/6/13/2769
> > > > >
> > > > > This is talking about the page allocator per-cpu caches, right? In
> > > > > this patch we are talking about memcg pcp caches. Are you sure the
> > > > > same applies here?
> > > >
> > > > Both can stall the users of the drain operation.
> > >
> > > Yes. But it is important to consider who those users are. We are
> > > draining when
> > > - we are charging and the limit is hit, so that memory reclaim
> > >   has to be triggered.
> > > - hard/high limits are set and require memory reclaim.
> > > - force_empty - full memory reclaim for a memcg.
> > > - memcg offlining - cgroup removal - quite a heavy operation as well.
> > > All of those can be really costly kernel operations, and they affect
> > > the isolated cpu only if the same memcg is used by both isolated and
> > > non-isolated cpus. In other words, those costly operations would have
> > > to be triggered from non-isolated cpus, and those are expected to be
> > > stalled. It is the side effect of the scheduled local cpu draining
> > > that affects the isolated cpu as well.
> > >
> > > Is that more clear?
> >
> > I think so, please help me check:

Michal, Roman: could you please review my argumentation below, so I can
understand what exactly is wrong?

> > IIUC, we can approach this by dividing the problem into two working modes:
> > 1 - Normal, meaning no drain_all_stock() running.
> > 2 - Draining, grouping together pre-OOM and userspace 'config': changing,
> >     destroying, or reconfiguring a memcg.
> >
> > For (1), we will (ideally) have only the local cpu working on the percpu
> > struct. This mode will not have any kind of contention, because each CPU
> > will hold its own spinlock only.
> >
> > For (2), we will have a lot of drain_all_stock() running. This will mean
> > a lot of schedule_work_on() running (on upstream), or possibly contention,
> > i.e. local cpus having to wait for a lock to get their cache, on the patch
> > proposal.
> >
> > Ok, given the above is correct:
> >
> > # Some arguments point out that (1) becomes slower with this patch.
> >
> > This is partially true: while test 2.2 showed that the local cpu
> > functions' running time became slower by a few cycles, test 2.4 shows
> > that the userspace perception of it was that syscalls and pagefaulting
> > actually became faster.
> >
> > During some debugging before getting the performance numbers of test 2.4,
> > I noticed that the 'syscall + write' test calls all of those functions
> > that became slower in test 2.2. Those functions were called multiple
> > millions of times during a single test, and still the patched version
> > performed faster in test 2.4 than the upstream version. Maybe the
> > functions became slower, but overall their usage in the usual context
> > became faster.
> >
> > Isn't that a small improvement?
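
To make mode (1) more concrete, here is a rough, hypothetical sketch (not the
actual patch; names and details are simplified assumptions) of what the
consume path could look like once the local_lock_t in memcg_stock_pcp is
replaced by a spinlock_t:

static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);

struct memcg_stock_pcp {
	spinlock_t stock_lock;		/* assumption: was a local_lock_t */
	struct mem_cgroup *cached;
	unsigned int nr_pages;
	/* ... */
};

static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
{
	struct memcg_stock_pcp *stock;
	unsigned long flags;
	bool ret = false;

	if (nr_pages > MEMCG_CHARGE_BATCH)
		return ret;

	/* Disable preemption so we really look at this CPU's stock. */
	stock = get_cpu_ptr(&memcg_stock);

	/*
	 * In mode (1) only the local CPU ever takes this lock, so it is
	 * normally uncontended; contention can only show up while a remote
	 * drain_all_stock() is touching this CPU's stock.
	 */
	spin_lock_irqsave(&stock->stock_lock, flags);
	if (memcg == stock->cached && stock->nr_pages >= nr_pages) {
		stock->nr_pages -= nr_pages;
		ret = true;
	}
	spin_unlock_irqrestore(&stock->stock_lock, flags);

	put_cpu_ptr(&memcg_stock);
	return ret;
}

Presumably the few extra cycles seen in test 2.2 come from the
spin_lock/unlock pair above replacing what is essentially a plain
local_irq_save() on !PREEMPT_RT.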

> > # Regarding (2), I notice that we fear contention
> >
> > While this seems to be the harder part of the discussion, I think we have
> > enough data to deal with it.
> >
> > In which case would contention be a big problem here?
> > IIUC it would be when a lot of drain_all_stock() calls are running because
> > the memory limit is getting near. I mean, having the user create / modify
> > a memcg multiple times a second for a while is not something that is
> > expected, IMHO.
>
> Considering that the use of spinlocks with remote draining is the more
> general solution, what would be a test-case to demonstrate a contention
> problem?

IIUC we could try to reproduce a memory-tight workload that keeps allocating /
freeing from different cpus (without hitting OOM).

Michal, Roman: is that correct? Do you have any workload like that, so we can
test it?

> > Now, if I assumed correctly, and the case where contention could be a
> > problem is a memcg under high memory pressure, then we have the argument
> > that Marcelo Tosatti brought to the discussion [P1]: using spinlocks on
> > percpu caches for page allocation brought better results than local_locks
> > + schedule_work_on().
> >
> > I mean, while contention would cause the cpu to wait for a while before
> > getting the lock for allocating a page from the cache, something similar
> > would happen with schedule_work_on(), which forces the current task to
> > wait while the draining happens locally.
> >
> > What I am able to see is that, for each drain_all_stock(), for each cpu
> > getting drained, we have the option to either (a) (sometimes) wait for a
> > lock to be released, or (b) wait for a whole context switch to happen.
> > And IIUC, (b) is much slower than (a) on average, and this is what causes
> > the improved performance seen in [P1].
>
> Moreover, there is a delay for the remote CPU to execute a work item
> (there could be high priority tasks, IRQ handling on the remote CPU,
> which delays execution of the work item even further).
>
> Also, the other point is that the spinlock already exists for
> PREEMPT_RT (which means that any potential contention issue
> with the spinlock today is limited to PREEMPT_RT users).
>
> So it would be good to point out a specific problematic
> testcase/scenario with using the spinlock in this particular case?

Oh, that's a good point. But IIUC the spin_lock() replaces a local_lock() in
memcg, meaning it will still normally be taken on the same CPU, and there
should only be any contention if the memcg local cpu functions get preempted
by something that also calls a memcg local cpu function.

(To make the (a) vs (b) comparison above more concrete, I put a rough sketch
of what remote draining with the spinlock could look like at the end of this
mail.)

> > (I mean, waiting while drain_local_stock() runs on the local CPU vs
> > waiting for it to run on the remote CPU may not be that different, since
> > the cacheline is already written to by the remote cpu on upstream.)
> >
> > Also, according to test 2.2, in the patched version drain_local_stock()
> > has gotten faster (much faster for 128 cpus), even though it does all the
> > draining instead of just scheduling it on the other cpus.
> > I mean, adding that to the brief nature of the local cpu functions, we
> > may not hit contention as much as expected.
> >
> > ##
> >
> > Sorry for the long text.
> > I may be missing some point, please let me know if that's the case.
> >
> > Thanks a lot for reviewing!
> > Leo
> >
> > [P1]: https://lkml.org/lkml/2022/6/13/2769

Thanks!
Leo
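
---

The sketch mentioned above: a rough, hypothetical version of drain_all_stock()
(simplified, not the actual patch; the FLUSHING_CACHED_CHARGE / mutex handling
of the real function is omitted), where the caller takes each CPU's stock_lock
and drains the stock directly instead of queueing a work item on that CPU:

static void drain_all_stock(struct mem_cgroup *root_memcg)
{
	int cpu;

	for_each_online_cpu(cpu) {
		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
		struct mem_cgroup *memcg;
		unsigned long flags;
		bool flush = false;

		rcu_read_lock();
		memcg = READ_ONCE(stock->cached);
		if (memcg && stock->nr_pages &&
		    mem_cgroup_is_descendant(memcg, root_memcg))
			flush = true;
		rcu_read_unlock();

		if (!flush)
			continue;

		/*
		 * Option (a): possibly spin for a short while until the
		 * owning CPU releases its stock_lock, instead of
		 * option (b): schedule_work_on() plus waiting for a context
		 * switch on that CPU, which also interrupts isolated CPUs.
		 */
		spin_lock_irqsave(&stock->stock_lock, flags);
		drain_stock(stock);
		spin_unlock_irqrestore(&stock->stock_lock, flags);
	}
}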