Re: [PATCH v1 0/3] Avoid scheduling cache draining to isolated cpus

Michal Hocko <mhocko@xxxxxxxx> · Thu, 3 Nov 2022 16:31:47 +0100

On Thu 03-11-22 11:59:20, Leonardo Brás wrote:
> On Wed, 2022-11-02 at 09:53 +0100, Michal Hocko wrote:
> > On Tue 01-11-22 23:02:40, Leonardo Bras wrote:
> > > Patch #1 expands housekeepíng_any_cpu() so we can find housekeeping cpus
> > > closer (NUMA) to any desired CPU, instead of only the current CPU.
> > > 
> > > ### Performance argument that motivated the change:
> > > There could be an argument of why would that be needed, since the current
> > > CPU is probably acessing the current cacheline, and so having a CPU closer
> > > to the current one is always the best choice since the cache invalidation
> > > will take less time. OTOH, there could be cases like this which uses
> > > perCPU variables, and we can have up to 3 different CPUs touching the
> > > cacheline:
> > > 
> > > C1 - Isolated CPU: The perCPU data 'belongs' to this one
> > > C2 - Scheduling CPU: Schedule some work to be done elsewhere, current cpu
> > > C3 - Housekeeping CPU: This one will do the work
> > > 
> > > Most of the times the cacheline is touched, it should be by C1. Some times
> > > a C2 will schedule work to run on C3, since C1 is isolated.
> > > 
> > > If C1 and C2 are in different NUMA nodes, we could have C3 either in
> > > C2 NUMA node (housekeeping_any_cpu()) or in C1 NUMA node 
> > > (housekeeping_any_cpu_from(C1). 
> > > 
> > > If C3 is in C2 NUMA node, there will be a faster invalidation when C3
> > > tries to get cacheline exclusivity, and then a slower invalidation when
> > > this happens in C1, when it's working in its data.
> > > 
> > > If C3 is in C1 NUMA node, there will be a slower invalidation when C3
> > > tries to get cacheline exclusivity, and then a faster invalidation when
> > > this happens in C1.
> > > 
> > > The thing is: it should be better to wait less when doing kernel work
> > > on an isolated CPU, even at the cost of some housekeeping CPU waiting
> > > a few more cycles.
> > > ###
> > > 
> > > Patch #2 changes the locking strategy of memcg_stock_pcp->stock_lock from
> > > local_lock to spinlocks, so it can be later used to do remote percpu
> > > cache draining on patch #3. Most performance concerns should be pointed
> > > in the commit log.
> > > 
> > > Patch #3 implements the remote per-CPU cache drain, making use of both 
> > > patches #2 and #3. Performance-wise, in non-isolated scenarios, it should
> > > introduce an extra function call and a single test to check if the CPU is
> > > isolated. 
> > > 
> > > On scenarios with isolation enabled on boot, it will also introduce an
> > > extra test to check in the cpumask if the CPU is isolated. If it is,
> > > there will also be an extra read of the cpumask to look for a
> > > housekeeping CPU.
> > 
> 
> Hello Michael, thanks for reviewing!
> 
> > This is a rather deep dive in the cache line usage but the most
> > important thing is really missing. Why do we want this change? From the
> > context it seems that this is an actual fix for isolcpu= setup when
> > remote (aka non isolated activity) interferes with isolated cpus by
> > scheduling pcp charge caches on those cpus.
> > 
> > Is this understanding correct?
> 
> That's correct! The idea is to avoid scheduling work to isolated CPUs.
> 
> > If yes, how big of a problem that is?
> 
> The use case I have been following requires both isolcpus= and PREEMPT_RT, since
> the isolated CPUs will be running a real-time workload. In this scenario,
> getting any work done instead of the real-time workload may cause the system to
> miss a deadline, which can be bad. 

OK, I see. But is memcg charging actually a RT friendly operation in the
first place? Please note that this path can trigger memory reclaim and
that is when any RT expectations are simply going down the drain.

> >  If you want a remote draining then
> > you need some sort of locking (currently we rely on local lock). How
> > come this locking is not going to cause a different form of disturbance?
> 
> If I did everything right, most of the extra work should be done either in non-
> isolated (housekeeping) CPUs, or during a syscall. I mean, the pcp charge caches
> will be happening on a housekeeping CPU, and the locking cost should be paid
> there as we want to avoid doing that in the isolated CPUs. 
> 
> I understand there will be a locking cost being paid in the isolated CPUs when:
> a) The isolated CPU is requesting the stock drain,
> b) When the isolated CPUs do a syscall and end up using the protected structure
> the first time after a remote drain.

And anytime the charging path (consume_stock resp. refill_stock)
contends with the remote draining which is out of control of the RT
task. It is true that the RT kernel will turn that spin lock into a
sleeping RT lock and that could help with potential priority inversions
but still quite costly thing I would expect.

> Both (a) and (b) should happen during a syscall, and IIUC the a rt workload
> should not expect the syscalls to be have a predictable time, so it should be
> fine.

Now I am not sure I understand. If you do not consider charging path to
be RT sensitive then why is this needed in the first place? What else
would be populating the pcp cache on the isolated cpu? IRQs?
-- 
Michal Hocko
SUSE Labs