On Wed, 2022-11-02 at 09:53 +0100, Michal Hocko wrote:
> On Tue 01-11-22 23:02:40, Leonardo Bras wrote:
> > Patch #1 expands housekeeping_any_cpu() so we can find housekeeping CPUs
> > closer (NUMA) to any desired CPU, instead of only the current CPU.
> >
> > ### Performance argument that motivated the change:
> > There could be an argument about why that would be needed, since the
> > current CPU is probably accessing the current cacheline, and so having a
> > CPU closer to the current one is always the best choice, as the cache
> > invalidation will take less time. OTOH, there can be cases like this one,
> > which use per-CPU variables, where up to 3 different CPUs touch the
> > cacheline:
> >
> > C1 - Isolated CPU: the per-CPU data 'belongs' to this one
> > C2 - Scheduling CPU: schedules some work to be done elsewhere, current CPU
> > C3 - Housekeeping CPU: this one will do the work
> >
> > Most of the time the cacheline is touched, it should be by C1. Sometimes
> > C2 will schedule work to run on C3, since C1 is isolated.
> >
> > If C1 and C2 are in different NUMA nodes, we could have C3 either in
> > C2's NUMA node (housekeeping_any_cpu()) or in C1's NUMA node
> > (housekeeping_any_cpu_from(C1)).
> >
> > If C3 is in C2's NUMA node, there will be a faster invalidation when C3
> > tries to get cacheline exclusivity, and then a slower invalidation when
> > this happens in C1, when it's working on its data.
> >
> > If C3 is in C1's NUMA node, there will be a slower invalidation when C3
> > tries to get cacheline exclusivity, and then a faster invalidation when
> > this happens in C1.
> >
> > The thing is: it should be better to wait less when doing kernel work
> > on an isolated CPU, even at the cost of some housekeeping CPU waiting
> > a few more cycles.
> > ###
> >
> > Patch #2 changes the locking strategy of memcg_stock_pcp->stock_lock from
> > local_lock to spinlocks, so it can later be used to do remote per-CPU
> > cache draining in patch #3. Most performance concerns are addressed in
> > the commit log.
> >
> > Patch #3 implements the remote per-CPU cache drain, making use of both
> > patches #1 and #2. Performance-wise, in non-isolated scenarios, it should
> > introduce an extra function call and a single test to check if the CPU is
> > isolated.
> >
> > In scenarios with isolation enabled at boot, it will also introduce an
> > extra test to check in the cpumask if the CPU is isolated. If it is,
> > there will also be an extra read of the cpumask to look for a
> > housekeeping CPU.

Hello Michal, thanks for reviewing!

> This is a rather deep dive in the cache line usage but the most
> important thing is really missing. Why do we want this change? From the
> context it seems that this is an actual fix for isolcpu= setup when
> remote (aka non isolated activity) interferes with isolated cpus by
> scheduling pcp charge caches on those cpus.
>
> Is this understanding correct?

That's correct! The idea is to avoid scheduling work on isolated CPUs.

> If yes, how big of a problem that is?

The use case I have been following requires both isolcpus= and PREEMPT_RT,
since the isolated CPUs will be running a real-time workload. In this
scenario, having any other work run on the isolated CPUs instead of the
real-time workload may cause the system to miss a deadline, which can be bad.

> If you want a remote draining then
> you need some sort of locking (currently we rely on local lock). How
> come this locking is not going to cause a different form of disturbance?

If I did everything right, most of the extra work should be done either on
non-isolated (housekeeping) CPUs or during a syscall. I mean, the pcp charge
cache draining will happen on a housekeeping CPU, and the locking cost should
be paid there, since we want to avoid paying it on the isolated CPUs.

I understand there will be a locking cost paid on the isolated CPUs when:
a) the isolated CPU is requesting the stock drain, and
b) the isolated CPU does a syscall and ends up using the protected structure
   for the first time after a remote drain.

Both (a) and (b) should happen during a syscall, and IIUC an RT workload
should not expect syscalls to have predictable timing, so it should be fine.
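Just to make the intended flow concrete, here is a very rough sketch of the
idea. This is *not* the code from the patches: the struct, its fields, and
schedule_drain()/drain_stock_work() are placeholders I made up for
illustration, and housekeeping_any_cpu_from() is only the helper proposed in
patch #1 with a guessed signature.

/*
 * Rough sketch only: struct layout, field names, schedule_drain() and
 * drain_stock_work() are placeholders, not the code from patches #1-#3.
 */
#include <linux/cpumask.h>
#include <linux/kernel.h>
#include <linux/percpu.h>
#include <linux/sched/isolation.h>
#include <linux/spinlock.h>
#include <linux/topology.h>
#include <linux/workqueue.h>

struct pcp_stock {
	spinlock_t lock;		/* was a local_lock before patch #2 */
	unsigned int nr_pages;		/* cached pre-charged pages */
	struct work_struct work;	/* drain work, may run remotely */
};

static DEFINE_PER_CPU(struct pcp_stock, pcp_stock);

/* Prefer a housekeeping CPU on the same NUMA node as @cpu (patch #1 idea). */
static unsigned int housekeeping_any_cpu_from(int cpu, enum hk_type type)
{
	const struct cpumask *hk = housekeeping_cpumask(type);
	unsigned int target;

	target = cpumask_any_and(cpumask_of_node(cpu_to_node(cpu)), hk);
	if (target < nr_cpu_ids)
		return target;
	return cpumask_any(hk);		/* fall back to any housekeeping CPU */
}

/* The spinlock is what makes touching another CPU's stock safe. */
static void drain_stock_work(struct work_struct *work)
{
	struct pcp_stock *stock = container_of(work, struct pcp_stock, work);
	unsigned long flags;

	spin_lock_irqsave(&stock->lock, flags);
	/* ... return stock->nr_pages to the page counter ... */
	stock->nr_pages = 0;
	spin_unlock_irqrestore(&stock->lock, flags);
}

/* Drain @cpu's stock without scheduling work on it if it is isolated. */
static void schedule_drain(int cpu)
{
	struct pcp_stock *stock = per_cpu_ptr(&pcp_stock, cpu);

	if (housekeeping_cpu(cpu, HK_TYPE_DOMAIN)) {
		/* Non-isolated CPU: keep the current behaviour. */
		schedule_work_on(cpu, &stock->work);
		return;
	}

	/* Isolated CPU: run the drain on a nearby housekeeping CPU. */
	schedule_work_on(housekeeping_any_cpu_from(cpu, HK_TYPE_DOMAIN),
			 &stock->work);
}

The point being: the only cost added to the isolated CPU is the spinlock it
already takes during a syscall, while the cpumask read to pick the drain
target is paid by whoever schedules the drain.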
Thanks for helping me explain the case!

Best regards,
Leo