On Thu, 2022-11-03 at 16:31 +0100, Michal Hocko wrote:
> On Thu 03-11-22 11:59:20, Leonardo Brás wrote:
> > On Wed, 2022-11-02 at 09:53 +0100, Michal Hocko wrote:
> > > On Tue 01-11-22 23:02:40, Leonardo Bras wrote:
> > > > Patch #1 expands housekeeping_any_cpu() so we can find housekeeping CPUs
> > > > closer (NUMA) to any desired CPU, instead of only the current CPU.
> > > > 
> > > > ### Performance argument that motivated the change:
> > > > There could be an argument about why that would be needed, since the
> > > > current CPU is probably accessing the current cacheline, and so having a
> > > > CPU closer to the current one is always the best choice, since the cache
> > > > invalidation will take less time. OTOH, there could be cases like this
> > > > one, which uses per-CPU variables, where we can have up to 3 different
> > > > CPUs touching the cacheline:
> > > > 
> > > > C1 - Isolated CPU: the per-CPU data 'belongs' to this one
> > > > C2 - Scheduling CPU: the current CPU, schedules some work to be done elsewhere
> > > > C3 - Housekeeping CPU: this one will do the work
> > > > 
> > > > Most of the time the cacheline is touched, it should be by C1. Sometimes
> > > > C2 will schedule work to run on C3, since C1 is isolated.
> > > > 
> > > > If C1 and C2 are in different NUMA nodes, we could have C3 either in
> > > > C2's NUMA node (housekeeping_any_cpu()) or in C1's NUMA node
> > > > (housekeeping_any_cpu_from(C1)).
> > > > 
> > > > If C3 is in C2's NUMA node, there will be a faster invalidation when C3
> > > > tries to get cacheline exclusivity, and then a slower invalidation when
> > > > this happens in C1, when it's working on its data.
> > > > 
> > > > If C3 is in C1's NUMA node, there will be a slower invalidation when C3
> > > > tries to get cacheline exclusivity, and then a faster invalidation when
> > > > this happens in C1.
> > > > 
> > > > The thing is: it should be better to wait less when doing kernel work
> > > > on an isolated CPU, even at the cost of some housekeeping CPU waiting
> > > > a few more cycles.
> > > > ###
> > > > 
> > > > Patch #2 changes the locking strategy of memcg_stock_pcp->stock_lock from
> > > > local_lock to spinlocks, so it can later be used to do remote per-CPU
> > > > cache draining in patch #3. Most performance concerns are addressed in
> > > > the commit log.
> > > > 
> > > > Patch #3 implements the remote per-CPU cache drain, making use of both
> > > > patches #1 and #2. Performance-wise, in non-isolated scenarios, it should
> > > > introduce an extra function call and a single test to check if the CPU is
> > > > isolated.
> > > > 
> > > > In scenarios with isolation enabled at boot, it will also introduce an
> > > > extra test to check in the cpumask whether the CPU is isolated. If it is,
> > > > there will also be an extra read of the cpumask to look for a
> > > > housekeeping CPU.
> > 
> > Hello Michal, thanks for reviewing!
> > 
> > > This is a rather deep dive in the cache line usage but the most
> > > important thing is really missing. Why do we want this change? From the
> > > context it seems that this is an actual fix for isolcpu= setup when
> > > remote (aka non isolated activity) interferes with isolated cpus by
> > > scheduling pcp charge caches on those cpus.
> > > 
> > > Is this understanding correct?
> > 
> > That's correct! The idea is to avoid scheduling work to isolated CPUs.
> > 
> > > If yes, how big of a problem that is?
> > 
> > The use case I have been following requires both isolcpus= and PREEMPT_RT,
> > since the isolated CPUs will be running a real-time workload. In this
> > scenario, getting any other work done instead of the real-time workload may
> > cause the system to miss a deadline, which can be bad.
> 
> OK, I see. But is memcg charging actually a RT friendly operation in the
> first place? Please note that this path can trigger memory reclaim and
> that is when any RT expectations are simply going down the drain.

I understand: the time spent charging is unpredictable, as you said, since a
lot of slow stuff may or may not happen.

> > > If you want a remote draining then
> > > you need some sort of locking (currently we rely on local lock). How
> > > come this locking is not going to cause a different form of disturbance?
> > 
> > If I did everything right, most of the extra work should be done either in
> > non-isolated (housekeeping) CPUs, or during a syscall. I mean, the pcp
> > charge caches will be happening on a housekeeping CPU, and the locking cost
> > should be paid there as we want to avoid doing that in the isolated CPUs.

Sorry, I think this caused a misunderstanding. I meant: "the pcp charge cache
drain will be happening on a housekeeping CPU, ..."

> > I understand there will be a locking cost being paid in the isolated CPUs
> > when:
> > a) the isolated CPU is requesting the stock drain, and
> > b) the isolated CPU does a syscall and ends up using the protected
> >    structure for the first time after a remote drain.
> 
> And anytime the charging path (consume_stock resp. refill_stock)
> contends with the remote draining which is out of control of the RT
> task. It is true that the RT kernel will turn that spin lock into a
> sleeping RT lock and that could help with potential priority inversions
> but still quite costly thing I would expect.
> 
> > Both (a) and (b) should happen during a syscall, and IIUC the RT workload
> > should not expect syscalls to have a predictable time, so it should be
> > fine.
> 
> Now I am not sure I understand. If you do not consider charging path to
> be RT sensitive then why is this needed in the first place? What else
> would be populating the pcp cache on the isolated cpu? IRQs?

I am mostly trying to deal with drain_all_stock() calling schedule_work_on()
on isolated CPUs. Since the scheduled drain_local_stock() will be competing
for CPU time with the RT workload, we can get preemption of the RT workload,
which is a problem for meeting the deadlines.

One way I thought of solving that was introducing a remote drain, which
requires a different locking strategy, since not all accesses to the pcp
caches would happen on the local CPU (there is a rough sketch of the idea at
the end of this mail).

Then I tried to weigh the costs of this, so the solution introduces as little
overhead as possible in no-isolation scenarios. Also, for isolation scenarios,
I tried to put most of the overhead on the housekeeping CPUs, and the rest on
syscalls, which are also expected to be non-predictable.

Not sure if that answers your question, though. Please let me know in case I
missed anything.

Thanks for helping me make it more clear!

Best regards,
Leo
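
P.S.: In case it helps make the remote drain idea more concrete, below is a
rough, simplified sketch of how drain_all_stock() could avoid scheduling work
on isolated CPUs. This is not the actual patch: it leaves out the checks the
real drain_all_stock() does (e.g. whether the cached memcg is under
root_memcg, or whether a drain is already in flight), it assumes stock_lock
was already converted to a spinlock as per patch #2, and the
housekeeping_cpu(cpu, HK_TYPE_DOMAIN) test is just one possible way of
detecting an isolated CPU here.

static void drain_all_stock(struct mem_cgroup *root_memcg)
{
        int cpu;

        for_each_online_cpu(cpu) {
                struct memcg_stock_pcp *stock = per_cpu_ptr(&memcg_stock, cpu);
                unsigned long flags;

                /* ... the usual "does this CPU cache something under
                 * root_memcg / is a drain already pending" checks ... */

                if (housekeeping_cpu(cpu, HK_TYPE_DOMAIN)) {
                        /* Housekeeping CPU: keep the current behavior and
                         * let the CPU drain its own stock from a work item. */
                        schedule_work_on(cpu, &stock->work);
                } else {
                        /*
                         * Isolated CPU: drain its stock remotely, under the
                         * spinlock from patch #2, so no work item is ever
                         * scheduled there to preempt the RT task.
                         */
                        spin_lock_irqsave(&stock->stock_lock, flags);
                        drain_stock(stock);
                        spin_unlock_irqrestore(&stock->stock_lock, flags);
                }
        }
}

The point being: on a kernel without isolation configured, the only extra
cost is the isolation test, while on an isolated setup the locking/draining
cost is paid by the housekeeping CPU running drain_all_stock().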