On Fri, Jan 27, 2023 at 03:55:39AM -0300, Leonardo Brás wrote: > On Thu, 2023-01-26 at 20:13 +0100, Michal Hocko wrote: > > On Thu 26-01-23 15:14:25, Marcelo Tosatti wrote: > > > On Thu, Jan 26, 2023 at 08:45:36AM +0100, Michal Hocko wrote: > > > > On Wed 25-01-23 15:22:00, Marcelo Tosatti wrote: > > > > [...] > > > > > Remote draining reduces interruptions whether CPU > > > > > is marked as isolated or not: > > > > > > > > > > - Allows isolated CPUs from benefiting of pcp caching. > > > > > - Removes the interruption to non isolated CPUs. See for example > > > > > > > > > > https://lkml.org/lkml/2022/6/13/2769 > > > > > > > > This is talking about page allocato per cpu caches, right? In this patch > > > > we are talking about memcg pcp caches. Are you sure the same applies > > > > here? > > > > > > Both can stall the users of the drain operation. > > > > Yes. But it is important to consider who those users are. We are > > draining when > > - we are charging and the limit is hit so that memory reclaim > > has to be triggered. > > - hard, high limits are set and require memory reclaim. > > - force_empty - full memory reclaim for a memcg > > - memcg offlining - cgroup removel - quite a heavy operation as > > well. > > all those could be really costly kernel operations and they affect > > isolated cpu only if the same memcg is used by both isolated and non-isolated > > cpus. In other words those costly operations would have to be triggered > > from non-isolated cpus and those are to be expected to be stalled. It is > > the side effect of the local cpu draining that is scheduled that affects > > the isolated cpu as well. > > > > Is that more clear? > > I think so, please help me check: > > IIUC, we can approach this by dividing the problem in two working modes: > 1 - Normal, meaning no drain_all_stock() running. > 2 - Draining, grouping together pre-OOM and userspace 'config' : changing, > destroying, reconfiguring a memcg. > > For (1), we will have (ideally) only local cpu working on the percpu struct. > This mode will not have any kind of contention, because each CPU will hold it's > own spinlock only. > > For (2), we will have a lot of drain_all_stock() running. This will mean a lot > of schedule_work_on() running (on upstream) or possibly causing contention, i.e. > local cpus having to wait for a lock to get their cache, on the patch proposal. > > Ok, given the above is correct: > > # Some arguments point that (1) becomes slower with this patch. > > This is partially true: while test 2.2 pointed that local cpu functions running > time had became slower by a few cycles, test 2.4 points that the userspace > perception of it was that the syscalls and pagefaulting actually became faster: > > During some debugging tests before getting the performance on test 2.4, I > noticed that the 'syscall + write' test would call all those functions that > became slower on test 2.2. Those functions were called multiple millions of > times during a single test, and still the patched version performance test > returned faster for test 2.4 than upstream version. Maybe the functions became > slower, but overall the usage of them in the usual context became faster. > > Is not that a small improvement? > > # Regarding (2), I notice that we fear contention > > While this seems to be the harder part of the discussion, I think we have enough > data to deal with it. > > In which case contention would be a big problem here? > IIUC it would be when a lot of drain_all_stock() get running because the memory > limit is getting near. I mean, having the user to create / modify a memcg > multiple times a second for a while is not something that is expected, IMHO. Considering that the use of spinlocks with remote draining is the more general solution, what would be a test-case to demonstrate a contention problem? > Now, if I assumed correctly and the case where contention could be a problem is > on a memcg with high memory pressure, then we have the argument that Marcelo > Tosatti brought to the discussion[P1]: using spinlocks on percpu caches for page > allocation brought better results than local_locks + schedule_work_on(). > > I mean, while contention would cause the cpu to wait for a while before getting > the lock for allocating a page from cache, something similar would happen with > schedule_work_on(), which would force the current task to wait while the > draining happens locally. > > What I am able to see is that, for each drain_all_stock(), for each cpu getting > drained we have the option to (a) (sometimes) wait for a lock to be freed, or > (b) wait for a whole context switch to happen. > And IIUC, (b) is much slower than (a) on average, and this is what causes the > improved performance seen in [P1]. Moreover, there is a delay for the remote CPU to execute a work item (there could be high priority tasks, IRQ handling on the remote CPU, which delays execution of the work item even further). Also, the other point is that the spinlock already exists for PREEMPT_RT (which means that any potential contention issue with the spinlock today is limited to PREEMPT_RT users). So it would be good to point out a specific problematic testcase/scenario with using the spinlock in this particular case? > (I mean, waiting while drain_local_stock() runs in the local CPU vs waiting for > it to run on the remote CPU may not be that different, since the cacheline is > already writen to by the remote cpu on Upstream) > > Also according to test 2.2, for the patched version, drain_local_stock() have > gotten faster (much faster for 128 cpus), even though it does all the draining > instead of just scheduling it on the other cpus. > I mean, summing that to the brief nature of local cpu functions, we may not hit > contention as much as we are expected. > > ## > > Sorry for the long text. > I may be missing some point, please let me know if that's the case. > > Thanks a lot for reviewing! > Leo > > [P1]: https://lkml.org/lkml/2022/6/13/2769 > >