On Tue, Jul 16, 2024 at 8:00 PM Waiman Long <longman@xxxxxxxxxx> wrote:
>
> On 7/16/24 20:35, Yosry Ahmed wrote:
> > [..]
> >>
> >> This is a clean (meaning no cadvisor interference) example of kswapd
> >> starting simultaneously on many NUMA nodes, which in 27 out of 98
> >> cases hit the race (which is handled in V6 and V7).
> >>
> >> The BPF "cnt" maps are cleared every second, so this approximates
> >> per-second numbers. This patch reduces pressure on the lock, but we
> >> are still seeing (kfunc:vmlinux:cgroup_rstat_flush_locked) full
> >> flushes approx 37 times per sec (every 27 ms). On the positive side,
> >> the ongoing_flusher mitigation stopped 98 of these per sec.
> >>
> >> In this clean kswapd case the patch removes the lock contention issue
> >> for kswapd. The 27 lock_contended cases all seem to correspond to the
> >> 27 handled_race cases.
> >>
> >> The remaining high flush rate should also be addressed, and we should
> >> also work on approaches to limit this, like my earlier proposal[1].
> > I honestly don't think a high number of flushes is a problem on its
> > own as long as we are not spending too much time flushing, especially
> > when we have magnitude-based thresholding so we know there is
> > something to flush (although it may not be relevant to what we are
> > doing).
> >
> > If we keep observing a lot of lock contention, one thing that I
> > thought about is to have a variant of spin_lock with a timeout. This
> > limits the flushing latency, instead of limiting the number of flushes
> > (which I believe is the wrong metric to optimize).
>
> Except for semaphores, none of our locking primitives allow for a
> timeout parameter. For sleeping locks, I don't think it would be hard
> to add variants with a timeout parameter, but not for the spinning
> locks.

Thanks for pointing this out. I assume a mutex with a timeout would also
address the priority inversion problem that Shakeel was talking about,
AFAICT.
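
For reference, a minimal sketch of the semaphore route mentioned above,
assuming a dedicated binary semaphore (flush_sem and flush_with_timeout
are hypothetical names for illustration, not existing kernel symbols);
down_timeout() is the one in-tree acquisition primitive that already
takes a timeout:

  #include <linux/semaphore.h>
  #include <linux/jiffies.h>

  /* Hypothetical binary semaphore guarding the flush path. */
  static DEFINE_SEMAPHORE(flush_sem, 1);

  static void flush_with_timeout(void)
  {
          /* Give up after 10 ms instead of blocking indefinitely. */
          if (down_timeout(&flush_sem, msecs_to_jiffies(10)))
                  return;         /* -ETIME: another flusher holds it */

          /* ... do the flush work here ... */

          up(&flush_sem);
  }

A caller that times out would simply skip the flush and rely on a
concurrent or later flusher, which fits the ongoing_flusher idea above.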
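And a rough sketch of what an open-coded spin_lock-with-timeout could
look like using the existing trylock primitive (no such variant exists
in-tree, as noted above; cgroup_rstat_trylock_timeout is a made-up name):

  #include <linux/spinlock.h>
  #include <linux/jiffies.h>
  #include <linux/processor.h>

  /* Hypothetical: bound the time spent spinning on a lock. */
  static bool cgroup_rstat_trylock_timeout(spinlock_t *lock,
                                           unsigned int timeout_ms)
  {
          unsigned long deadline = jiffies + msecs_to_jiffies(timeout_ms);

          do {
                  if (spin_trylock_irq(lock))
                          return true;
                  cpu_relax();
          } while (time_before(jiffies, deadline));

          return false;           /* caller skips the flush on timeout */
  }

One caveat is that such a trylock loop forfeits the queueing/fairness of
the underlying qspinlock, so starvation under heavy contention would
need to be considered.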