[..]
>
> This is a clean (meaning no cadvisor interference) example of kswapd
> starting simultaneously on many NUMA nodes, that in 27 out of 98 cases
> hit the race (which is handled in V6 and V7).
>
> The BPF "cnt" maps are getting cleared every second, so this
> approximates per-sec numbers. This patch reduces pressure on the lock,
> but we are still seeing (kfunc:vmlinux:cgroup_rstat_flush_locked) full
> flushes approx 37 per sec (every 27 ms). On the positive side, the
> ongoing_flusher mitigation stopped 98 per sec of these.
>
> In this clean kswapd case the patch removes the lock contention issue
> for kswapd. The 27 lock_contended cases all seem to be related to the
> 27 handled_race cases.
>
> The remaining high flush rate should also be addressed, and we should
> also work on approaches to limit this, like my earlier proposal [1].

I honestly don't think a high number of flushes is a problem on its own
as long as we are not spending too much time flushing, especially when
we have magnitude-based thresholding so we know there is something to
flush (although it may not be relevant to what we are doing).

If we keep observing a lot of lock contention, one thing that I thought
about is to have a variant of spin_lock with a timeout. This limits the
flushing latency, instead of limiting the number of flushes (which I
believe is the wrong metric to optimize).

It also seems to me that we are doing a flush every 27 ms, while your
proposed threshold was once per 50 ms. That doesn't seem like a
fundamental difference.

I am also wondering how many more flushes could be skipped if we handle
the case of multiple ongoing flushers (whether by using a mutex, or
making it a per-cgroup property as I suggested earlier).