Re: [PATCH V7 1/2] cgroup/rstat: Avoid thundering herd problem by kswapd across NUMA nodes

Waiman Long <longman@xxxxxxxxxx> · Tue, 16 Jul 2024 23:00:38 -0400

On 7/16/24 20:35, Yosry Ahmed wrote:
[..]

This is a clean (meaning no cadvisor interference) example of kswapd
starting simultaniously on many NUMA nodes, that in 27 out of 98 cases
hit the race (which is handled in V6 and V7).

The BPF "cnt" maps are getting cleared every second, so this
approximates per sec numbers.  This patch reduce pressure on the lock,
but we are still seeing (kfunc:vmlinux:cgroup_rstat_flush_locked) full
flushes approx 37 per sec (every 27 ms). On the positive side
ongoing_flusher mitigation stopped 98 per sec of these.

In this clean kswapd case the patch removes the lock contention issue
for kswapd. The lock_contended cases 27 seems to be all related to
handled_race cases 27.

The remaning high flush rate should also be addressed, and we should
also work on aproaches to limit this like my ealier proposal[1].
I honestly don't think a high number of flushes is a problem on its
own as long as we are not spending too much time flushing, especially
when we have magnitude-based thresholding so we know there is
something to flush (although it may not be relevant to what we are
doing).

If we keep observing a lot of lock contention, one thing that I
thought about is to have a variant of spin_lock with a timeout. This
limits the flushing latency, instead of limiting the number of flushes
(which I believe is the wrong metric to optimize).

Except for semaphore, none of our locking primitives allow for a timeout 
parameter. For sleeping locks, I don't think it is hard to add variants 
with timeout parameter, but not the spinning locks.

Cheers,
Longman