Re: [PATCH V7 1/2] cgroup/rstat: Avoid thundering herd problem by kswapd across NUMA nodes

Jesper Dangaard Brouer <hawk@xxxxxxxxxx> · Thu, 18 Jul 2024 10:12:39 +0200

On 17/07/2024 18.49, Yosry Ahmed wrote:
On Wed, Jul 17, 2024 at 9:36 AM Jesper Dangaard Brouer <hawk@xxxxxxxxxx> wrote:

On 17/07/2024 02.35, Yosry Ahmed wrote:
[..]

This is a clean (meaning no cadvisor interference) example of kswapd
starting simultaniously on many NUMA nodes, that in 27 out of 98 cases
hit the race (which is handled in V6 and V7).

The BPF "cnt" maps are getting cleared every second, so this
approximates per sec numbers.  This patch reduce pressure on the lock,
but we are still seeing (kfunc:vmlinux:cgroup_rstat_flush_locked) full
flushes approx 37 per sec (every 27 ms). On the positive side
ongoing_flusher mitigation stopped 98 per sec of these.

In this clean kswapd case the patch removes the lock contention issue
for kswapd. The lock_contended cases 27 seems to be all related to
handled_race cases 27.

The remaning high flush rate should also be addressed, and we should
also work on aproaches to limit this like my ealier proposal[1].

I honestly don't think a high number of flushes is a problem on its
own as long as we are not spending too much time flushing, especially
when we have magnitude-based thresholding so we know there is
something to flush (although it may not be relevant to what we are
doing).

We are "spending too much time flushing" see below.

If we keep observing a lot of lock contention, one thing that I
thought about is to have a variant of spin_lock with a timeout. This
limits the flushing latency, instead of limiting the number of flushes
(which I believe is the wrong metric to optimize).

It also seems to me that we are doing a flush each 27ms, and your
proposed threshold was once per 50ms. It doesn't seem like a
fundamental difference.

Looking at the production numbers for the time the lock is held for level 0:

@locked_time_level[0]:
[4M, 8M)     623 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@               |
[8M, 16M)    860 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16M, 32M)   295 |@@@@@@@@@@@@@@@@@                                   |
[32M, 64M)   275 |@@@@@@@@@@@@@@@@                                    |

The time is in nanosec, so M corresponds to ms (milliseconds).

With 36 flushes per second (as shown earlier) this is a flush every
27.7ms.  It is not unreasonable (from above data) that the flush time
also spend 27ms, which means that we spend a full CPU second flushing.
That is spending too much time flushing.

This around 1 sec CPU usage for kswapd is also quite clear in the
attached grafana graph for when server was rebooted into this V7 kernel.

I choose 50ms because at the time I saw flush taking around 30ms, and I
view the flush time as queue service-time.  When arrival-rate is faster
than service-time, then a queue will form.  So, choosing 50ms as
arrival-rate gave me some headroom.  As I mentioned earlier, optimally
this threshold should be dynamically measured.

Thanks for the data. Yeah this doesn't look good.

Does it make sense to just throttle flushers at some point to increase
the chances of coalescing multiple flushers?

Otherwise I think it makes sense in this case to ratelimit flushing in
general. Although instead of just checking how much time elapsed since
the last flush, can we use something like __ratelimit()?

This will make sure that we skip flushes when we actually have a high
rate of flushing over a period of time, not because two flushes
happened to be requested in close succession and the flushing rate is
generally low.

I really think "time elapsed since the last flush" is the right solution
here.  As, we *do* want to catch the case you describe "two flushes
happened to be requested in close succession and the flushing rate is
generally low."

(After this patch fixing the lock contention triggered by kswapd).
The remaining problem with kswapd is that those flushes that doesn't
"collide" on the lock, will be flushing in close succession.  And we
likely have a generally low flushing rate, until kswapd starts up.

Some production data from a "slow" period where only kswapd is active:

05:59:32 @ongoing_flusher_cnt[kswapd11]: 1
@ongoing_flusher_cnt[kswapd7]: 1
@ongoing_flusher_cnt[kswapd3]: 1
@ongoing_flusher_cnt[kswapd5]: 1
@ongoing_flusher_cnt[kswapd10]: 1
@ongoing_flusher_cnt[kswapd6]: 2
@ongoing_flusher_cnt[kswapd8]: 2
@ongoing_flusher_cnt[kswapd1]: 2
@ongoing_flusher_cnt[kswapd9]: 2
@ongoing_flusher_cnt[kswapd0]: 2
@ongoing_flusher_cnt[kswapd2]: 2
@ongoing_flusher_cnt[kswapd4]: 2
@ongoing_flusher_cnt[handled_race]: 2
@ongoing_flusher_cnt[all]: 14
@cnt[tracepoint:cgroup:cgroup_rstat_lock_contended]: 2
@cnt[tracepoint:cgroup:cgroup_ongoing_flusher_wait]: 10
@cnt[kfunc:vmlinux:cgroup_rstat_flush_locked]: 43
@cnt[tracepoint:cgroup:cgroup_rstat_locked]: 51

We observe that ongoing_flusher scheme saves/avoids 14 of the flushes
great, but we still have 43 flushes in this period.  I think only kswapd
is doing the flushing here, so I claim 41 flushes are likely unnecessary
(as there are indication of 2 kswapd startups in the period).  Also
observe that some of the kswapdNN processes only have
ongoing_flusher_cnt 1, which indicate they didn't fully "collide" with
an ongoing flusher, while others have 2.

--Jesper