Re: Advice on cgroup rstat lock

Waiman Long <longman@xxxxxxxxxx> · Tue, 9 Apr 2024 11:37:31 -0400

On 4/9/24 07:08, Jesper Dangaard Brouer wrote:
Let move this discussion upstream.

On 22/03/2024 19.32, Yosry Ahmed wrote:
[..]
There was a couple of series that made all calls to
cgroup_rstat_flush() sleepable, which allows the lock to be dropped
(and IRQs enabled) in between CPU iterations. This fixed a similar
problem that we used to face (except in our case, we saw hard lockups
in extreme scenarios):
https://lore.kernel.org/linux-mm/20230330191801.1967435-1-yosryahmed@xxxxxxxxxx/ 

https://lore.kernel.org/lkml/20230421174020.2994750-1-yosryahmed@xxxxxxxxxx/ 

I've only done the 6.6 backport, and these were in 6.5/6.6.

Given I have these in my 6.6 kernel. You are basically saying I should
be able to avoid IRQ-disable for the lock, right?

My main problem with the global cgroup_rstat_lock[3] is it disables IRQs
and (thereby also) BH/softirq (spin_lock_irq).  This cause production
issues elsewhere, e.g. we are seeing network softirq "not-able-to-run"
latency issues (debug via softirq_net_latency.bt [5]).

  [3] 
https://elixir.bootlin.com/linux/v6.9-rc3/source/kernel/cgroup/rstat.c#L10
  [5] 
https://github.com/xdp-project/xdp-project/blob/master/areas/latency/softirq_net_latency.bt 

And between 6.1 to 6.6 we did observe an improvement in this area.
(Maybe I don't have to do the 6.1 backport if the 6.6 release plan 
progress)

I've had a chance to get running in prod for 6.6 backport.
As you can see in attached grafana heatmap pictures, we do observe an
improved/reduced softirq wait time.
These softirq "not-able-to-run" outliers is *one* of the prod issues we
observed.  As you can see, I still have other areas to improve/fix.

I am not very familiar with such heatmaps, but I am glad there is an
improvement with 6.6 and the backports. Let me know if there is
anything I could do to help with your effort.

The heatmaps give me an overview, but I needed a debugging tool, so I
developed some bpftrace scripts [1][2] I'm running on production.
To measure how long time we hold the cgroup rstat lock (results below).
Adding ACME and Daniel as I hope there is an easier way to measure lock
hold time and congestion. Notice tricky release/yield in
cgroup_rstat_flush_locked[4].

My production results on 6.6 with backported patches (below signature)
vs a our normal 6.6 kernel, with script [2]. The `@lock_time_hist_ns`
shows how long time the lock+IRQs were disabled (taking into account it
can be released in the loop [4]).

Patched kernel:

21:49:02  time elapsed: 43200 sec
@lock_time_hist_ns:
[2K, 4K)              61 |      |
[4K, 8K)             734 |      |
[8K, 16K)         121500 |@@@@@@@@@@@@@@@@      |
[16K, 32K)        385714 
|@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32K, 64K)        145600 |@@@@@@@@@@@@@@@@@@@      |
[64K, 128K)       156873 |@@@@@@@@@@@@@@@@@@@@@      |
[128K, 256K)      261027 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[256K, 512K)      291986 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      |
[512K, 1M)        101859 |@@@@@@@@@@@@@      |
[1M, 2M)           19866 |@@      |
[2M, 4M)           10146 |@      |
[4M, 8M)           30633 |@@@@      |
[8M, 16M)          40365 |@@@@@      |
[16M, 32M)         21650 |@@      |
[32M, 64M)          5842 |      |
[64M, 128M)            8 |      |

And normal 6.6 kernel:

21:48:32  time elapsed: 43200 sec
@lock_time_hist_ns:
[1K, 2K)              25 |      |
[2K, 4K)            1146 |      |
[4K, 8K)           59397 |@@@@      |
[8K, 16K)         571528 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      |
[16K, 32K)        542648 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      |
[32K, 64K)        202810 |@@@@@@@@@@@@@      |
[64K, 128K)       134564 |@@@@@@@@@      |
[128K, 256K)       72870 |@@@@@      |
[256K, 512K)       56914 |@@@      |
[512K, 1M)         83140 |@@@@@      |
[1M, 2M)          170514 |@@@@@@@@@@@      |
[2M, 4M)          396304 |@@@@@@@@@@@@@@@@@@@@@@@@@@@      |
[4M, 8M)          755537 
|@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8M, 16M)         231222 |@@@@@@@@@@@@@@@      |
[16M, 32M)         76370 |@@@@@      |
[32M, 64M)          1043 |      |
[64M, 128M)           12 |      |

For the unpatched kernel we see more events in 4ms to 8ms bucket than
any other bucket.
For patched kernel, we clearly see a significant reduction of events in
the 4 ms to 64 ms area, but we still have some events in this area.  I'm
very happy to see these patches improves the situation.  But for network
processing I'm not happy to see events in area 16ms to 128ms area.  If
we can just avoid disabling IRQs/softirq for the lock, I would be happy.

How far can we go... could cgroup_rstat_lock be converted to a mutex?

The cgroup_rstat_lock was originally a mutex. It was converted to a 
spinlock in commit 0fa294fb1985 ("group: Replace cgroup_rstat_mutex with 
a spinlock"). Irq was disabled to enable calling from atomic context. 
Since commit 0a2dc6ac3329 ("cgroup: remove 
cgroup_rstat_flush_atomic()"), the rstat API hadn't been called from 
atomic context anymore. Theoretically, we could change it back to a 
mutex or not disabling interrupt. That will require that the API cannot 
be called from atomic context going forward.

Cheers,
Longman