On Mon, May 06, 2024 at 02:03:47PM +0200, Jesper Dangaard Brouer wrote:
>
>
> On 03/05/2024 21.18, Shakeel Butt wrote:
[...]
> >
> > Hmm 128 usec is actually unexpectedly high.
> >
> > What does the cgroup hierarchy on your system look like?
>
> I didn't design this, so hopefully my co-workers can help me out here?
> (To @Daniel or @Jon)
>
> My low level view is that there are 17 top-level directories in
> /sys/fs/cgroup/.
> There are 649 cgroups (counting occurrences of memory.stat).
> Two directories contain most of them:
>  - /sys/fs/cgroup/system.slice = 379
>  - /sys/fs/cgroup/production.slice = 233
>    - (production.slice has two levels of directories)
>  - remaining = 37
>
> We are open to changing this if you have any advice.
> (@Daniel and @Jon are actually working on restructuring this)
>
> > How many cgroups have actual workloads running?
>
> Do you have a command line trick to determine this?
>

The rstat infra maintains a per-cpu cgroup update tree to only flush
stats of cgroups which have seen updates. So, even if you have a large
number of cgroups, as long as the workload is active in only a small
number of them, the update tree should be much smaller. That is the
reason I asked these questions. I don't have any advice yet. At the
moment I am trying to understand the usage and then hopefully work on
optimizing those cases.

> > Can the network softirqs run on any cpus or smaller
> > set of cpus? I am assuming these softirqs are processing packets from
> > any or all cgroups and thus have larger cgroup update tree.
>
> Softirq, and specifically NET_RX, is running on half of the cores
> (i.e. 64).
> (I'm looking at restructuring this allocation)
>
> > I wonder if
> > you comment out MEMCG_SOCK stat update and still see the same holding
> > time.
> >
>
> It doesn't look like MEMCG_SOCK is used.
>
> I deduce you are asking:
>  - What is the update count for the different types of
>    mod_memcg_state() calls?
>
> // Dumped via BTF info
> enum memcg_stat_item {
>         MEMCG_SWAP = 43,
>         MEMCG_SOCK = 44,
>         MEMCG_PERCPU_B = 45,
>         MEMCG_VMALLOC = 46,
>         MEMCG_KMEM = 47,
>         MEMCG_ZSWAP_B = 48,
>         MEMCG_ZSWAPPED = 49,
>         MEMCG_NR_STAT = 50,
> };
>
> sudo bpftrace -e 'kfunc:vmlinux:__mod_memcg_state{@[args->idx]=count()}
> END{printf("\nEND time elapsed: %d sec\n", elapsed / 1000000000);}'
> Attaching 2 probes...
> ^C
> END time elapsed: 99 sec
>
> @[45]: 17996
> @[46]: 18603
> @[43]: 61858
> @[47]: 21398919
>
> It seems clear that MEMCG_KMEM = 47 is the main "user".
>  - 21398919/99 = 216150 calls per sec
>
> Could someone explain to me what this MEMCG_KMEM is used for?
>

MEMCG_KMEM is the kernel memory charged to a cgroup. It also contains
the untyped kernel memory which is not included in kernel_stack,
pagetables, percpu, vmalloc, slab, etc.

The reason I asked about MEMCG_SOCK was that it might be causing larger
update trees (more cgroups) on the CPUs processing NET_RX.

Anyway, did the mutex change help your production workload regarding
latencies?
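
BTW, on the earlier question about a command line trick to see how many
cgroups have actual workloads: a rough lower bound, assuming cgroup v2
mounted at /sys/fs/cgroup, is to count the cgroups whose cgroup.procs is
non-empty. This misses cgroups that only see kernel-side charging (e.g.
from softirq context), so treat it as an approximation:

  find /sys/fs/cgroup -name cgroup.procs | while read -r f; do
          # cgroupfs files report size 0, so read the file to test for
          # member processes
          [ -n "$(cat "$f" 2>/dev/null)" ] && dirname "$f"
  done | wc -l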
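
And if you want to see how many distinct cgroups are actually feeding the
per-cpu rstat update trees (rather than just hosting processes), a quick
untested sketch, assuming cgroup_rstat_updated() has not been inlined or
renamed on your kernel and is visible to kprobes:

  # count updates per struct cgroup pointer for a 10 second window;
  # the number of map entries printed at exit approximates the number of
  # cgroups with pending updates, and the counts show the hottest ones
  sudo bpftrace -e 'kprobe:cgroup_rstat_updated { @cgrps[arg0] = count(); }
  interval:s:10 { exit(); }'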