On Mon, May 6, 2024 at 9:22 AM Shakeel Butt <shakeel.butt@xxxxxxxxx> wrote:
>
> On Mon, May 06, 2024 at 02:03:47PM +0200, Jesper Dangaard Brouer wrote:
> >
> > On 03/05/2024 21.18, Shakeel Butt wrote:
> [...]
> > >
> > > Hmm 128 usec is actually unexpectedly high.
> > > What does the cgroup hierarchy on your system look like?
> >
> > I didn't design this, so hopefully my co-workers can help me out here?
> > (To @Daniel or @Jon)
> >
> > My low-level view is that there are 17 top-level directories in
> > /sys/fs/cgroup/.
> > There are 649 cgroups (counting occurrences of memory.stat).
> > There are two directories that contain the major part:
> >  - /sys/fs/cgroup/system.slice = 379
> >  - /sys/fs/cgroup/production.slice = 233
> >    - (production.slice has two directory levels)
> >  - remaining 37
> >
> > We are open to changing this if you have any advice.
> > (@Daniel and @Jon are actually working on restructuring this)
> >
> > > How many cgroups have actual workloads running?
> >
> > Do you have a command line trick to determine this?
> >
> The rstat infra maintains a per-cpu cgroup update tree to only flush
> stats of cgroups which have seen updates. So, even if you have a large
> number of cgroups but the workload is active in a small number of
> cgroups, the update tree should be much smaller. That is the reason I
> asked these questions. I don't have any advice yet. At the moment I am
> trying to understand the usage and then hopefully work on optimizing
> those.
>
> > > Can the network softirqs run on any cpus or a smaller
> > > set of cpus? I am assuming these softirqs are processing packets from
> > > any or all cgroups and thus have a larger cgroup update tree.
> >
> > Softirq, and specifically NET_RX, is running on half of the cores
> > (e.g. 64).
> > (I'm looking at restructuring this allocation)
> >
> > > I wonder if
> > > you can comment out the MEMCG_SOCK stat update and still see the
> > > same holding time.
> >
> > It doesn't look like MEMCG_SOCK is used.
> >
> > I deduce you are asking:
> >  - What is the update count for the different types of
> >    mod_memcg_state() calls?
> >
> > // Dumped via BTF info
> > enum memcg_stat_item {
> >         MEMCG_SWAP = 43,
> >         MEMCG_SOCK = 44,
> >         MEMCG_PERCPU_B = 45,
> >         MEMCG_VMALLOC = 46,
> >         MEMCG_KMEM = 47,
> >         MEMCG_ZSWAP_B = 48,
> >         MEMCG_ZSWAPPED = 49,
> >         MEMCG_NR_STAT = 50,
> > };
> >
> > sudo bpftrace -e 'kfunc:vmlinux:__mod_memcg_state{@[args->idx]=count()}
> > END{printf("\nEND time elapsed: %d sec\n", elapsed / 1000000000);}'
> > Attaching 2 probes...
> > ^C
> > END time elapsed: 99 sec
> >
> > @[45]: 17996
> > @[46]: 18603
> > @[43]: 61858
> > @[47]: 21398919
> >
> > It seems clear that MEMCG_KMEM = 47 is the main "user".
> >  - 21398919/99 = 216150 calls per sec
> >
> > Could someone explain to me what this MEMCG_KMEM is used for?
> >
> MEMCG_KMEM is the kernel memory charged to a cgroup. It also contains
> the untyped kernel memory which is not included in kernel_stack,
> pagetables, percpu, vmalloc, slab, etc.
>
> The reason I asked about MEMCG_SOCK was that it might be causing larger
> update trees (more cgroups) on the CPUs processing NET_RX.

We pass cgroup.memory=nosocket in the kernel cmdline:
 * https://lore.kernel.org/lkml/CABWYdi0G7cyNFbndM-ELTDAR3x4Ngm0AehEp5aP0tfNkXUE+Uw@xxxxxxxxxxxxxx/

> Anyways, did the mutex change help your production workload regarding
> latencies?
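
P.S. On the "how many cgroups have actual workloads running" question
above: on cgroup v2, one rough proxy is counting the cgroups whose
cgroup.events file reports "populated 1", i.e. cgroups with live
processes somewhere in their subtree. This is only a sketch and not the
same thing as the rstat update tree, since a cgroup can see stat updates
without holding any processes:

  grep -r --include=cgroup.events -l '^populated 1' /sys/fs/cgroup | wc -l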
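
For the update-tree side itself, a bpftrace one-liner in the same style
as the __mod_memcg_state one above might show which cgroups are feeding
the per-cpu update trees. Untested sketch, assuming a 6.9-era kernel
where the function is still named cgroup_rstat_updated() and is not
inlined away:

  sudo bpftrace -e 'kfunc:vmlinux:cgroup_rstat_updated { @[str(args->cgrp->kn->name)] = count(); }'

The map keys are only the leaf cgroup directory names, but the number of
distinct keys should give an idea of how many cgroups end up in the
update trees between flushes.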