Re: [PATCH v1] cgroup/rstat: add cgroup_rstat_cpu_lock helpers and tracepoints

Jesper Dangaard Brouer <hawk@xxxxxxxxxx> · Mon, 6 May 2024 14:03:47 +0200

On 03/05/2024 21.18, Shakeel Butt wrote:
On Fri, May 03, 2024 at 04:00:20PM +0200, Jesper Dangaard Brouer wrote:

[...]

I may have been mistakenly thinking the lock hold time refers to just the
cpu_lock. Your reported times here are about the cgroup_rstat_lock.
Right? If so, the numbers make sense to me.

True, my reported numbers here are about the cgroup_rstat_lock.
Glad to hear, we are more aligned then :-)

Given I just got some prod machines online with this patch's
cgroup_rstat_cpu_lock tracepoints, I can give you some early results
on hold-time for the cgroup_rstat_cpu_lock.

Oh you have already shared the preliminary data.

From this one-liner bpftrace command:

   sudo bpftrace -e '
          tracepoint:cgroup:cgroup_rstat_cpu_lock_contended {
            @start[tid]=nsecs; @cnt[probe]=count()}
          tracepoint:cgroup:cgroup_rstat_cpu_locked {
            $now=nsecs;
            if (args->contended) {
              @wait_per_cpu_ns=hist($now-@start[tid]); delete(@start[tid]);}
            @cnt[probe]=count(); @locked[tid]=$now}
          tracepoint:cgroup:cgroup_rstat_cpu_unlock {
            $now=nsecs;
            @locked_per_cpu_ns=hist($now-@locked[tid]); delete(@locked[tid]);
            @cnt[probe]=count()}
          interval:s:1 {time("%H:%M:%S "); print(@wait_per_cpu_ns);
            print(@locked_per_cpu_ns); print(@cnt); clear(@cnt);}'

Results from a 1 sec period:

13:39:55 @wait_per_cpu_ns:
[512, 1K)              3 |                                                    |
[1K, 2K)              12 |@                                                   |
[2K, 4K)             390 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4K, 8K)              70 |@@@@@@@@@                                           |
[8K, 16K)             24 |@@@                                                 |
[16K, 32K)           183 |@@@@@@@@@@@@@@@@@@@@@@@@                            |
[32K, 64K)            11 |@                                                   |

@locked_per_cpu_ns:
[256, 512)         75592 |@                                                   |
[512, 1K)        2537357 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1K, 2K)          528615 |@@@@@@@@@@                                          |
[2K, 4K)          168519 |@@@                                                 |
[4K, 8K)          162039 |@@@                                                 |
[8K, 16K)         100730 |@@                                                  |
[16K, 32K)         42276 |                                                    |
[32K, 64K)          1423 |                                                    |
[64K, 128K)           89 |                                                    |

  @cnt[tracepoint:cgroup:cgroup_rstat_cpu_lock_contended]: 3 /sec
  @cnt[tracepoint:cgroup:cgroup_rstat_cpu_unlock]: 3200  /sec
  @cnt[tracepoint:cgroup:cgroup_rstat_cpu_locked]: 3200  /sec

So, we see "flush-code-path" per-CPU-holding @locked_per_cpu_ns isn't
exceeding 128 usec.
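
For a rough feel of how the hold times are distributed, the bucket counts
from the @locked_per_cpu_ns histogram can be summed up. The following is
only a sketch: the bucket values are copied from the output above, and the
helper names are made up for illustration.

```python
# Bucket data copied from the @locked_per_cpu_ns histogram:
# (lower bound ns, upper bound ns, sample count).
LOCKED_PER_CPU_NS = [
    (256, 512, 75592),
    (512, 1024, 2537357),
    (1024, 2048, 528615),
    (2048, 4096, 168519),
    (4096, 8192, 162039),
    (8192, 16384, 100730),
    (16384, 32768, 42276),
    (32768, 65536, 1423),
    (65536, 131072, 89),
]


def share_below(buckets, limit_ns):
    """Fraction of samples in buckets whose upper bound is <= limit_ns."""
    total = sum(count for _lo, _hi, count in buckets)
    below = sum(count for _lo, hi, count in buckets if hi <= limit_ns)
    return below / total


def max_upper_bound(buckets):
    """Upper bound (ns) of the highest non-empty bucket."""
    return max(hi for _lo, hi, count in buckets if count > 0)


if __name__ == "__main__":
    print("highest bucket upper bound (ns):", max_upper_bound(LOCKED_PER_CPU_NS))
    print("share below 1K ns :", round(share_below(LOCKED_PER_CPU_NS, 1024), 4))
    print("share below 16K ns:", round(share_below(LOCKED_PER_CPU_NS, 16384), 4))
```

With these counts, roughly 72% of the hold times fall below 1K ns and
about 1.2% land in the buckets at 16K ns and above. Since these are
power-of-two buckets, the result is only as precise as the bucket edges.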

Hmm 128 usec is actually unexpectedly high. 

What does the cgroup hierarchy on your system look like?
I didn't design this, so hopefully my co-workers can help me out here? 
(To @Daniel or @Jon)

My low level view is that, there are 17 top-level directories in 
/sys/fs/cgroup/.
There are 649 cgroups (counting occurrence of memory.stat).
There are two directories that contain the major part.
 - /sys/fs/cgroup/system.slice = 379
 - /sys/fs/cgroup/production.slice = 233
 - (production.slice has two directory levels)
 - remaining 37
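
The per-directory counts above come from counting memory.stat files. A
minimal sketch of that counting, grouped by top-level directory, might
look as follows. The function name count_cgroups is hypothetical and the
default root assumes a cgroup v2 mount at /sys/fs/cgroup.

```python
import os
from collections import Counter


def count_cgroups(root="/sys/fs/cgroup"):
    """Count directories under root that contain a memory.stat file,
    grouped by top-level directory ('.' stands for root itself)."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        if "memory.stat" in filenames:
            rel = os.path.relpath(dirpath, root)
            top = rel.split(os.sep)[0]
            counts[top] += 1
    return counts


if __name__ == "__main__":
    counts = count_cgroups()
    # Largest top-level directories first, then the grand total.
    for top, n in counts.most_common():
        print(f"{top:30s} {n}")
    print("total:", sum(counts.values()))
```

This counts cgroups that exist, not cgroups with running workloads, so it
does not answer the question of how many are active.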

We are open to changing this if you have any advice?
(@Daniel and @Jon are actually working on restructuring this)

How many cgroups have actual workloads running?
Do you have a command line trick to determine this?

Can the network softirqs run on any cpus or only on a smaller
set of cpus? I am assuming these softirqs are processing packets from
any or all cgroups and thus have a larger cgroup update tree.

Softirq, and specifically NET_RX, is running on half of the cores (e.g. 64).
(I'm looking at restructuring this allocation)

I wonder whether,
if you comment out the MEMCG_SOCK stat update, you still see the same
holding time.

It doesn't look like MEMCG_SOCK is used.

I deduce you are asking:
 - What is the update count for different types of mod_memcg_state() calls?

// Dumped via BTF info
enum memcg_stat_item {
        MEMCG_SWAP = 43,
        MEMCG_SOCK = 44,
        MEMCG_PERCPU_B = 45,
        MEMCG_VMALLOC = 46,
        MEMCG_KMEM = 47,
        MEMCG_ZSWAP_B = 48,
        MEMCG_ZSWAPPED = 49,
        MEMCG_NR_STAT = 50,
};

sudo bpftrace -e 'kfunc:vmlinux:__mod_memcg_state{@[args->idx]=count()} 
END{printf("\nEND time elapsed: %d sec\n", elapsed / 1000000000);}'
Attaching 2 probes...
^C
END time elapsed: 99 sec

@[45]: 17996
@[46]: 18603
@[43]: 61858
@[47]: 21398919

It seems clear that MEMCG_KMEM = 47 is the main "user".
 - 21398919/99 = 216150 calls per sec
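
The same arithmetic for all four counters, as a small sketch. The counts
and the 99 sec runtime are copied from the bpftrace output above, and the
index-to-name mapping comes from the BTF enum dump. The helper names are
made up for illustration.

```python
ELAPSED_SEC = 99  # "END time elapsed" from the bpftrace run

# idx -> number of __mod_memcg_state() calls seen in the run
CALLS = {43: 61858, 45: 17996, 46: 18603, 47: 21398919}

# idx -> name, taken from the enum memcg_stat_item dump
NAMES = {
    43: "MEMCG_SWAP",
    45: "MEMCG_PERCPU_B",
    46: "MEMCG_VMALLOC",
    47: "MEMCG_KMEM",
}


def per_second(calls, elapsed_sec):
    """Integer calls per second for each stat index."""
    return {idx: n // elapsed_sec for idx, n in calls.items()}


def share(calls, idx):
    """Fraction of all recorded calls that used the given index."""
    return calls[idx] / sum(calls.values())


if __name__ == "__main__":
    rates = per_second(CALLS, ELAPSED_SEC)
    for idx in sorted(CALLS):
        print(f"{NAMES[idx]:16s} {rates[idx]:8d} calls/sec "
              f"({share(CALLS, idx):.2%} of calls)")
```

With these numbers MEMCG_KMEM accounts for more than 99% of the recorded
calls, and index 44 (MEMCG_SOCK) does not show up at all.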

Could someone explain to me what this MEMCG_KMEM is used for?

My latency requirement, to avoid RX-queue overflow with 1024 slots
running at 25 Gbit/s, is 27.6 usec with small packets and 500 usec
(0.5 ms) with MTU-size packets.  This is very close to my latency
requirements.
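
As a sanity check, the time to fill an RX ring can be estimated from the
per-frame wire time. This is a sketch under assumptions that are not
stated in this mail: 64-byte minimum frames, 1518-byte full-size frames,
and 20 bytes of per-frame wire overhead (preamble plus inter-frame gap).
These are standard Ethernet values, but they are only a guess at how the
figures above were derived.

```python
LINK_BPS = 25e9        # 25 Gbit/s link speed
RX_SLOTS = 1024        # RX-queue size in descriptors
WIRE_OVERHEAD = 20     # assumed: 8 bytes preamble/SFD + 12 bytes inter-frame gap


def ring_fill_time_usec(frame_bytes, slots=RX_SLOTS, link_bps=LINK_BPS):
    """Time in usec for back-to-back frames to fill all RX slots."""
    wire_bits = (frame_bytes + WIRE_OVERHEAD) * 8
    return slots * wire_bits / link_bps * 1e6


if __name__ == "__main__":
    print("small packets (64B frames):  %.1f usec" % ring_fill_time_usec(64))
    print("MTU packets (1518B frames): %.1f usec" % ring_fill_time_usec(1518))
```

Under these assumptions the results come out close to the 27.6 usec and
500 usec figures quoted above.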

--Jesper