Re: [PATCH-patch v6] cgroup: Show # of subsystem CSSes in cgroup.stat

Kamalesh Babulal <kamalesh.babulal@xxxxxxxxxx> · Sat, 13 Jul 2024 21:20:19 +0530

On 7/13/24 4:59 AM, Waiman Long wrote:
> Cgroup subsystem state (CSS) is an abstraction in the cgroup layer to
> help manage different structures in various cgroup subsystems by being
> an embedded element inside a larger structure like cpuset or mem_cgroup.
> 
> The /proc/cgroups file shows the number of cgroups for each of the
> subsystems.  With cgroup v1, the number of CSSes is the same as the
> number of cgroups.  That is not the case anymore with cgroup v2. The
> /proc/cgroups file cannot show the actual number of CSSes for the
> subsystems that are bound to cgroup v2.
> 
> So if a v2 cgroup subsystem is leaking cgroups (usually memory cgroup),
> we can't tell by looking at /proc/cgroups which cgroup subsystems may
> be responsible.
> 
> As cgroup v2 had deprecated the use of /proc/cgroups, the hierarchical
> cgroup.stat file is now being extended to show the number of live and
> dying CSSes associated with all the non-inhibited cgroup subsystems
> that have been bound to cgroup v2 as long as they are not both zero.
> The number includes CSSes in the current cgroup as well as in all the
> descendants underneath it.  This will help us pinpoint which subsystems
> are responsible for the increasing number of dying (nr_dying_descendants)
> cgroups.
> 
> The CSSes dying counts are stored in the cgroup structure itself
> instead of inside the CSS as suggested by Johannes. This will allow
> us to accurately track dying counts of cgroup subsystems that have
> recently been disabled in a cgroup. It is now possible that a zero
> subsystem number is coupled with a non-zero dying subsystem number.
> 
> The cgroup-v2.rst file is updated to discuss this new behavior.
> 
> With this patch applied, a sample output from root cgroup.stat file
> was shown below.
> 
> 	nr_descendants 56
> 	nr_subsys_cpuset 1
> 	nr_subsys_cpu 43
> 	nr_subsys_io 43
> 	nr_subsys_memory 56
> 	nr_subsys_perf_event 57
> 	nr_subsys_hugetlb 1
> 	nr_subsys_pids 56
> 	nr_subsys_rdma 1
> 	nr_subsys_misc 1
> 	nr_dying_descendants 30
> 	nr_dying_subsys_cpuset 0
> 	nr_dying_subsys_cpu 0
> 	nr_dying_subsys_io 0
> 	nr_dying_subsys_memory 30
> 	nr_dying_subsys_perf_event 0
> 	nr_dying_subsys_hugetlb 0
> 	nr_dying_subsys_pids 0
> 	nr_dying_subsys_rdma 0
> 	nr_dying_subsys_misc 0
> 
> Another sample output from system.slice/cgroup.stat was:
> 
> 	nr_descendants 32
> 	nr_subsys_cpu 30
> 	nr_subsys_io 30
> 	nr_subsys_memory 32
> 	nr_subsys_perf_event 33
> 	nr_subsys_pids 32
> 	nr_dying_descendants 32
> 	nr_dying_subsys_cpu 0
> 	nr_dying_subsys_io 0
> 	nr_dying_subsys_memory 32
> 	nr_dying_subsys_perf_event 0
> 	nr_dying_subsys_pids 0
> 

[...]

> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index c8e4b62b436a..73774c841100 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -3669,12 +3669,43 @@ static int cgroup_events_show(struct seq_file *seq, void *v)
>  static int cgroup_stat_show(struct seq_file *seq, void *v)
>  {
>  	struct cgroup *cgroup = seq_css(seq)->cgroup;
> +	struct cgroup_subsys_state *css;
> +	int dying_cnt[CGROUP_SUBSYS_COUNT];
> +	int ssid;
>  
>  	seq_printf(seq, "nr_descendants %d\n",
>  		   cgroup->nr_descendants);
> +
> +	/*
> +	 * Show the number of live and dying csses associated with each of
> +	 * non-inhibited cgroup subsystems that is either enabled in current
> +	 * cgroup or has non-zero dying count.
> +	 *
> +	 * Without proper lock protection, racing is possible. So the
> +	 * numbers may not be consistent when that happens.
> +	 */
> +	rcu_read_lock();
> +	for (ssid = 0; ssid < CGROUP_SUBSYS_COUNT; ssid++) {
> +		dying_cnt[ssid] = -1;
> +		if (BIT(ssid) & cgrp_dfl_inhibit_ss_mask)
> +			continue;
> +		css = rcu_dereference_raw(cgroup->subsys[ssid]);
> +		if (!css && !cgroup->nr_dying_subsys[ssid])
> +			continue;

Sorry if I have misread the discussion in the other thread about displaying
nr_descendants and nr_dying_subsys_<subsys>. I believe the idea was to print
them for both enabled and disabled cgroup controllers, so that the set of keys
in the output stays consistent and does not depend on which controllers are
enabled or on whether a previously enabled controller still has
nr_dying_subsys > 0.

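For example, on a freshly rebooted VM, the set of keys in foo/cgroup.stat
changes once cpuset is enabled in the parent's cgroup.subtree_control: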

# cd /sys/fs/cgroup/
# cat cgroup.subtree_control
cpu memory pids

# mkdir foo
# cat foo/cgroup.stat
nr_descendants 0
nr_subsys_cpu 1
nr_subsys_memory 1
nr_subsys_perf_event 1
nr_subsys_pids 1
nr_dying_descendants 0
nr_dying_subsys_cpu 0
nr_dying_subsys_memory 0
nr_dying_subsys_perf_event 0
nr_dying_subsys_pids 0

# echo '+cpuset' > cgroup.subtree_control

# cat foo/cgroup.stat
nr_descendants 0
nr_subsys_cpuset 1
nr_subsys_cpu 1
nr_subsys_memory 1
nr_subsys_perf_event 1
nr_subsys_pids 1
nr_dying_descendants 0
nr_dying_subsys_cpuset 0
nr_dying_subsys_cpu 0
nr_dying_subsys_memory 0
nr_dying_subsys_perf_event 0
nr_dying_subsys_pids 0
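
To make the difference between the two outputs above easy to spot, here is a
minimal sketch of a helper that lists the keys present in a second cgroup.stat
snapshot but missing from the first. The function name new_stat_keys and the
snapshot file paths are hypothetical and not part of the patch; it only
assumes the "key value" per-line format shown above.

```shell
# Hypothetical helper: print keys that appear in snapshot $2 but not in $1.
# Both arguments are files holding saved copies of a cgroup.stat file.
new_stat_keys() {
	awk '
		# First file: remember every key (first column).
		FILENAME == ARGV[1] { seen[$1] = 1; next }
		# Second file: report keys that were not in the first one.
		!($1 in seen) { print $1 }
	' "$1" "$2"
}
```

Used on saved copies of the two foo/cgroup.stat outputs above (say
/tmp/before and /tmp/after), "new_stat_keys /tmp/before /tmp/after" prints
nr_subsys_cpuset and nr_dying_subsys_cpuset, which are exactly the keys a
parser would not have seen before cpuset was enabled.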

> +
> +		dying_cnt[ssid] = cgroup->nr_dying_subsys[ssid];
> +		seq_printf(seq, "nr_subsys_%s %d\n", cgroup_subsys[ssid]->name,
> +			   css ? (css->nr_descendants + 1) : 0);
> +	}
> +
>  	seq_printf(seq, "nr_dying_descendants %d\n",
>  		   cgroup->nr_dying_descendants);
> -
> +	for (ssid = 0; ssid < CGROUP_SUBSYS_COUNT; ssid++) {
> +		if (dying_cnt[ssid] >= 0)
> +			seq_printf(seq, "nr_dying_subsys_%s %d\n",
> +				   cgroup_subsys[ssid]->name, dying_cnt[ssid]);
> +	}
> +	rcu_read_unlock();
>  	return 0;
>  }
>  
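
As a side note on how the new fields would typically be consumed, here is a
rough sketch of a filter that reports only the subsystems with a non-zero
dying count. The function name dying_subsys is hypothetical; it assumes the
nr_dying_subsys_<subsys> line format from the sample output in the patch
description and reads a saved copy of cgroup.stat.

```shell
# Hypothetical helper: list "<subsys> <count>" for every
# nr_dying_subsys_<subsys> line in file $1 whose count is non-zero.
dying_subsys() {
	awk '
		/^nr_dying_subsys_/ && $2 > 0 {
			# Strip the prefix so only the subsystem name remains.
			sub(/^nr_dying_subsys_/, "", $1)
			print $1, $2
		}
	' "$1"
}
```

Run against a saved copy of the root cgroup.stat sample quoted at the top,
this prints "memory 30". A subsystem whose line is omitted entirely and one
that is reported with a count of 0 look the same to such a filter, so it works
with either output policy; the consistency concern above matters more for
parsers that expect a fixed set of keys.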

-- 
Thanks,
Kamalesh