On 7/9/24 19:08, Roman Gushchin wrote:
On Tue, Jul 09, 2024 at 09:28:14AM -0400, Waiman Long wrote:
The /proc/cgroups file shows the number of cgroups for each of the
subsystems. With cgroup v1, the number of CSSes is the same as the
number of cgroups. That is not the case anymore with cgroup v2. The
/proc/cgroups file cannot show the actual number of CSSes for the
subsystems that are bound to cgroup v2.
So if a v2 cgroup subsystem is leaking cgroups (usually memory cgroup),
we can't tell by looking at /proc/cgroups which cgroup subsystems may be
responsible. This patch adds CSS counts in the cgroup_subsys structure
to keep track of the number of CSSes for each of the cgroup subsystems.
As cgroup v2 has deprecated the use of /proc/cgroups, the root
cgroup.stat file is extended to show the number of outstanding CSSes
associated with all the non-inhibited cgroup subsystems that have been
bound to cgroup v2. This will help us pinpoint which subsystems may be
responsible for the increasing number of dying (nr_dying_descendants)
cgroups.
The cgroup-v2.rst file is updated to discuss this new behavior.
With this patch applied, a sample output from the root cgroup.stat file
is shown below.
nr_descendants 53
nr_dying_descendants 34
nr_cpuset 1
nr_cpu 40
nr_io 40
nr_memory 87
nr_perf_event 54
nr_hugetlb 1
nr_pids 53
nr_rdma 1
nr_misc 1
In this particular case, it can be seen that the memory cgroup is the
most likely culprit for the 34 dying cgroups.
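Roughly, the reporting side amounts to something like the sketch below
(illustrative and untested, not the actual cgroup.c hunk; the visibility
check via the inhibit mask is simplified). It walks the registered
subsystems from cgroup_stat_show() and emits one nr_<subsys> line per
controller visible on cgroup v2, using the new nr_csses counter:

/*
 * Illustrative sketch only, not the actual cgroup.c hunk.  Emits one
 * nr_<ss> line per controller visible on cgroup v2, using the new
 * nr_csses counter in struct cgroup_subsys.
 */
static int cgroup_stat_show(struct seq_file *seq, void *v)
{
	struct cgroup *cgroup = seq_css(seq)->cgroup;
	struct cgroup_subsys *ss;
	int ssid;

	seq_printf(seq, "nr_descendants %d\n", cgroup->nr_descendants);
	seq_printf(seq, "nr_dying_descendants %d\n",
		   cgroup->nr_dying_descendants);

	/* Only the root cgroup reports the per-subsystem CSS counts. */
	if (cgroup_parent(cgroup))
		return 0;

	for_each_subsys(ss, ssid) {
		if (cgrp_dfl_inhibit_ss_mask & (1 << ssid))
			continue;	/* controller not visible on v2 */
		seq_printf(seq, "nr_%s %d\n", ss->name,
			   atomic_read(&ss->nr_csses));
	}
	return 0;
}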
Signed-off-by: Waiman Long <longman@xxxxxxxxxx>
---
Documentation/admin-guide/cgroup-v2.rst | 10 ++++++++--
include/linux/cgroup-defs.h | 3 +++
kernel/cgroup/cgroup.c | 19 +++++++++++++++++++
3 files changed, 30 insertions(+), 2 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 52763d6b2919..65af2f30196f 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -981,6 +981,12 @@ All cgroup core files are prefixed with "cgroup."
A dying cgroup can consume system resources not exceeding
limits, which were active at the moment of cgroup deletion.
+ nr_<cgroup_subsys>
+ Total number of cgroups associated with that cgroup
+ subsystem, e.g. cpuset or memory. These cgroup counts
+ will only be shown in the root cgroup and for subsystems
+ bound to cgroup v2.
+
cgroup.freeze
A read-write single value file which exists on non-root cgroups.
Allowed values are "0" and "1". The default is "0".
@@ -2930,8 +2936,8 @@ Deprecated v1 Core Features
- "cgroup.clone_children" is removed.
-- /proc/cgroups is meaningless for v2. Use "cgroup.controllers" file
- at the root instead.
+- /proc/cgroups is meaningless for v2. Use "cgroup.controllers" or
+ "cgroup.stat" files at the root instead.
Issues with v1 and Rationales for v2
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index b36690ca0d3f..522ab77f0406 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -776,6 +776,9 @@ struct cgroup_subsys {
* specifies the mask of subsystems that this one depends on.
*/
unsigned int depends_on;
+
+ /* Number of CSSes, used only for /proc/cgroups */
+ atomic_t nr_csses;
I believe it should be doable without atomics because most css
operations are already synchronized using the cgroup mutex.
css_create() is protected by cgroup_mutex, but I don't believe
css_free_rwork_fn() is, since it is called from a kworker. So atomic_t
is still needed.
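To make that concrete, the two update sites would look roughly like this
(a rough, untested sketch; the exact hook points are assumed for
illustration and are not quoted from the patch):

/*
 * Rough sketch of the two update sites (hook points assumed).  The
 * increment runs with cgroup_mutex held, but the decrement runs from a
 * kworker without it, hence the atomic_t.
 */
static struct cgroup_subsys_state *css_create(struct cgroup *cgrp,
					      struct cgroup_subsys *ss)
{
	/* ... css allocation and onlining, with cgroup_mutex held ... */
	atomic_inc(&ss->nr_csses);
	/* ... */
}

static void css_free_rwork_fn(struct work_struct *work)
{
	struct cgroup_subsys_state *css =
		container_of(to_rcu_work(work), struct cgroup_subsys_state,
			     destroy_rwork);

	/* kworker context, cgroup_mutex is not held here */
	if (css->ss)
		atomic_dec(&css->ss->nr_csses);
	/* ... the actual freeing ... */
}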
Other than that, I believe that this information is useful. Maybe it
can be retrieved using drgn or a bpf iterator, but adding this
functionality to the kernel makes it easier to retrieve, and the
overhead is modest.
Also, if you add it to cgroupfs, why not make it fully hierarchical
like the existing entries in cgroup.stat? And if not, I'd agree with
Johannes that it looks like debugfs material.
To make it hierarchical, I would have to store nr_descendants and
nr_dying_descendants counts in each css, just like the corresponding
fields in struct cgroup. I think it is doable, but the patch would be
much more complex.
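Very roughly, it would mean something like the following (an untested
sketch; all names below are placeholders for illustration, not code
from the posted patch):

/*
 * Rough sketch of a hierarchical variant; every name here is assumed.
 * The idea is to give each css the same pair of counters a cgroup
 * already has:
 *
 *	struct cgroup_subsys_state {
 *		...
 *		int nr_descendants;
 *		int nr_dying_descendants;
 *	};
 *
 * and to walk the css parent chain on online/offline, the way
 * cgroup_create()/cgroup_destroy_locked() update struct cgroup today.
 */
static void css_account_online(struct cgroup_subsys_state *css)
{
	struct cgroup_subsys_state *parent;

	lockdep_assert_held(&cgroup_mutex);
	for (parent = css->parent; parent; parent = parent->parent)
		parent->nr_descendants++;
}

static void css_account_offline(struct cgroup_subsys_state *css)
{
	struct cgroup_subsys_state *parent;

	lockdep_assert_held(&cgroup_mutex);
	for (parent = css->parent; parent; parent = parent->parent) {
		parent->nr_descendants--;
		parent->nr_dying_descendants++;
	}
}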
Cheers,
Longman