On 2/28/25 5:25 PM, Yosry Ahmed wrote:
On Fri, Feb 28, 2025 at 05:06:23PM -0800, JP Kobryn wrote:
[..]
cgroup_idr_replace(&ss->css_idr, NULL, css->id);
if (ss->css_released)
[..]
@@ -6188,6 +6186,9 @@ int __init cgroup_init(void)
css->id = cgroup_idr_alloc(&ss->css_idr, css, 1, 2,
GFP_KERNEL);
BUG_ON(css->id < 0);
+
+ if (css->ss && css->ss->css_rstat_flush)
+ BUG_ON(cgroup_rstat_init(css));
Why do we need this call here? We already call cgroup_rstat_init() in
cgroup_init_subsys(). IIUC for subsystems with ss->early_init, we will
have already called cgroup_init_subsys() in cgroup_init_early().
Did I miss something?
Hmm, it's a good question. cgroup_rstat_init() is deferred in the same
way that cgroup_idr_alloc() is. So for an ss with early_init == true,
cgroup_rstat_init() is not called during cgroup_init_early().
Oh, I didn't realize that the call in cgroup_init_subsys() is made only
when early_init == false.
I think we need a comment to clarify that cgroup_idr_alloc() and
cgroup_rstat_init() are not called in cgroup_init_subsys() when
early_init == true, and hence need to be called in cgroup_init().
Or maybe organize the code in a way that makes this more obvious (put
them in a helper with a descriptive name? idk).
I see what you're getting at. Let me think of something for v3.
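For illustration, one possible shape for that helper (the name is
invented here just to sketch the idea; the idr/rstat calls are the ones
from this series):

	/*
	 * Hypothetical helper: the parts of css setup that cannot run from
	 * cgroup_init_early() because the idr and percpu allocators are not
	 * ready yet, deferred to cgroup_init() for early_init subsystems.
	 * css->ss is non-NULL here since this only runs for subsystem css's.
	 */
	static int __init css_deferred_init(struct cgroup_subsys_state *css)
	{
		struct cgroup_subsys *ss = css->ss;

		css->id = cgroup_idr_alloc(&ss->css_idr, css, 1, 2, GFP_KERNEL);
		if (css->id < 0)
			return css->id;

		if (ss->css_rstat_flush)
			return cgroup_rstat_init(css);

		return 0;
	}

cgroup_init() could then keep a single BUG_ON(css_deferred_init(css))
in the early_init branch.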
Is it safe to call alloc_percpu() during early boot? If so, the
deferral can be removed and cgroup_rstat_init() can be called in one
place.
I don't think so. cgroup_init_early() is called before
setup_per_cpu_areas().
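For reference, the relevant ordering, abridged from start_kernel() in
init/main.c (not literal code, only the calls that matter here):

	void __init start_kernel(void)
	{
		cgroup_init_early();	/* percpu allocator not up yet */
		setup_per_cpu_areas();	/* alloc_percpu() usable from here on */
		cgroup_init();		/* deferred idr/rstat init can run */
	}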
Cool. Thanks for pointing that out.
} else {
cgroup_init_subsys(ss, false);
}
[..]
@@ -300,27 +306,25 @@ static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
}
/* see cgroup_rstat_flush() */
-static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
+static void cgroup_rstat_flush_locked(struct cgroup_subsys_state *css)
__releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock)
{
+ struct cgroup *cgrp = css->cgroup;
int cpu;
lockdep_assert_held(&cgroup_rstat_lock);
for_each_possible_cpu(cpu) {
- struct cgroup *pos = cgroup_rstat_updated_list(cgrp, cpu);
+ struct cgroup_subsys_state *pos;
+ pos = cgroup_rstat_updated_list(css, cpu);
for (; pos; pos = pos->rstat_flush_next) {
- struct cgroup_subsys_state *css;
+ if (!pos->ss)
+ cgroup_base_stat_flush(pos->cgroup, cpu);
+ else
+ pos->ss->css_rstat_flush(pos, cpu);
- cgroup_base_stat_flush(pos, cpu);
- bpf_rstat_flush(pos, cgroup_parent(pos), cpu);
-
- rcu_read_lock();
- list_for_each_entry_rcu(css, &pos->rstat_css_list,
- rstat_css_node)
- css->ss->css_rstat_flush(css, cpu);
- rcu_read_unlock();
+ bpf_rstat_flush(pos->cgroup, cgroup_parent(pos->cgroup), cpu);
We should call bpf_rstat_flush() only if (!pos->ss) as well, right?
Otherwise we will call the BPF rstat flush whenever any subsystem css
is flushed.
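In code terms, that would mean gating it the same way as the base stat
flush in this patch's loop:

	/* run BPF flushers once per cgroup, from the base css only */
	if (!pos->ss)
		bpf_rstat_flush(pos->cgroup, cgroup_parent(pos->cgroup), cpu);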
I guess it's because BPF can now pass any subsystem to
cgroup_rstat_flush(), and we don't keep track. I think it would be
better if we do not allow BPF programs to select a css and always make
them flush the self css.
We can perhaps introduce a bpf_cgroup_rstat_flush() wrapper that takes
in a cgroup and passes cgroup->self internally to cgroup_rstat_flush().
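A minimal sketch of that wrapper (the kfunc name is just the suggestion
above, not existing code, and it assumes cgroup_rstat_flush() takes a
css as in this series):

	/*
	 * BPF programs pass a cgroup and we always flush its base css
	 * (cgrp->self), so they can no longer pick an arbitrary subsystem
	 * css to flush.
	 */
	__bpf_kfunc void bpf_cgroup_rstat_flush(struct cgroup *cgrp)
	{
		cgroup_rstat_flush(&cgrp->self);
	}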
But if the plan is to remove the bpf_rstat_flush() call here soon, then
it's probably not worth the hassle.
I'm fine with this if others are in agreement. A similar concept was
done in v1.
Let's wait for Shakeel to chime in here, since he suggested removing
this hook, but I am not sure whether he intended to actually do it.
Better not to waste effort if this will be gone soon anyway.
Shakeel (and others), WDYT?