Re: [PATCH v2] cgroup: serialize css kill and release paths

Tadeusz Struk <tadeusz.struk@xxxxxxxxxx> · Tue, 7 Jun 2022 11:47:00 -0700

On 6/6/22 05:39, Michal Koutný wrote:
On Fri, Jun 03, 2022 at 11:13:21AM -0700, Tadeusz Struk<tadeusz.struk@xxxxxxxxxx>  wrote:
In such scenario the css_killed_work_fn will be en-queued via
cgroup_apply_control_disable(cgrp)->kill_css(css), and bail out to
cgroup_kn_unlock(). Then cgroup_kn_unlock() will call:
cgroup_put(cgrp)->css_put(&cgrp->self), which will try to enqueue
css_release_work_fn for the same css instance, causing a list_add
corruption bug, as can be seen in the syzkaller report [1].
This hypothesis doesn't add up to me (I am sorry).

The kill_css(css) would be a css associated with a subsys (css.ss !=
NULL) whereas css_put(&cgrp->self) is a different css just for the
cgroup (css.ss == NULL).

Yes, you are right. I couldn't figure it out where the extra css_put()
is called from, and the only place that fitted into my theory was from
the cgroup_kn_unlock() in cgroup_apply_control_disable().
After some more debugging I can see that, as you said, the cgrp->self
is a different css. The offending _put() is actually called by the
percpu_ref_kill_and_confirm(), as it not only calls the passed confirm_kill
percpu_ref_func_t, but also it puts the refcnt iself.
Because the cgroup_apply_control_disable() will loop for_each_live_descendant,
and call css_kill() on all css'es, and css_killed_work_fn() will also loop
and call css_put() on all parents, the css_release() will be called on the
first parent prematurely, causing the BUG(). What I think should be done
to balance put/get is to call css_get() for all the parents in kill_css():

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index c1e1a5c34e77..3ca61325bc4e 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5527,6 +5527,8 @@ static void css_killed_ref_fn(struct percpu_ref *ref)
  */
 static void kill_css(struct cgroup_subsys_state *css)
 {
+       struct cgroup_subsys_state *_css = css;
+
        lockdep_assert_held(&cgroup_mutex);
 
        if (css->flags & CSS_DYING)
@@ -5541,10 +5543,13 @@ static void kill_css(struct cgroup_subsys_state *css)
        css_clear_dir(css);
 
        /*
-        * Killing would put the base ref, but we need to keep it alive
-        * until after ->css_offline().
+        * Killing would put the base ref, but we need to keep it alive,
+        * and all its parents, until after ->css_offline().
         */
-       css_get(css);
+       do {
+               css_get(_css);
+               _css = _css->parent;
+       } while (_css && atomic_read(&_css->online_cnt));
 
        /*
         * cgroup core guarantees that, by the time ->css_offline() is

This will be then "reverted" in css_killed_work_fn()
Please let me know if it makes sense to you.
I'm still testing it, but syzbot is very slow today.

--
Thanks,
Tadeusz