On 5/26/22 11:15, Tejun Heo wrote:
> Hello, Michal.
>
> On Thu, May 26, 2022 at 11:56:34AM +0200, Michal Koutný wrote:
> > // ref=A: initial state
> > kill_css()
> >   css_get                          // ref += F == A+F: fuse
> >   percpu_ref_kill_and_confirm
> >     __percpu_ref_switch_to_atomic
> >       percpu_ref_get
> >       // ref += 1 == A+F+1: atomic mode, self-protection
> >     percpu_ref_put
> >     // ref -= 1 == A+F: kill the base reference
> >
> > [via rcu]
> > percpu_ref_switch_to_atomic_rcu
> >   percpu_ref_call_confirm_rcu
> >     css_killed_ref_fn == refcnt.confirm_switch
> >       queue_work(css->destroy_work)                  (1)
> >
> > [via css->destroy_work]
> > css_killed_work_fn == wq.func
> >   offline_css()                    // needs fuse
> >   css_put                          // ref -= F == A: de-fuse
> >     percpu_ref_put
> >     // ref -= 1 == A-1: remove self-protection
> > css_release                        // A <= 1 -> 2nd queue_work explodes!
> I'm not sure I'm following it, but it's perfectly fine to reuse the work
> item at this point. The work item can actually be recycled from the very
> beginning of the work function. The only thing we need to ensure is that
> we don't css_put() prematurely, to avoid it being freed while we're using it.
Yes, it is ok to reuse a work struct, but it's not ok to have the same
work struct enqueued twice on the same workqueue while list debugging
(CONFIG_DEBUG_LIST) is enabled. That's why we are getting the
"BUG: corrupted list..." splat.
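To make that concrete, here is a minimal, self-contained sketch (hypothetical
code, not from the cgroup tree) of why the second enqueue slips past
queue_work()'s own duplicate check: INIT_WORK() reinitializes work->data,
which clears the WORK_STRUCT_PENDING_BIT that queue_work() tests in order to
reject an already-pending item.

#include <linux/workqueue.h>

static void first_fn(struct work_struct *work) { /* ... */ }
static void second_fn(struct work_struct *work) { /* ... */ }

/* one work item embedded in a longer-lived object, as with css->destroy_work */
static struct work_struct shared_work;

static void double_enqueue(struct workqueue_struct *wq)
{
	INIT_WORK(&shared_work, first_fn);
	queue_work(wq, &shared_work);	/* sets PENDING, list_add()s the entry */

	/*
	 * Assume the first item has not started executing yet.
	 * Re-initializing clears WORK_STRUCT_PENDING_BIT, so the second
	 * queue_work() below is NOT rejected as a duplicate and
	 * list_add()s an entry that is still linked on the workqueue's
	 * list; with CONFIG_DEBUG_LIST that is exactly the
	 * "corrupted list in insert_work" BUG.
	 */
	INIT_WORK(&shared_work, second_fn);
	queue_work(wq, &shared_work);
}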
> For the sharing to be a problem, we should be queueing the release work item
> while the destroy instance is still pending, and if that is the case, it
> doesn't really matter whether we use two separate work items or not. We're
> already broken and would just be shifting the problem to explode elsewhere.
>
> The only possibility that I can think of is that somehow we're ending up
> with an extra css_put() somewhere, thus triggering the release path
> prematurely. If that's the case, we'll probably need to trace get/puts to
> find out who's causing the ref imbalance.
That's right. Michal was on the right track for the kill_css() part.
What I think is going on is that once css_create() fails,
cgroup_subtree_control_write() ends up calling first kill_css() and then
css_put() on the same css; I believe it is the &cgrp->self css attached to
the kernfs_node. Note that cgroup_for_each_live_descendant_post() also
visits the root cgroup itself, not just its descendants.
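For reference, kill_css() is where both the "fuse" css_get() and the
CSS_DYING flag (which the patch at the bottom relies on) come from.
Abbreviated from kernel/cgroup/cgroup.c of this era, comments mine:

static void kill_css(struct cgroup_subsys_state *css)
{
	lockdep_assert_held(&cgroup_mutex);

	/* kill_css() is idempotent; a second call is a no-op */
	if (css->flags & CSS_DYING)
		return;

	css->flags |= CSS_DYING;

	/* remove the css's files before killing the reference */
	css_clear_dir(css);

	/*
	 * The extra "fuse" reference keeps the css alive until
	 * css_killed_work_fn() has finished offline_css().
	 */
	css_get(css);

	percpu_ref_kill_and_confirm(&css->refcnt, css_killed_ref_fn);
}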
Here is the call flow (sorry for long lines):
cgroup_subtree_control_write(of)->cgroup_apply_control(cgrp)->cgroup_apply_control_enable(cgrp)->css_create() <- fails here and returns an error
 |
 |-> cgroup_finalize_control(cgrp)->cgroup_apply_control_disable(cgrp)->cgroup_for_each_live_descendant_post(cgrp)->kill_css()->percpu_ref_kill_and_confirm(&css->refcnt, css_killed_ref_fn) <- this triggers css_killed_ref_fn() to be called
 |
 |      css_killed_ref_fn() <- first css->destroy_work enqueue
 |       |
 |       |-> INIT_WORK(&css->destroy_work, css_killed_work_fn); queue_work(cgroup_destroy_wq, &css->destroy_work);
 |
 |-> goto out_unlock;
      |
      |-> cgroup_kn_unlock(kernfs_node)->cgroup_put(cgrp)->css_put(&cgrp->self)->percpu_ref_put(&css->refcnt) <- this triggers css_release() to be called

css_release(percpu_ref) <- second css->destroy_work enqueue
 |
 |-> INIT_WORK(&css->destroy_work, css_release_work_fn); queue_work(cgroup_destroy_wq, &css->destroy_work) <- and it fails here with "BUG: corrupted list in insert_work; list_add corruption."
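The collision is possible at all because both completion paths reuse the
single destroy_work embedded in the css. Abbreviated from
kernel/cgroup/cgroup.c, comments mine:

/* confirm-kill callback: first user of css->destroy_work */
static void css_killed_ref_fn(struct percpu_ref *ref)
{
	struct cgroup_subsys_state *css =
		container_of(ref, struct cgroup_subsys_state, refcnt);

	if (atomic_dec_and_test(&css->online_cnt)) {
		INIT_WORK(&css->destroy_work, css_killed_work_fn);
		queue_work(cgroup_destroy_wq, &css->destroy_work);
	}
}

/* release callback: second user of the same css->destroy_work */
static void css_release(struct percpu_ref *ref)
{
	struct cgroup_subsys_state *css =
		container_of(ref, struct cgroup_subsys_state, refcnt);

	INIT_WORK(&css->destroy_work, css_release_work_fn);
	queue_work(cgroup_destroy_wq, &css->destroy_work);
}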
What seems to work for me as the simplest fix is to skip the enqueue in
css_release() when the css is already dying, as below: CSS_DYING is set in
kill_css() before the base reference is dropped, so css_release() can tell
that the killed path has already claimed css->destroy_work. Please let me
know if that makes sense to you.
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 1779ccddb734..5618211487cc 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5210,8 +5210,10 @@ static void css_release(struct percpu_ref *ref)
 	struct cgroup_subsys_state *css =
 		container_of(ref, struct cgroup_subsys_state, refcnt);
 
-	INIT_WORK(&css->destroy_work, css_release_work_fn);
-	queue_work(cgroup_destroy_wq, &css->destroy_work);
+	if (!(css->flags & CSS_DYING)) {
+		INIT_WORK(&css->destroy_work, css_release_work_fn);
+		queue_work(cgroup_destroy_wq, &css->destroy_work);
+	}
 }
 
 static void init_and_link_css(struct cgroup_subsys_state *css,
--
Thanks,
Tadeusz