On Wed, May 25, 2022 at 06:14:55PM +0200, Michal Koutný <mkoutny@xxxxxxxx> wrote:
> But the above is not correct. I've looked at the stack trace [1] and the
> offending percpu_ref_put_many is called from an RCU callback
> percpu_ref_switch_to_atomic_rcu(), so I can't actually see why it drops
> to zero there...

The link [1] should have been [1].

After some more thought, the following is a possible sequencing of the
involved functions:

// ref=A: initial state
kill_css()
  css_get                      // ref += F == A+F: fuse
  percpu_ref_kill_and_confirm
    __percpu_ref_switch_to_atomic
      percpu_ref_get           // ref += 1 == A+F+1: atomic mode, self-protection
    percpu_ref_put             // ref -= 1 == A+F: kill the base reference

[via rcu]
percpu_ref_switch_to_atomic_rcu
  percpu_ref_call_confirm_rcu
    css_killed_ref_fn == refcnt.confirm_switch
      queue_work(css->destroy_work)  (1)

[via css->destroy_work]
css_killed_work_fn == wq.func
  offline_css()                // needs fuse
  css_put                      // ref -= F == A: de-fuse

    percpu_ref_put             // ref -= 1 == A-1: remove self-protection
      css_release              // A <= 1 -> 2nd queue_work explodes!
        queue_work(css->destroy_work)  (2)

[via css->destroy_work]
css_release_work_fn == wq.func

Another CPU would have to dispatch and run the css_killed_work_fn callback
in parallel to percpu_ref_switch_to_atomic_rcu. It's a more correct
explanation; however, its likelihood seems very low. Perhaps some debug
prints of percpu_ref_data.data in percpu_ref_call_confirm_rcu could shed
more light onto this [2].

HTH,
Michal

[1] https://syzkaller.appspot.com/text?tag=CrashReport&x=162b5781f00000
[2] I tried notifying syzbot about [3] moments ago.
[3] https://github.com/Werkov/linux/tree/cgroup-ml/css-lifecycle-syzbot