On Wed, Jul 10, 2024 at 11:02:57AM +0800, chenridong wrote:
>
> On 2024/7/9 21:42, chenridong wrote:
> >
> > On 2024/6/10 10:47, Roman Gushchin wrote:
> > > Hi Chen!
> > >
> > > Was this problem found in real life? Do you have a LOCKDEP
> > > splash available?
> > >
> > Sorry for the late email response.
> > Yes, it was. The issue occurred after a long period of stress testing,
> > with a very low probability.
> >
> > > On Jun 7, 2024, at 4:09 AM, Chen Ridong <chenridong@xxxxxxxxxx> wrote:
> > > >
> > > > We found an AA deadlock problem as shown below:
> > > >
> > > > cgroup_destroy_wq        TaskB                      WatchDog                 system_wq
> > > >
> > > > ...
> > > > css_killed_work_fn:
> > > > P(cgroup_mutex)
> > > >                          ...
> > > >                                                     ...
> > > >                                                     __lockup_detector_reconfigure:
> > > >                                                     P(cpu_hotplug_lock.read)
> > > >                                                     ...
> > > >                          ...
> > > >                          percpu_down_write:
> > > >                          P(cpu_hotplug_lock.write)
> > > > ...
> > > >                                                                              cgroup_bpf_release:
> > > >                                                                              P(cgroup_mutex)
> > > >                                                     smp_call_on_cpu:
> > > >                                                     Wait system_wq
> > > >
> > > > cpuset_css_offline:
> > > > P(cpu_hotplug_lock.read)
> > > >
> > > > WatchDog is waiting for system_wq to finish its jobs, system_wq is
> > > > waiting for cgroup_mutex, and the owner of cgroup_mutex is waiting
> > > > for cpu_hotplug_lock. This problem is caused by commit 4bfc0bb2c60e
> > > > ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"),
> > > > which puts the cgroup_bpf release work onto system_wq. As cgroup_bpf
> > > > is a member of cgroup, it is reasonable to put the cgroup_bpf release
> > > > work onto cgroup_destroy_wq, which is used only for cgroup's release
> > > > work, and the problem is solved.
> > >
> > > I need to think more on this, but at first glance the fix looks a
> > > bit confusing. cgroup_bpf_release() looks quite innocent: it only
> > > takes cgroup_mutex. It's not obvious why that is not ok and requires
> > > a dedicated workqueue. What exactly is achieved by placing it back
> > > on the dedicated cgroup destroy queue?
> > >
> > > I'm not trying to say your fix won't work, but it looks like it
> > > might cover a more serious problem.
> >
> > The issue lies in the fact that different tasks acquire cgroup_mutex
> > and cpu_hotplug_lock in a way that eventually forms a deadlock.
> > Placing the cgroup_bpf release work on cgroup_destroy_wq breaks the
> > loop.
>
> The max_active of system_wq is WQ_DFL_ACTIVE (256). If all active works
> are cgroup_bpf release works, they block the smp_call_on_cpu work that
> is enqueued after them. So smp_call_on_cpu, holding cpu_hotplug_lock,
> waits for a completion, but it can never get one, because the cgroup_bpf
> release works cannot take cgroup_mutex and will never finish. Placing
> the cgroup_bpf release works on cgroup_destroy_wq, however, can never
> block the smp_call_on_cpu work, which means the loop is broken. Thus,
> it solves the problem.

Tejun, do you have an opinion on this?

If there are certain limitations from the cgroup side on what can be
done in a generic work context, it would be nice to document them
(e.g. don't grab cgroup_mutex), but I still struggle to understand
what exactly is wrong with the blamed commit.

Thanks,
Roman
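
P.S. For anyone following along, here is a minimal sketch of the change
under discussion, based on the current cgroup_bpf_release_fn() in
kernel/bpf/cgroup.c. Note that cgroup_destroy_wq is static to
kernel/cgroup/cgroup.c, so an actual patch would also need some way to
reach it (a helper, an export, or a dedicated queue); this is only an
illustration of retargeting the work item, not the patch itself:

	/* kernel/bpf/cgroup.c -- sketch only, not the actual patch */
	static void cgroup_bpf_release_fn(struct percpu_ref *ref)
	{
		struct cgroup *cgrp = container_of(ref, struct cgroup, bpf.refcnt);

		INIT_WORK(&cgrp->bpf.release_work, cgroup_bpf_release);
		/*
		 * Was: queue_work(system_wq, &cgrp->bpf.release_work);
		 *
		 * system_wq has max_active = WQ_DFL_ACTIVE (256). If every
		 * active slot is a release work sleeping on cgroup_mutex,
		 * the smp_call_on_cpu() work enqueued behind them never
		 * runs, the watchdog never drops cpu_hotplug_lock.read,
		 * the pending writer never gets the lock, and the
		 * cgroup_mutex owner never acquires the read lock it is
		 * waiting for -- the loop described above. A queue that
		 * never carries the smp_call_on_cpu() work cannot close
		 * that loop.
		 */
		queue_work(cgroup_destroy_wq, &cgrp->bpf.release_work);
	}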