On Wed, Jul 10, 2024 at 11:02:57AM +0800, chenridong wrote:
>
> On 2024/7/9 21:42, chenridong wrote:
> >
> > On 2024/6/10 10:47, Roman Gushchin wrote:
> > > Hi Chen!
> > >
> > > Was this problem found in real life? Do you have a LOCKDEP
> > > splash available?
> > >
> > Sorry for the late email response.
> > Yes, it was. The issue occurred after a long period of stress testing,
> > with a very low probability.
> >
> > > On Jun 7, 2024, at 4:09 AM, Chen Ridong <chenridong@xxxxxxxxxx> wrote:
> > > >
> > > > We found an AA deadlock problem as shown below:
> > > >
> > > > cgroup_destroy_wq        TaskB                      WatchDog                 system_wq
> > > >
> > > > ...
> > > > css_killed_work_fn:
> > > > P(cgroup_mutex)
> > > >                          ...
> > > >                                                     ...
> > > >                                                     __lockup_detector_reconfigure:
> > > >                                                     P(cpu_hotplug_lock.read)
> > > >                                                     ...
> > > >                          ...
> > > >                          percpu_down_write:
> > > >                          P(cpu_hotplug_lock.write)
> > > > ...
> > > >                                                                              cgroup_bpf_release:
> > > >                                                                              P(cgroup_mutex)
> > > >                                                     smp_call_on_cpu:
> > > >                                                     Wait system_wq
> > > >
> > > > cpuset_css_offline:
> > > > P(cpu_hotplug_lock.read)
> > > >
> > > > WatchDog is waiting for system_wq to finish its jobs, system_wq is
> > > > waiting for cgroup_mutex, and the owner of cgroup_mutex is waiting
> > > > for cpu_hotplug_lock. This problem is caused by commit 4bfc0bb2c60e
> > > > ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"),
> > > > which puts the cgroup_bpf release work onto system_wq. As cgroup_bpf
> > > > is a member of cgroup, it is reasonable to put the cgroup_bpf release
> > > > work onto cgroup_destroy_wq, which is used only for cgroup's release
> > > > work, and the problem is solved.
> > >
> > > I need to think more on this, but at first glance the fix looks a
> > > bit confusing. cgroup_bpf_release() looks quite innocent: it only
> > > takes cgroup_mutex. It's not obvious why that is not ok and requires
> > > a dedicated workqueue. What exactly is achieved by placing it back
> > > on the dedicated cgroup destroy queue?
> > >
> > > I'm not trying to say your fix won't work, but it looks like it
> > > might cover a more serious problem.
> >
> > The issue lies in the fact that different tasks acquire cgroup_mutex
> > and cpu_hotplug_lock in a way that eventually forms a deadlock.
> > Placing the cgroup_bpf release work on cgroup_destroy_wq breaks the
> > loop.
>
> The max_active of system_wq is WQ_DFL_ACTIVE (256). If all active works
> are cgroup_bpf release works, they block the smp_call_on_cpu work that
> is enqueued after them. So smp_call_on_cpu, holding cpu_hotplug_lock,
> waits for a completion, but it can never get one, because the cgroup_bpf
> release works cannot take cgroup_mutex and will never finish. Placing
> the cgroup_bpf release works on cgroup_destroy_wq, however, can never
> block the smp_call_on_cpu work, which means the loop is broken. Thus,
> it solves the problem.

Tejun, do you have an opinion on this?

If there are certain limitations from the cgroup side on what can be
done in a generic work context, it would be nice to document them
(e.g. don't grab cgroup_mutex), but I still struggle to understand
what exactly is wrong with the blamed commit.

Thanks,
Roman
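
P.S. For anyone following along, here is a minimal sketch of the change
under discussion, based on the current cgroup_bpf_release_fn() in
kernel/bpf/cgroup.c. Note that cgroup_destroy_wq is static to
kernel/cgroup/cgroup.c, so an actual patch would also need some way to
reach it (a helper, an export, or a dedicated queue); this is only an
illustration of retargeting the work item, not the patch itself:

	/* kernel/bpf/cgroup.c -- sketch only, not the actual patch */
	static void cgroup_bpf_release_fn(struct percpu_ref *ref)
	{
		struct cgroup *cgrp = container_of(ref, struct cgroup, bpf.refcnt);

		INIT_WORK(&cgrp->bpf.release_work, cgroup_bpf_release);
		/*
		 * Was: queue_work(system_wq, &cgrp->bpf.release_work);
		 *
		 * system_wq has max_active = WQ_DFL_ACTIVE (256). If every
		 * active slot is a release work sleeping on cgroup_mutex,
		 * the smp_call_on_cpu() work enqueued behind them never
		 * runs, the watchdog never drops cpu_hotplug_lock.read,
		 * the pending writer never gets the lock, and the
		 * cgroup_mutex owner never acquires the read lock it is
		 * waiting for -- the loop described above. A queue that
		 * never carries the smp_call_on_cpu() work cannot close
		 * that loop.
		 */
		queue_work(cgroup_destroy_wq, &cgrp->bpf.release_work);
	}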