I rechecked the logs and found no information about who may be holding
the lock. I only found more stack traces of tasks waiting on the lock,
for instance:

Oct 8 11:01:27 kaa-12 kernel: [86845.048183] [<ffffffff813c3b58>] mutex_lock+0x12/0x22
Oct 8 11:01:27 kaa-12 kernel: [86845.048192] [<ffffffff81085e57>] cgroup_rmdir+0x15/0x35
Oct 8 11:01:27 kaa-12 kernel: [86845.048200] [<ffffffff810fe7d6>] vfs_rmdir+0x69/0xb4
Oct 8 11:01:27 kaa-12 kernel: [86845.048207] [<ffffffff810fe8eb>] do_rmdir+0xca/0x137
Oct 8 11:01:27 kaa-12 kernel: [86845.048217] [<ffffffff8100c259>] ? syscall_trace_enter+0xd5/0x14c

Oct 8 11:01:27 kaa-12 kernel: [86845.048359] [<ffffffff813c3b58>] mutex_lock+0x12/0x22
Oct 8 11:01:27 kaa-12 kernel: [86845.048368] [<ffffffff8108286a>] cgroup_free_fn+0x1f/0xc3
Oct 8 11:01:27 kaa-12 kernel: [86845.048378] [<ffffffff81047cb7>] process_one_work+0x15f/0x21e

Oct 8 11:01:27 kaa-12 kernel: [86845.048762] [<ffffffff813c3b58>] mutex_lock+0x12/0x22
Oct 8 11:01:27 kaa-12 kernel: [86845.048770] [<ffffffff810841e8>] cgroup_release_agent+0x24/0x141
Oct 8 11:01:27 kaa-12 kernel: [86845.048778] [<ffffffff813c56d6>] ? __schedule+0x4b2/0x560
Oct 8 11:01:27 kaa-12 kernel: [86845.048787] [<ffffffff81047cb7>] process_one_work+0x15f/0x21e

Oct 8 11:01:27 kaa-12 kernel: [86845.049639] [<ffffffff813c3b58>] mutex_lock+0x12/0x22
Oct 8 11:01:27 kaa-12 kernel: [86845.049647] [<ffffffff8108286a>] cgroup_free_fn+0x1f/0xc3
Oct 8 11:01:27 kaa-12 kernel: [86845.049657] [<ffffffff81047cb7>] process_one_work+0x15f/0x21e

But I suppose the lock is taken (and possibly never released) elsewhere.
Are there any kernel options I could activate for more debug output, or
tools to find out who is holding the lock (or who forgot to unlock it)?
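If it would help, I could rebuild the kernel with lock debugging enabled.
This is the set of options I would try, plus the sysrq dumps to trigger
once the hang occurs (just my reading of the lockdep documentation, not
yet tested on our cluster):

  # .config options for a debug kernel
  CONFIG_PROVE_LOCKING=y   # lockdep: reports deadlock cycles as locks are taken
  CONFIG_DEBUG_MUTEXES=y   # records the owning task in struct mutex
  CONFIG_LOCK_STAT=y       # optional: contention statistics in /proc/lock_stat
  CONFIG_MAGIC_SYSRQ=y     # allows dumping state at runtime

  # When the hang occurs (output goes to the kernel log):
  echo d > /proc/sysrq-trigger   # show all locks currently held (needs lockdep)
  echo w > /proc/sysrq-trigger   # show all blocked (D state) tasks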
On Fri, Oct 11, 2013 at 3:06 PM, Li Zefan <lizefan@xxxxxxxxxx> wrote:
> On 2013/10/10 16:50, Markus Blank-Burian wrote:
>> Hi,
>>
>
> Thanks for the report.
>
>> I have upgraded all nodes on our computing cluster to 3.11.3 last week
>> (from 3.10.9) and am experiencing deadlocks in kernel threads connected
>> to cgroups. They appear sometimes when our queuing system (slurm 2.6.0)
>> tries to clean up its cgroups (using the freezer, cpuset, memory and
>> devices subsystems). I have attached the associated kernel messages as
>> well as the cleanup script.
>>
>
> We've changed the cgroup destroy path dramatically, including the switch
> to per-cpu refs, so those changes probably introduced this bug.
>
>> Oct 10 00:39:48 kaa-14 kernel: [169967.617545] INFO: task kworker/7:0:5201 blocked for more than 120 seconds.
>> Oct 10 00:39:48 kaa-14 kernel: [169967.617557] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> Oct 10 00:39:48 kaa-14 kernel: [169967.617563] kworker/7:0 D ffff88077e873328 0 5201 2 0x00000000
>> Oct 10 00:39:48 kaa-14 kernel: [169967.617583] Workqueue: events cgroup_offline_fn
>> Oct 10 00:39:48 kaa-14 kernel: [169967.617590] ffff8804a4129d70 0000000000000002 ffff8804adc60000 ffff8804a4129fd8
>> Oct 10 00:39:48 kaa-14 kernel: [169967.617599] ffff8804a4129fd8 0000000000011c40 ffff88077e872ee0 ffffffff81634ae0
>> Oct 10 00:39:48 kaa-14 kernel: [169967.617608] ffffffff81634ae4 ffff88077e872ee0 ffffffff81634ae8 00000000ffffffff
>> Oct 10 00:39:48 kaa-14 kernel: [169967.617617] Call Trace:
>> Oct 10 00:39:48 kaa-14 kernel: [169967.617634] [<ffffffff813c57e4>] schedule+0x60/0x62
>> Oct 10 00:39:48 kaa-14 kernel: [169967.617645] [<ffffffff813c5a6b>] schedule_preempt_disabled+0x13/0x1f
>> Oct 10 00:39:48 kaa-14 kernel: [169967.617654] [<ffffffff813c4987>] __mutex_lock_slowpath+0x143/0x1d4
>> Oct 10 00:39:48 kaa-14 kernel: [169967.617665] [<ffffffff8105a3e8>] ? arch_vtime_task_switch+0x6a/0x6f
>> Oct 10 00:39:48 kaa-14 kernel: [169967.617673] [<ffffffff813c3b58>] mutex_lock+0x12/0x22
>> Oct 10 00:39:48 kaa-14 kernel: [169967.617681] [<ffffffff81084f4f>] cgroup_offline_fn+0x36/0x137
>
> All the tasks are blocked on the cgroup mutex, but that doesn't tell us
> who's holding this lock, which is vital.
>
> Are there any other kernel warnings in the kernel log?
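PS: If it happens again, I could also capture a vmcore with kdump and try
to read the owner of cgroup_mutex directly with the crash utility. Roughly
what I have in mind (only a sketch; as far as I understand, the owner field
of struct mutex is only populated when CONFIG_DEBUG_MUTEXES or the mutex
spin-on-owner code is enabled):

  crash> struct mutex cgroup_mutex   # dump the global cgroup mutex, incl. owner
  crash> ps | grep UN                # list uninterruptible (blocked) tasks
  crash> bt <pid>                    # backtrace of the suspected owner task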