Hello,

On Thu, Nov 14, 2013 at 04:56:49PM -0600, Shawn Bohrer wrote:
> After running both concurrently on 40 machines for about 12 hours I've
> managed to reproduce the issue at least once, possibly more.  One
> machine looked identical to this reported issue.  It has a bunch of
> stuck cgroup_free_fn() kworker threads and one thread in cpuset_attach
> waiting on lru_add_drain_all().  A sysrq+l shows all CPUs are idle
> except for the one triggering the sysrq+l.  The sysrq+w unfortunately
> wrapped dmesg so we didn't get the stacks of all blocked tasks.  We
> did however also cat /proc/<pid>/stack of all kworker threads on the
> system.  There were 265 kworker threads that all have the following
> stack:

Umm... so, WQ_DFL_ACTIVE is 256.  It's just an arbitrary, largish
number which is supposed to serve as protection against runaway
kworker creation.  The assumption is that no dependency chain will be
longer than that, and that anything which can grow longer should be
separated out into its own workqueue.

It looks like we *can* have such a long dependency chain with a high
enough rate of cgroup destruction.  kworkers trying to destroy cgroups
get blocked by an earlier one which is holding cgroup_mutex.  If the
blocked ones completely consume max_active and the earlier one then
tries to perform an operation which uses system_wq, the forward
progress guarantee is broken.

So, yeah, it makes sense now.  We're just going to have to separate
cgroup destruction out into its own workqueue.  Hugh's temp fix
achieved about the same effect by moving the affected part of
destruction to a different workqueue.  I probably should have realized
that we were hitting max_active when I was told that moving some part
to a different workqueue made the problem go away.

Will send out a patch soon.

Thanks.

-- 
tejun
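
[Editor's sketch, for illustration only.]  Roughly, separating destruction
onto its own workqueue could look like the following.  The queue name,
the max_active value of 1, and the destroy_work/cgroup_free_fn hookup
shown here are assumptions for the sketch, not necessarily what the
actual patch will do:

	#include <linux/workqueue.h>

	/* dedicated queue so destruction work no longer competes for system_wq slots */
	static struct workqueue_struct *cgroup_destroy_wq;

	static int __init cgroup_destroy_wq_init(void)
	{
		/*
		 * max_active = 1: destruction work serializes on cgroup_mutex
		 * anyway, so extra concurrency only piles up blocked kworkers.
		 */
		cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1);
		if (!cgroup_destroy_wq)
			return -ENOMEM;
		return 0;
	}
	core_initcall(cgroup_destroy_wq_init);

	/* then queue the destruction callback here instead of on system_wq, e.g.: */
	/*	INIT_WORK(&cgrp->destroy_work, cgroup_free_fn);       (hypothetical field name) */
	/*	queue_work(cgroup_destroy_wq, &cgrp->destroy_work);                              */

The point of the separation is that a dedicated queue gets its own
max_active budget, so even if every destruction worker ends up blocked
on cgroup_mutex, work items queued on system_wq can still make forward
progress.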