On 2024/9/27 12:22, Vishal Chourasia wrote:
On Mon, Sep 23, 2024 at 11:43:50AM +0000, Chen Ridong wrote:
I found a hung_task problem as shown below:
INFO: task kworker/0:0:8 blocked for more than 327 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Workqueue: events cgroup_bpf_release
Call Trace:
<TASK>
__schedule+0x5a2/0x2050
? find_held_lock+0x33/0x100
? wq_worker_sleeping+0x9e/0xe0
schedule+0x9f/0x180
schedule_preempt_disabled+0x25/0x50
__mutex_lock+0x512/0x740
? cgroup_bpf_release+0x1e/0x4d0
? cgroup_bpf_release+0xcf/0x4d0
? process_scheduled_works+0x161/0x8a0
? cgroup_bpf_release+0x1e/0x4d0
? mutex_lock_nested+0x2b/0x40
? __pfx_delay_tsc+0x10/0x10
mutex_lock_nested+0x2b/0x40
cgroup_bpf_release+0xcf/0x4d0
? process_scheduled_works+0x161/0x8a0
? trace_event_raw_event_workqueue_execute_start+0x64/0xd0
? process_scheduled_works+0x161/0x8a0
process_scheduled_works+0x23a/0x8a0
worker_thread+0x231/0x5b0
? __pfx_worker_thread+0x10/0x10
kthread+0x14d/0x1c0
? __pfx_kthread+0x10/0x10
ret_from_fork+0x59/0x70
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1b/0x30
</TASK>
This issue can be reproduced by the following pressure test:
1. A large number of cpuset cgroups are deleted.
2. Set cpus on and off repeatedly.
3. Set watchdog_thresh repeatedly.
The scripts can be obtained at the LINK mentioned above the signature.
The reason for this issue is that cgroup_mutex and cpu_hotplug_lock are
acquired in different tasks, which may lead to deadlock through the
following steps:
1. A large number of cpusets are deleted asynchronously, which puts a
large number of cgroup_bpf_release works into system_wq. The max_active
of system_wq is WQ_DFL_ACTIVE(256). Consequently, all active works are
cgroup_bpf_release works, and many cgroup_bpf_release works will be put
into the inactive queue. As illustrated in the diagram, there are 256 (in
the active queue) + n (in the inactive queue) works.
2. Setting watchdog_thresh will hold cpu_hotplug_lock.read and put an
smp_call_on_cpu work into system_wq (see the simplified sketch of
smp_call_on_cpu() after the diagram). However, step 1 has already filled
system_wq, so 'sscs.work' is put into the inactive queue. 'sscs.work' has
to wait until the works that were put into the inactive queue earlier
have executed (n cgroup_bpf_release works), so it will be blocked for a
while.
3. CPU offline requires cpu_hotplug_lock.write, which is blocked by step 2.
4. Cpusets that were deleted at step 1 put cgroup_release works into
cgroup_destroy_wq. They are competing to get cgroup_mutex all the time.
When cgroup_mutex is acquired by the work at css_killed_work_fn, it will
call cpuset_css_offline, which needs to acquire cpu_hotplug_lock.read.
However, cpuset_css_offline will be blocked due to step 3.
5. At this moment, there are 256 works in the active queue that are
cgroup_bpf_release works; they are attempting to acquire cgroup_mutex,
and as a result, all of them are blocked. Consequently, sscs.work cannot
be executed. Ultimately, this situation leads to four processes being
blocked, forming a deadlock.
system_wq(step1)        WatchDog(step2)         cpu offline(step3)      cgroup_destroy_wq(step4)
...
2000+ cgroups deleted async
256 actives + n inactives
                        __lockup_detector_reconfigure
                        P(cpu_hotplug_lock.read)
                        put sscs.work into system_wq
256 + n + 1(sscs.work)
sscs.work wait to be executed
                        waiting for sscs.work to finish
                                                percpu_down_write
                                                P(cpu_hotplug_lock.write)
                                                ...blocking...
                                                                        css_killed_work_fn
                                                                        P(cgroup_mutex)
                                                                        cpuset_css_offline
                                                                        P(cpu_hotplug_lock.read)
                                                                        ...blocking...
256 cgroup_bpf_release
mutex_lock(&cgroup_mutex);
...blocking...
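For reference, the 'sscs.work' above comes from smp_call_on_cpu() in
kernel/smp.c, which queues an on-stack work item on system_wq and then
waits for it synchronously. A simplified sketch from memory (details
elided; please check the actual tree):

int smp_call_on_cpu(int cpu, int (*func)(void *), void *par, bool phys)
{
	struct smp_call_on_cpu_struct sscs = {
		.done = COMPLETION_INITIALIZER_ONSTACK(sscs.done),
		.func = func,
		.data = par,
	};

	INIT_WORK_ONSTACK(&sscs.work, smp_call_on_cpu_callback);
	/* With 256 works already active (step 1), this lands in the
	 * inactive queue behind n cgroup_bpf_release works. */
	queue_work_on(cpu, system_wq, &sscs.work);
	/* The watchdog path sleeps here while holding
	 * cpu_hotplug_lock.read (step 2). */
	wait_for_completion(&sscs.done);

	return sscs.ret;
}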
To fix the problem, place cgroup_bpf_release works on a dedicated
workqueue, which breaks the loop and solves the problem. System wqs are
for misc things which shouldn't create a large number of concurrent work
items. If something is going to generate >WQ_DFL_ACTIVE(256) concurrent
work items, it should use its own dedicated workqueue.
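In pattern form (hypothetical 'foo' names for illustration; the real fix
is in the diff below), that rule looks like this. Note that
alloc_workqueue()'s third argument is max_active, where 0 means
WQ_DFL_ACTIVE:

static struct workqueue_struct *foo_destroy_wq;

static int __init foo_wq_init(void)
{
	/* flags = 0, max_active = 1: destruction items run one at a
	 * time on this queue and, however many are pending, can no
	 * longer starve unrelated users of system_wq. */
	foo_destroy_wq = alloc_workqueue("foo_destroy", 0, 1);
	return foo_destroy_wq ? 0 : -ENOMEM;
}
core_initcall(foo_wq_init);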
Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself")
Link: https://lore.kernel.org/cgroups/e90c32d2-2a85-4f28-9154-09c7d320cb60@xxxxxxxxxx/T/#t
Signed-off-by: Chen Ridong <chenridong@xxxxxxxxxx>
Tested-by: Vishal Chourasia <vishalc@xxxxxxxxxxxxx>
Thank you, Chen, for sharing the details on how to reproduce, and for the
patchset.
Steps I followed to reproduce:
1) run cgroup-make.sh
2) run hotplug.sh
3) run watchdog.sh
# cat cgroup-make.sh
#!/bin/bash
echo 30 > /proc/sys/kernel/hung_task_timeout_secs
cat /proc/sys/kernel/hung_task_timeout_secs
echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
mkdir /sys/fs/cgroup/memory
# echo +memory > /sys/fs/cgroup/memory/cgroup.subtree_control
mkdir /sys/fs/cgroup/cpuset
echo +cpuset > /sys/fs/cgroup/cpuset/cgroup.subtree_control
echo +cpu > /sys/fs/cgroup/cpuset/cgroup.subtree_control
timestamp=$(date +%s)
echo $timestamp
while true; do
    for i in {0..2000}; do
        mkdir /sys/fs/cgroup/cpuset/test${timestamp}_${i} &
        mkdir /sys/fs/cgroup/memory/test${timestamp}_${i} &
    done
    for i in {0..2000}; do
        rmdir /sys/fs/cgroup/cpuset/test${timestamp}_${i} &
        rmdir /sys/fs/cgroup/memory/test${timestamp}_${i} &
    done
done
# cat hotplug.sh
#!/bin/bash
while true
do
    echo 1 > /sys/devices/system/cpu/cpu2/online
    echo 1 > /sys/devices/system/cpu/cpu3/online
    echo 0 > /sys/devices/system/cpu/cpu2/online
    echo 0 > /sys/devices/system/cpu/cpu3/online
done
# cat watchdog.sh
#!/bin/bash
while true
do
    echo 12 > /proc/sys/kernel/watchdog_thresh
    echo 11 > /proc/sys/kernel/watchdog_thresh
    echo 10 > /proc/sys/kernel/watchdog_thresh
done
With these steps, I was able to get the hung_task timeout log messages:
INFO: task kworker/7:1:84 blocked for more than 30 seconds.
Not tainted 6.11.0-chenridong_base-10547-g684a64bf32b6-dirty #59
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/7:1 state:D stack:0 pid:84 tgid:84 ppid:2 flags:0x00000000
Workqueue: events cgroup_bpf_release
Call Trace:
[c00000000ee779a0] [c00000000ee779e0] 0xc00000000ee779e0 (unreliable)
[c00000000ee77b50] [c00000000001f79c] __switch_to+0x14c/0x220
[c00000000ee77bb0] [c0000000010e8cd0] __schedule+0x2c0/0x840
[c00000000ee77c90] [c0000000010e9290] schedule+0x40/0x110
[c00000000ee77d00] [c0000000010e95b0] schedule_preempt_disabled+0x20/0x30
[c00000000ee77d20] [c0000000010ec408] __mutex_lock.constprop.0+0x5e8/0xbe0
[c00000000ee77db0] [c000000000472f58] cgroup_bpf_release+0x98/0x3d0
[c00000000ee77e40] [c0000000001886a8] process_one_work+0x1f8/0x520
[c00000000ee77ef0] [c00000000018a01c] worker_thread+0x33c/0x4f0
[c00000000ee77f90] [c0000000001970c8] kthread+0x138/0x140
[c00000000ee77fe0] [c00000000000dd58] start_kernel_thread+0x14/0x18
INFO: task kworker/4:1:98 blocked for more than 30 seconds.
Not tainted 6.11.0-chenridong_base-10547-g684a64bf32b6-dirty #59
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/4:1 state:D stack:0 pid:98 tgid:98 ppid:2 flags:0x00000000
Workqueue: events cgroup_bpf_release
Call Trace:
[c00000000ee1f9a0] [c00000000ee1f9e0] 0xc00000000ee1f9e0 (unreliable)
[c00000000ee1fb50] [c00000000001f79c] __switch_to+0x14c/0x220
[c00000000ee1fbb0] [c0000000010e8cd0] __schedule+0x2c0/0x840
[c00000000ee1fc90] [c0000000010e9290] schedule+0x40/0x110
[c00000000ee1fd00] [c0000000010e95b0] schedule_preempt_disabled+0x20/0x30
[c00000000ee1fd20] [c0000000010ec408] __mutex_lock.constprop.0+0x5e8/0xbe0
[c00000000ee1fdb0] [c000000000472f58] cgroup_bpf_release+0x98/0x3d0
[c00000000ee1fe40] [c0000000001886a8] process_one_work+0x1f8/0x520
[c00000000ee1fef0] [c00000000018a01c] worker_thread+0x33c/0x4f0
[c00000000ee1ff90] [c0000000001970c8] kthread+0x138/0x140
[c00000000ee1ffe0] [c00000000000dd58] start_kernel_thread+0x14/0x18
INFO: task kworker/5:1:110 blocked for more than 30 seconds.
Not tainted 6.11.0-chenridong_base-10547-g684a64bf32b6-dirty #59
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/5:1 state:D stack:0 pid:110 tgid:110 ppid:2 flags:0x00000000
Workqueue: events cgroup_bpf_release
Call Trace:
[c0000000608bf9a0] [c0000000608bf9e0] 0xc0000000608bf9e0 (unreliable)
[c0000000608bfb50] [c00000000001f79c] __switch_to+0x14c/0x220
[c0000000608bfbb0] [c0000000010e8cd0] __schedule+0x2c0/0x840
[c0000000608bfc90] [c0000000010e9290] schedule+0x40/0x110
[c0000000608bfd00] [c0000000010e95b0] schedule_preempt_disabled+0x20/0x30
[c0000000608bfd20] [c0000000010ec408] __mutex_lock.constprop.0+0x5e8/0xbe0
[c0000000608bfdb0] [c000000000472f58] cgroup_bpf_release+0x98/0x3d0
[c0000000608bfe40] [c0000000001886a8] process_one_work+0x1f8/0x520
[c0000000608bfef0] [c00000000018a01c] worker_thread+0x33c/0x4f0
[c0000000608bff90] [c0000000001970c8] kthread+0x138/0x140
[c0000000608bffe0] [c00000000000dd58] start_kernel_thread+0x14/0x18
Future hung task reports are suppressed, see sysctl kernel.hung_task_warnings
After applying this patchset, I didn't see any such hung_task messages
printed in dmesg.
State of the git repo:
$ git log --oneline
a40aebb33934 (HEAD -> patches/v5_20240923_chenridong_add_dedicated_wq_for_cgroup_bpf_and_adjust_wq_max_active) workqueue: Adjust WQ_MAX_ACTIVE from 512 to 2048
08a2979a9e59 workqueue: doc: Add a note saturating the system_wq is not permitted
0e6f5ea2955f cgroup/bpf: use a dedicated workqueue for cgroup bpf destruction
684a64bf32b6 Merge tag 'nfs-for-6.12-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
---
kernel/bpf/cgroup.c | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index e7113d700b87..1a7609f61d44 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -24,6 +24,22 @@
 DEFINE_STATIC_KEY_ARRAY_FALSE(cgroup_bpf_enabled_key, MAX_CGROUP_BPF_ATTACH_TYPE);
 EXPORT_SYMBOL(cgroup_bpf_enabled_key);
 
+/*
+ * cgroup bpf destruction makes heavy use of work items and there can be a lot
+ * of concurrent destructions. Use a separate workqueue so that cgroup bpf
+ * destruction work items don't end up filling up max_active of system_wq
+ * which may lead to deadlock.
+ */
+static struct workqueue_struct *cgroup_bpf_destroy_wq;
+
+static int __init cgroup_bpf_wq_init(void)
+{
+	cgroup_bpf_destroy_wq = alloc_workqueue("cgroup_bpf_destroy", 0, 1);
+	WARN_ON_ONCE(!cgroup_bpf_destroy_wq);
+	return 0;
+}
+core_initcall(cgroup_bpf_wq_init);
+
 /* __always_inline is necessary to prevent indirect call through run_prog
  * function pointer.
  */
@@ -334,7 +350,7 @@ static void cgroup_bpf_release_fn(struct percpu_ref *ref)
 	struct cgroup *cgrp = container_of(ref, struct cgroup, bpf.refcnt);
 
 	INIT_WORK(&cgrp->bpf.release_work, cgroup_bpf_release);
-	queue_work(system_wq, &cgrp->bpf.release_work);
+	queue_work(cgroup_bpf_destroy_wq, &cgrp->bpf.release_work);
 }
 
 /* Get underlying bpf_prog of bpf_prog_list entry, regardless if it's through
--
2.34.1
Thank you for doing that.
Best regards,
Ridong