Hi Frederic Weisbecker,

On 2022-11-23 at 15:37:58 +0100, Frederic Weisbecker wrote:
> On Mon, Nov 21, 2022 at 01:37:06PM +0800, Pengfei Xu wrote:
> > Hi Frederic Weisbecker and kernel developers,
> >
> > Greeting!
> > There is task hung in "synchronize_rcu" in v6.1-rc5 kernel.
> >
> > Bisected the issue on Raptor and server(No atom small core, big core only),
> > both platforms bisected results show that:
> > first bad commit is c597bfddc9e9e8a63817252b67c3ca0e544ace26:
> > "sched: Provide Kconfig support for default dynamic preempt mode"
> >
> > [ 300.097166] INFO: task rcu_tasks_kthre:11 blocked for more than 147 seconds.
> > [ 300.097455] Not tainted 6.1.0-rc5-094226ad94f4 #1
> > [ 300.097641] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [ 300.097922] task:rcu_tasks_kthre state:D stack:0 pid:11 ppid:2 flags:0x00004000
> > [ 300.098230] Call Trace:
> > [ 300.098325] <TASK>
> > [ 300.098410] __schedule+0x2de/0x8f0
> > [ 300.098562] schedule+0x5b/0xe0
> > [ 300.098693] schedule_timeout+0x3f1/0x4b0
> > [ 300.098849] ? __sanitizer_cov_trace_pc+0x25/0x60
> > [ 300.099032] ? queue_delayed_work_on+0x82/0xc0
> > [ 300.099206] wait_for_completion+0x81/0x140
> > [ 300.099373] __synchronize_srcu.part.23+0x83/0xb0
> > [ 300.099558] ? __bpf_trace_rcu_stall_warning+0x20/0x20
> > [ 300.099757] synchronize_srcu+0xd6/0x100
> > [ 300.099913] rcu_tasks_postscan+0x19/0x20
> > [ 300.100070] rcu_tasks_wait_gp+0x108/0x290
> > [ 300.100230] ? _raw_spin_unlock+0x1d/0x40
> > [ 300.100389] rcu_tasks_one_gp+0x27f/0x370
> > [ 300.100546] ? rcu_tasks_postscan+0x20/0x20
> > [ 300.100709] rcu_tasks_kthread+0x37/0x50
> > [ 300.100863] kthread+0x14d/0x190
> > [ 300.100998] ? kthread_complete_and_exit+0x40/0x40
> > [ 300.101199] ret_from_fork+0x1f/0x30
> > [ 300.101347] </TASK>
>
> Thanks for reporting this. Fortunately I managed to reproduce and debug.
> It took me a few days to understand the complicated circular dependency
> involved.
>
> So here is a summary:
>
> 1) TASK A calls unshare(CLONE_NEWPID), this creates a new PID namespace
> that every subsequent child of TASK A will belong to. But TASK A doesn't
> itself belong to that new PID namespace.
>
> 2) TASK A forks() and creates TASK B (it is a new threadgroup so it is a
> thread group leader). TASK A stays attached to its PID namespace (let's say PID_NS1)
> and TASK B is the first task belonging to the new PID namespace created by
> unshare() (let's call it PID_NS2).
>
> 3) Since TASK B is the first task attached to PID_NS2, it becomes the PID_NS2
> child reaper.
>
> 4) TASK A forks() again and creates TASK C which gets attached to PID_NS2.
> Note how TASK C has TASK A as a parent (belonging to PID_NS1) but has
> TASK B (belonging to PID_NS2) as a pid_namespace child_reaper.
>
> 5) TASK B exits and since it is the child reaper for PID_NS2, it has to
> kill all other tasks attached to PID_NS2, and wait for all of them to die
> before reaping itself (zap_pid_ns_processes()). Note it seems to make a
> misleading assumption here, trusting that all tasks in PID_NS2 either
> get reaped by a parent belonging to the same namespace or by TASK B.
> And it is confident that since it deactivated SIGCHLD handler, all
> the remaining tasks ultimately autoreap. And it waits for that to happen.
> However TASK C escapes that rule because it will get reaped by its parent
> TASK A belonging to PID_NS1.
>
> 6) TASK A calls synchronize_rcu_tasks() which leads to
> synchronize_srcu(&tasks_rcu_exit_srcu).
>
> 7) TASK B is waiting for TASK C to get reaped (wrongly assuming it autoreaps)
> But TASK B is under a tasks_rcu_exit_srcu SRCU critical section
> (exit_notify() is between exit_tasks_rcu_start() and
> exit_tasks_rcu_finish()), blocking TASK A
>
> 8) TASK C exits and since TASK A is its parent, it waits for it to reap TASK C,
> but it can't because TASK A waits for TASK B that waits for TASK C.
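
To make the relationships in the summary above easier to check, here is a
minimal Python model of the three tasks. It is only a sketch, not kernel
code: the Task class, the child_reaper table and stuck_waits() are
hypothetical names introduced for illustration, and the model assumes
nothing beyond what the summary states (each task's parent, its PID
namespace, and the namespace child reaper).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    """Hypothetical stand-in for a task: only the fields the summary mentions."""
    name: str
    pid_ns: str
    parent: Optional["Task"] = None


# Topology from the summary: A stays in PID_NS1, its children land in PID_NS2.
task_a = Task("A", pid_ns="PID_NS1")
task_b = Task("B", pid_ns="PID_NS2", parent=task_a)  # first task in PID_NS2
task_c = Task("C", pid_ns="PID_NS2", parent=task_a)  # second fork of A

# The first task attached to a PID namespace becomes its child reaper.
child_reaper = {"PID_NS2": task_b}


def stuck_waits(tasks, ns):
    """Names of tasks the exiting reaper of `ns` waits on, but whose reaping
    depends on a parent living in a different PID namespace."""
    reaper = child_reaper[ns]
    return [
        t.name
        for t in tasks
        if t.pid_ns == ns
        and t is not reaper
        and t.parent is not None
        and t.parent.pid_ns != ns
    ]


print(stuck_waits([task_a, task_b, task_c], "PID_NS2"))
```

In this model only TASK C is reported: it lives in PID_NS2, but its reaping
depends on TASK A, which is outside that namespace.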
>
> So there is a circular dependency:
>
> _ TASK A waits for TASK B to get out of tasks_rcu_exit_srcu SRCU critical
> section
> _ TASK B waits for TASK C to get reaped
> _ TASK C waits for TASK A to reap it.
>
> I have no idea how to solve the situation without violating the pid_namespace
> rules and unshare() semantics (although I wish unshare(CLONE_NEWPID) had a less
> error prone behaviour with allowing creating more than one task belonging to the
> same namespace).
>
> So probably having an SRCU read side critical section within exit_notify() is
> not a good idea, is there a solution to work around that for rcu tasks?
>

Thanks for the analysis!
One more piece of information:
I tried to revert only this commit on top of v6.1-rc5 mainline with a
script, but the revert broke the kernel build. Since the revert step could
not be verified, I cannot confirm that the bisect result is 100% accurate.
I have provided all the information I could.
This issue is too difficult for me. If I find more clues, I will update
this email thread.

Thanks!
BR.

> Thanks.
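
The circular dependency described above can also be written down as a tiny
wait-for graph. The sketch below is plain Python, not kernel code, and
find_cycle() is a hypothetical helper written for this note. It assumes each
task waits for exactly one other task, as in the three bullets of the
analysis.

```python
# Wait-for edges taken from the analysis: each key waits for its value.
waits_for = {
    "TASK A": "TASK B",  # A waits for B to leave the tasks_rcu_exit_srcu section
    "TASK B": "TASK C",  # B waits for C to get reaped
    "TASK C": "TASK A",  # C waits for its parent A to reap it
}


def find_cycle(graph, start):
    """Follow wait-for edges from `start`.

    Returns the cycle as a list that begins and ends with the same task,
    or None if the chain ends without revisiting a task.
    """
    path = []
    node = start
    while node is not None and node not in path:
        path.append(node)
        node = graph.get(node)
    if node is None:
        return None
    return path[path.index(node):] + [node]


cycle = find_cycle(waits_for, "TASK A")
print(" -> ".join(cycle))  # TASK A -> TASK B -> TASK C -> TASK A
```

Removing any one edge, for example if TASK B did not wait for TASK C while
inside the SRCU read-side section, makes find_cycle() return None in this
model.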