Re: PID_NS unshare VS synchronize_rcu_tasks() (was: Re: [Syzkaller & bisect] There is task hung in "synchronize_rcu" in v6.1-rc5 kernel)

Pengfei Xu <pengfei.xu@xxxxxxxxx> · Wed, 23 Nov 2022 23:45:50 +0800

Hi Frederic Weisbecker,

On 2022-11-23 at 15:37:58 +0100, Frederic Weisbecker wrote:
> On Mon, Nov 21, 2022 at 01:37:06PM +0800, Pengfei Xu wrote:
> > Hi Frederic Weisbecker and kernel developers,
> > 
> > Greetings!
> > There is a hung task in "synchronize_rcu" in the v6.1-rc5 kernel.
> > 
> > I bisected the issue on Raptor and on a server (no Atom small cores, big cores only);
> > the bisect results on both platforms show that the
> > first bad commit is c597bfddc9e9e8a63817252b67c3ca0e544ace26:
> > "sched: Provide Kconfig support for default dynamic preempt mode"
> > 
> > [  300.097166] INFO: task rcu_tasks_kthre:11 blocked for more than 147 seconds.
> > [  300.097455]       Not tainted 6.1.0-rc5-094226ad94f4 #1
> > [  300.097641] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [  300.097922] task:rcu_tasks_kthre state:D stack:0     pid:11    ppid:2      flags:0x00004000
> > [  300.098230] Call Trace:
> > [  300.098325]  <TASK>
> > [  300.098410]  __schedule+0x2de/0x8f0
> > [  300.098562]  schedule+0x5b/0xe0
> > [  300.098693]  schedule_timeout+0x3f1/0x4b0
> > [  300.098849]  ? __sanitizer_cov_trace_pc+0x25/0x60
> > [  300.099032]  ? queue_delayed_work_on+0x82/0xc0
> > [  300.099206]  wait_for_completion+0x81/0x140
> > [  300.099373]  __synchronize_srcu.part.23+0x83/0xb0
> > [  300.099558]  ? __bpf_trace_rcu_stall_warning+0x20/0x20
> > [  300.099757]  synchronize_srcu+0xd6/0x100
> > [  300.099913]  rcu_tasks_postscan+0x19/0x20
> > [  300.100070]  rcu_tasks_wait_gp+0x108/0x290
> > [  300.100230]  ? _raw_spin_unlock+0x1d/0x40
> > [  300.100389]  rcu_tasks_one_gp+0x27f/0x370
> > [  300.100546]  ? rcu_tasks_postscan+0x20/0x20
> > [  300.100709]  rcu_tasks_kthread+0x37/0x50
> > [  300.100863]  kthread+0x14d/0x190
> > [  300.100998]  ? kthread_complete_and_exit+0x40/0x40
> > [  300.101199]  ret_from_fork+0x1f/0x30
> > [  300.101347]  </TASK>
> 
> Thanks for reporting this. Fortunately I managed to reproduce and debug it.
> It took me a few days to understand the complicated circular dependency
> involved.
> 
> So here is a summary:
> 
> 1) TASK A calls unshare(CLONE_NEWPID), this creates a new PID namespace
>    that every subsequent child of TASK A will belong to. But TASK A doesn't
>    itself belong to that new PID namespace.
> 
> 2) TASK A forks() and creates TASK B (it is a new threadgroup so it is a
>    thread group leader). TASK A stays attached to its PID namespace (let's say PID_NS1)
>    and TASK B is the first task belonging to the new PID namespace created by
>    unshare()  (let's call it PID_NS2).
> 
> 3) Since TASK B is the first task attached to PID_NS2, it becomes the PID_NS2
>    child reaper.
> 
> 4) TASK A forks() again and creates TASK C which gets attached to PID_NS2.
>    Note how TASK C has TASK A as a parent (belonging to PID_NS1) but has
>    TASK B (belonging to PID_NS2) as a pid_namespace child_reaper.
> 
> 5) TASK B exits and since it is the child reaper for PID_NS2, it has to
>    kill all other tasks attached to PID_NS2, and wait for all of them to die
>    before reaping itself (zap_pid_ns_processes()). Note it seems to make a
>    misleading assumption here, trusting that all tasks in PID_NS2 either
>    get reaped by a parent belonging to the same namespace or by TASK B.
>    And it is confident that since it deactivated the SIGCHLD handler, all
>    the remaining tasks ultimately autoreap. And it waits for that to happen.
>    However TASK C escapes that rule because it will get reaped by its parent
>    TASK A belonging to PID_NS1.
> 
> 6) TASK A calls synchronize_rcu_tasks() which leads to
>    synchronize_srcu(&tasks_rcu_exit_srcu).
> 
> 7) TASK B is waiting for TASK C to get reaped (wrongly assuming it autoreaps).
>    But TASK B is under a tasks_rcu_exit_srcu SRCU critical section
>    (exit_notify() is between exit_tasks_rcu_start() and
>    exit_tasks_rcu_finish()), blocking TASK A.
> 
> 8) TASK C exits and since TASK A is its parent, it waits for it to reap TASK C,
>    but it can't because TASK A waits for TASK B that waits for TASK C.
> 
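The parent/child-reaper split described in the steps above can be illustrated with a small user-space model. This is only a sketch, not kernel code: the `PidNamespace` and `Task` classes and the `unshare_newpid()` and `fork()` helpers are hypothetical names that mimic the described semantics of unshare(CLONE_NEWPID) followed by two forks.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class PidNamespace:
    name: str
    # The first task created inside the namespace becomes its child reaper.
    child_reaper: Optional["Task"] = None


@dataclass(eq=False)
class Task:
    name: str
    pid_ns: PidNamespace
    parent: Optional["Task"] = None
    # Namespace that future children will join; changed by unshare_newpid().
    pid_ns_for_children: Optional[PidNamespace] = None

    def __post_init__(self):
        if self.pid_ns_for_children is None:
            self.pid_ns_for_children = self.pid_ns


def unshare_newpid(task: Task, ns_name: str) -> None:
    """Model of unshare(CLONE_NEWPID): the caller stays in its own namespace,
    only its subsequent children land in the new one."""
    task.pid_ns_for_children = PidNamespace(ns_name)


def fork(parent: Task, child_name: str) -> Task:
    """Model of fork(): the child joins the parent's namespace-for-children."""
    ns = parent.pid_ns_for_children
    child = Task(child_name, pid_ns=ns, parent=parent)
    if ns.child_reaper is None:
        ns.child_reaper = child
    return child


ns1 = PidNamespace("PID_NS1")
task_a = Task("TASK A", pid_ns=ns1)
unshare_newpid(task_a, "PID_NS2")   # step 1
task_b = fork(task_a, "TASK B")     # steps 2 and 3
task_c = fork(task_a, "TASK C")     # step 4

print(task_c.parent.name, task_c.parent.pid_ns.name)
print(task_c.pid_ns.child_reaper.name, task_c.pid_ns.name)
```

In this model TASK C ends up with a parent in one namespace and a child reaper in another, which is the configuration the remaining steps rely on.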
> So there is a circular dependency:
> 
> _ TASK A waits for TASK B to get out of tasks_rcu_exit_srcu SRCU critical
>   section
> _ TASK B waits for TASK C to get reaped
> _ TASK C waits for TASK A to reap it.
> 
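The three dependencies listed above form a wait-for graph with a cycle, which can be checked mechanically. The following is a minimal sketch that assumes each task waits for exactly one other task; `find_wait_cycle()` and the `waits_for` mapping are hypothetical names for illustration and not part of any kernel interface.

```python
def find_wait_cycle(waits_for, start):
    """Follow 'waits for' edges from start and return the cycle as a list
    of task names, or None if the chain ends without looping back."""
    path = [start]
    seen = {start}
    current = start
    while current in waits_for:
        current = waits_for[current]
        if current in seen:
            # Close the loop by repeating the task where the cycle begins.
            return path[path.index(current):] + [current]
        path.append(current)
        seen.add(current)
    return None


waits_for = {
    # synchronize_rcu_tasks() -> synchronize_srcu(&tasks_rcu_exit_srcu)
    "TASK A": "TASK B",
    # the child reaper waits for every task in PID_NS2 to be reaped
    "TASK B": "TASK C",
    # TASK C can only be reaped by its parent
    "TASK C": "TASK A",
}

cycle = find_wait_cycle(waits_for, "TASK A")
print(" -> ".join(cycle) if cycle else "no cycle")
```

Removing any one of the three edges from the mapping makes the function return None, which matches the observation that breaking a single dependency would resolve the hang.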
> I have no idea how to solve the situation without violating the pid_namespace
> rules and unshare() semantics (although I wish unshare(CLONE_NEWPID) had a less
> error prone behaviour with allowing creating more than one task belonging to the
> same namespace).
> 
> So probably having an SRCU read side critical section within exit_notify() is
> not a good idea. Is there a solution to work around that for RCU tasks?
> 
  Thanks for the analysis!
  One more piece of information: I tried to revert only this commit on top of
  v6.1-rc5 mainline with a script, but the revert caused the kernel build to
  fail. Since the revert verification step did not pass, I cannot confirm that
  the bisect result is 100% accurate. I have just provided all the information
  I could.

  This issue is too difficult for me.
  If I find more clues, I will update this email thread.

  Thanks!
  BR.

> Thanks.