> > ```
> > # unshare(CLONE_NEWPID | CLONE_NEWNS)
> >
> > npm start (pid 2522045)
> > |__npm run zombie (pid 2522605)
> >     |__ sh -c "while true; do echo zombie; sleep 1; done" (pid 2522869)
> > ```
>
> only 3 processes? nothing is running? Is the last process 2522869 a
> zombie too?

Yes. pid 2522045 sent SIGKILL to all the processes in that pid namespace
when it exited. The last process, 2522869, was a zombie as well.

Sometimes `npm start` can exit before `npm run zombie` forks `sh`, so you
might see only two processes in that pid namespace.

>
> Could you show your .config? In particular, CONFIG_PREEMPT...

I'm using the [6.5.0-1021-azure][1] kernel, and CONFIG_PREEMPT is disabled.
Here is the relevant part of the .config:

```
$ cat /boot/config-6.5.0-1021-azure | grep _RCU
CONFIG_TREE_RCU=y
# CONFIG_RCU_EXPERT is not set
CONFIG_TASKS_RCU_GENERIC=y
CONFIG_TASKS_RUDE_RCU=y
CONFIG_TASKS_TRACE_RCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
CONFIG_RCU_NOCB_CPU=y
# CONFIG_RCU_NOCB_CPU_DEFAULT_ALL is not set
# CONFIG_RCU_LAZY is not set
CONFIG_MMU_GATHER_RCU_TABLE_FREE=y
# CONFIG_RCU_SCALE_TEST is not set
# CONFIG_RCU_TORTURE_TEST is not set
# CONFIG_RCU_REF_SCALE_TEST is not set
CONFIG_RCU_CPU_STALL_TIMEOUT=60
CONFIG_RCU_EXP_CPU_STALL_TIMEOUT=0
CONFIG_RCU_CPU_STALL_CPUTIME=y
# CONFIG_RCU_TRACE is not set
# CONFIG_RCU_EQS_DEBUG is not set

$ cat /boot/config-6.5.0-1021-azure | grep _PREEMPT
CONFIG_PREEMPT_VOLUNTARY_BUILD=y
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
# CONFIG_PREEMPT_DYNAMIC is not set
CONFIG_HAVE_PREEMPT_DYNAMIC=y
CONFIG_HAVE_PREEMPT_DYNAMIC_CALL=y
CONFIG_PREEMPT_NOTIFIERS=y
CONFIG_DRM_I915_PREEMPT_TIMEOUT=640
CONFIG_DRM_I915_PREEMPT_TIMEOUT_COMPUTE=7500
# CONFIG_PREEMPTIRQ_DELAY_TEST is not set

$ cat /boot/config-6.5.0-1021-azure | grep HZ
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
# CONFIG_NO_HZ_IDLE is not set
CONFIG_NO_HZ_FULL=y
CONFIG_NO_HZ=y
# CONFIG_HZ_100 is not set
CONFIG_HZ_250=y
# CONFIG_HZ_300 is not set
# CONFIG_HZ_1000 is not set
CONFIG_HZ=250
CONFIG_MACHZ_WDT=m
```

>
> > The `npm start (pid 2522045)` was stuck in kernel_wait4. And its child,
>
> so this is the init task in this namespace,

Yes~

>
> > `npm run zombie (pid 2522605)`, has two threads. One of them was in D status.
> ...
> > $ sudo cat /proc/2522605/task/*/stack
> > [<0>] synchronize_rcu_expedited+0x177/0x1f0
> > [<0>] namespace_unlock+0xd6/0x1b0
> > [<0>] put_mnt_ns+0x73/0xa0
> > [<0>] free_nsproxy+0x1c/0x1b0
> > [<0>] switch_task_namespaces+0x5d/0x70
> > [<0>] exit_task_namespaces+0x10/0x20
> > [<0>] do_exit+0x2ce/0x500
> > [<0>] io_sq_thread+0x48e/0x5a0
> > [<0>] ret_from_fork+0x3c/0x60
> > [<0>] ret_from_fork_asm+0x1b/0x30
>
> so I guess this is the trace of its sub-thread 2522645.

Sorry for the unclear message. Yes~

>
> What about the process 2522605? Has it exited too?

Process 2522605 has two threads. The main thread, 2522605, was in zombie
state, so yes, it has exited as well. Only thread 2522645 was stuck in
synchronize_rcu_expedited.

>
> > > But zap_pid_ns_processes() shouldn't cause the soft-lockup, it should
> > > sleep in kernel_wait4().
> >
> > I run `cat /proc/2522045/status` and found that the status was kept switching
> > between running and sleeping.
>
> OK, this shouldn't happen in this case. So it really looks like it spins
> in a busy-wait loop because TIF_NOTIFY_SIGNAL is not cleared. It can be
> reported as sleeping because do_wait() sets/clears TASK_INTERRUPTIBLE,
> although the window is small...
>

I can reproduce this issue on v5.15, v6.1, v6.5, v6.8, v6.9 and v6.10-rc2.
All of these kernels were built with CONFIG_PREEMPT and PREEMPT_RCU disabled.
It is very easy to reproduce on v5.15.x with 8 vcores, within a few minutes.
On the other kernel versions it can take 30 minutes to a few hours.

Rachel provided a [golang-repro][2], which is similar to the docker repro.
It can be built as a static binary, which makes it easy to reproduce.

Hope this information helps.

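
To confirm the running/sleeping flapping of the init task without running `cat /proc/<pid>/status` by hand, something like the following could be used. This is only a minimal sketch and not part of the repro above: the function names, the sample count and the polling interval are illustrative assumptions, and it only relies on the `State:` line of `/proc/<pid>/status`.

```python
import sys
import time
from pathlib import Path


def parse_state(status_text):
    """Return the one-letter task state (e.g. 'R', 'S', 'D', 'Z')
    from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("State:"):
            # The line looks like "State:\tS (sleeping)"; the first
            # field after the colon is the state letter.
            return line.split(":", 1)[1].split()[0]
    raise ValueError("no State: line found")


def count_transitions(states):
    """Count how many times two consecutive samples differ."""
    return sum(1 for a, b in zip(states, states[1:]) if a != b)


def sample_states(pid, samples=200, interval=0.01):
    """Poll /proc/<pid>/status and collect one state letter per sample.
    The defaults are arbitrary illustrative values."""
    path = Path(f"/proc/{pid}/status")
    states = []
    for _ in range(samples):
        states.append(parse_state(path.read_text()))
        time.sleep(interval)
    return states


if __name__ == "__main__" and len(sys.argv) > 1:
    collected = sample_states(int(sys.argv[1]))
    print("".join(collected))
    print("transitions:", count_transitions(collected))
```

Run as root with the pid of the namespace init task as the only argument (2522045 in the trace above). A task that really sleeps in kernel_wait4() should show a steady run of `S`, while a mix of `R` and `S` with many transitions would match the busy-wait behaviour described above.
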
Thanks,
Wei

[1]: https://gist.github.com/fuweid/ae8bad349fee3e00a4f1ce82397831ac
[2]: https://github.com/rlmenge/rcu-soft-lock-issue-repro?tab=readme-ov-file#golang-repro