Hi! > Add Eric. > > Well, due to unfortunate design zap_pid_ns_processes() can hang "forever" > if this namespace has a (zombie) task injected from the parent ns, this > task should be reaped by its parent. That zombie task was cloned by pid-1 process in that pid namespace. In my last reproduced log, the process tree in that pid namespace looks like ``` # unshare(CLONE_NEWPID | CLONE_NEWNS) npm start (pid 2522045) |__npm run zombie (pid 2522605) |__ sh -c "whle true; do echo zombie; sleep 1; done" (pid 2522869) ``` The `npm start (pid 2522045)` was stuck in kernel_wait4. And its child, `npm run zombie (pid 2522605)`, has two threads. One of them was in D status. As far as I know, pid-2522605 can't be reaped by its parent pid-2522045 until that thread returns from `synchronize_rcu_expedited`. ``` $ sudo cat /proc/2522605/task/*/stack [<0>] synchronize_rcu_expedited+0x177/0x1f0 [<0>] namespace_unlock+0xd6/0x1b0 [<0>] put_mnt_ns+0x73/0xa0 [<0>] free_nsproxy+0x1c/0x1b0 [<0>] switch_task_namespaces+0x5d/0x70 [<0>] exit_task_namespaces+0x10/0x20 [<0>] do_exit+0x2ce/0x500 [<0>] io_sq_thread+0x48e/0x5a0 [<0>] ret_from_fork+0x3c/0x60 [<0>] ret_from_fork_asm+0x1b/0x30 $ sudo cat /proc/2522605/task/2522645/status Name: iou-sqp-2522605 State: D (disk sleep) Tgid: 2522605 Ngid: 0 Pid: 2522645 PPid: 2522045 TracerPid: 0 Uid: 1000 1000 1000 1000 Gid: 1000 1000 1000 1000 FDSize: 0 Groups: 1000 NStgid: 2522605 25 NSpid: 2522645 40 NSpgid: 2522045 1 NSsid: 2522045 1 Kthread: 0 Threads: 2 SigQ: 0/128311 SigPnd: 0000000000000000 ShdPnd: 0000000000000100 SigBlk: fffffffffffbfeff SigIgn: 0000000001001000 SigCgt: 0000000000014602 CapInh: 0000000000000000 CapPrm: 0000000000000000 CapEff: 0000000000000000 CapBnd: 00000000a80425fb CapAmb: 0000000000000000 NoNewPrivs: 0 Seccomp: 2 Seccomp_filters: 1 Speculation_Store_Bypass: vulnerable SpeculationIndirectBranch: always enabled Cpus_allowed: ff Cpus_allowed_list: 0-7 Mems_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001 Mems_allowed_list: 0 voluntary_ctxt_switches: 11 nonvoluntary_ctxt_switches: 21 ``` > > But zap_pid_ns_processes() shouldn't cause the soft-lockup, it should > sleep in kernel_wait4(). I run `cat /proc/2522045/status` and found that the status was kept switching between running and sleeping. But the kernel was still reporting soft lockup. And there is log from dmesg. The CPU 5 wasn't able to report the quiescent state. It seems that [rcu_flavor_sched_clock_irq][1] wasn't able to call [run_qs][2]. ``` rcu: INFO: rcu_sched self-detected stall on CPU rcu: 5-....: (15000 ticks this GP) idle=db4c/1/0x4000000000000000 softirq=14924115/14924115 fqs=7430 rcu: hardirqs softirqs csw/system rcu: number: 0 833 0 rcu: cputime: 0 0 29996 ==> 30000(ms) rcu: (t=15003 jiffies g=44379053 q=145851 ncpus=8) CPU: 5 PID: 2522045 Comm: npm start Tainted: G L 6.5.0-1021-azure #22~22.04.1-Ubuntu Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090008 12/07/2018 RIP: 0010:_raw_spin_unlock_irqrestore+0x19/0x20 Code: cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 e5 e8 62 06 00 00 90 f7 c6 00 02 00 00 74 01 fb 5d <e9> d2 19 00 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 RSP: 0018:ffffa666c4bafc30 EFLAGS: 00000206 RAX: 0000000000000001 RBX: ffffa666c4bafcc0 RCX: 0000000000000020 RDX: ffff8a3d82130928 RSI: 0000000000000282 RDI: ffff8a3d82130920 RBP: ffffa666c4bafc48 R08: ffff8a3d82130928 R09: ffff8a3d82130928 R10: 0000000000000040 R11: 0000000000000002 R12: ffff8a3d82130920 R13: ffff8a44f3db9980 R14: ffff8a44f3db9980 R15: ffff8a44f3db9970 FS: 0000000000000000(0000) GS:ffff8a451fd40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000767ea57cc000 CR3: 00000005db436002 CR4: 0000000000370ee0 Call Trace: <IRQ> ? show_regs+0x6a/0x80 ? dump_cpu_task+0x71/0x90 ? rcu_dump_cpu_stacks+0xe8/0x180 ? print_cpu_stall+0x131/0x290 ? load_balance+0x160/0x870 ? check_cpu_stall+0x1d8/0x270 ? rcu_pending+0x32/0x1e0 ? rcu_sched_clock_irq+0x16e/0x290 ? update_process_times+0x63/0xa0 ? tick_sched_handle+0x28/0x70 ? tick_sched_timer+0x77/0x90 ? __pfx_tick_sched_timer+0x10/0x10 ? __hrtimer_run_queues+0x111/0x240 ? srso_alias_return_thunk+0x5/0x7f ? hrtimer_interrupt+0x101/0x240 ? hv_stimer0_isr+0x20/0x30 ? __sysvec_hyperv_stimer0+0x32/0x70 ? sysvec_hyperv_stimer0+0x7b/0x90 </IRQ> <TASK> ? asm_sysvec_hyperv_stimer0+0x1b/0x20 ? _raw_spin_unlock_irqrestore+0x19/0x20 ? remove_wait_queue+0x47/0x50 do_wait+0x19f/0x300 kernel_wait4+0xaf/0x150 ? __pfx_child_wait_callback+0x10/0x10 zap_pid_ns_processes+0x105/0x190 forget_original_parent+0x2e4/0x360 exit_notify+0x4a/0x210 do_exit+0x30b/0x500 ? srso_alias_return_thunk+0x5/0x7f ? wake_up_state+0x10/0x20 ? srso_alias_return_thunk+0x5/0x7f do_group_exit+0x35/0x90 __x64_sys_exit_group+0x18/0x20 x64_sys_call+0xd95/0x1ff0 do_syscall_64+0x56/0x80 ? srso_alias_return_thunk+0x5/0x7f ? handle_mm_fault+0x128/0x290 ? srso_alias_return_thunk+0x5/0x7f ? srso_alias_return_thunk+0x5/0x7f ? exit_to_user_mode_prepare+0x49/0x100 ? srso_alias_return_thunk+0x5/0x7f ? irqentry_exit_to_user_mode+0x19/0x30 ? srso_alias_return_thunk+0x5/0x7f ? irqentry_exit+0x1d/0x30 ? srso_alias_return_thunk+0x5/0x7f ? exc_page_fault+0x80/0x160 entry_SYSCALL_64_after_hwframe+0x73/0xdd RIP: 0033:0x75fce9367f8e Code: Unable to access opcode bytes at 0x75fce9367f64. RSP: 002b:00007ffc80c04b18 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7 RAX: ffffffffffffffda RBX: 000075fce672d1b0 RCX: 000075fce9367f8e RDX: 000075fce93e30c0 RSI: 0000000000000000 RDI: 0000000000000001 RBP: 00007ffc80c04b60 R08: 0000000000000024 R09: 0000000800000000 R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000000001 R13: 000075fce9081a90 R14: 0000000000000000 R15: 000075fce6771d50 </TASK> rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 5-.... } 15359 jiffies s: 90777 root: 0x20/. rcu: blocking rcu_node structures (internal RCU debug): Sending NMI from CPU 4 to CPUs 5: NMI backtrace for cpu 5 CPU: 5 PID: 2522045 Comm: npm start Tainted: G L 6.5.0-1021-azure #22~22.04.1-Ubuntu Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090008 12/07/2018 RIP: 0010:do_wait+0x11/0x300 Code: 8b 4d d4 e9 28 fd ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 <53> 48 89 fb 48 83 ec 08 48 8b 77 08 0f 1f 44 00 00 65 4c 8b 34 25 RSP: 0018:ffffa666c4bafc68 EFLAGS: 00000202 RAX: 0000000000000000 RBX: 0000000040000004 RCX: 0000000000000000 RDX: 0000000040000000 RSI: 0000000000000000 RDI: ffffa666c4bafc98 RBP: ffffa666c4bafc88 R08: ffff8a3d82130928 R09: ffff8a3d82130928 R10: 0000000000000040 R11: 0000000000000002 R12: 0000000000000000 R13: 0000000000000004 R14: 0000000000000000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff8a451fd40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000767ea57cc000 CR3: 00000005db436002 CR4: 0000000000370ee0 Call Trace: <NMI> ? show_regs+0x6a/0x80 ? nmi_cpu_backtrace+0x9c/0x100 ? nmi_cpu_backtrace_handler+0x11/0x20 ? nmi_handle+0x62/0x160 ? default_do_nmi+0x45/0x120 ? exc_nmi+0x19f/0x250 ? end_repeat_nmi+0x16/0x67 ? do_wait+0x11/0x300 ? do_wait+0x11/0x300 ? do_wait+0x11/0x300 </NMI> <TASK> kernel_wait4+0xaf/0x150 ? __pfx_child_wait_callback+0x10/0x10 zap_pid_ns_processes+0x105/0x190 forget_original_parent+0x2e4/0x360 exit_notify+0x4a/0x210 do_exit+0x30b/0x500 ? srso_alias_return_thunk+0x5/0x7f ? wake_up_state+0x10/0x20 ? srso_alias_return_thunk+0x5/0x7f do_group_exit+0x35/0x90 __x64_sys_exit_group+0x18/0x20 x64_sys_call+0xd95/0x1ff0 do_syscall_64+0x56/0x80 ? srso_alias_return_thunk+0x5/0x7f ? handle_mm_fault+0x128/0x290 ? srso_alias_return_thunk+0x5/0x7f ? srso_alias_return_thunk+0x5/0x7f ? exit_to_user_mode_prepare+0x49/0x100 ? srso_alias_return_thunk+0x5/0x7f ? irqentry_exit_to_user_mode+0x19/0x30 ? srso_alias_return_thunk+0x5/0x7f ? irqentry_exit+0x1d/0x30 ? srso_alias_return_thunk+0x5/0x7f ? exc_page_fault+0x80/0x160 entry_SYSCALL_64_after_hwframe+0x73/0xdd RIP: 0033:0x75fce9367f8e Code: Unable to access opcode bytes at 0x75fce9367f64. RSP: 002b:00007ffc80c04b18 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7 RAX: ffffffffffffffda RBX: 000075fce672d1b0 RCX: 000075fce9367f8e RDX: 000075fce93e30c0 RSI: 0000000000000000 RDI: 0000000000000001 RBP: 00007ffc80c04b60 R08: 0000000000000024 R09: 0000000800000000 R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000000001 R13: 000075fce9081a90 R14: 0000000000000000 R15: 000075fce6771d50 </TASK> watchdog: BUG: soft lockup - CPU#5 stuck for 85s! [npm start:2522045] Modules linked in: tls raw_diag unix_diag af_packet_diag netlink_diag udp_diag tcp_diag inet_diag xt_statistic xt_mark veth xt_comment xt_CHECKSUM ipt_REJECT nf_reject_ipv4 xt_nat xt_MASQUERADE nft_chain_nat nf_nat nf_conntrack_netlink xfrm_user xfrm_algo xt_addrtype br_netfilter bridge stp llc overlay binfmt_misc nls_iso8859_1 xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_owner xt_tcpudp nft_compat crct10dif_pclmul crc32_pclmul nf_tables polyval_clmulni polyval_generic ghash_clmulni_intel libcrc32c sha256_ssse3 joydev sha1_ssse3 hid_generic nfnetlink aesni_intel crypto_simd hyperv_drm cryptd hid_hyperv serio_raw drm_kms_helper hv_netvsc hid hyperv_keyboard pata_acpi drm_shmem_helper dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua sch_fq_codel msr drm efi_pstore i2c_core ip_tables x_tables autofs4 CPU: 5 PID: 2522045 Comm: npm start Tainted: G L 6.5.0-1021-azure #22~22.04.1-Ubuntu Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090008 12/07/2018 RIP: 0010:_raw_spin_unlock_irqrestore+0x19/0x20 Code: cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 e5 e8 62 06 00 00 90 f7 c6 00 02 00 00 74 01 fb 5d <e9> d2 19 00 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 RSP: 0018:ffffa666c4bafc30 EFLAGS: 00000206 RAX: 0000000000000001 RBX: ffffa666c4bafcc0 RCX: 0000000000000020 RDX: ffff8a3d82130928 RSI: 0000000000000282 RDI: ffff8a3d82130920 RBP: ffffa666c4bafc48 R08: ffff8a3d82130928 R09: ffff8a3d82130928 R10: 0000000000000040 R11: 0000000000000002 R12: ffff8a3d82130920 R13: ffff8a44f3db9980 R14: ffff8a44f3db9980 R15: ffff8a44f3db9970 FS: 0000000000000000(0000) GS:ffff8a451fd40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000767ea57cc000 CR3: 00000005db436002 CR4: 0000000000370ee0 Call Trace: <IRQ> ? show_regs+0x6a/0x80 ? watchdog_timer_fn+0x1ce/0x230 ? __pfx_watchdog_timer_fn+0x10/0x10 ? __hrtimer_run_queues+0x111/0x240 ? srso_alias_return_thunk+0x5/0x7f ? hrtimer_interrupt+0x101/0x240 ? hv_stimer0_isr+0x20/0x30 ? __sysvec_hyperv_stimer0+0x32/0x70 ? sysvec_hyperv_stimer0+0x7b/0x90 </IRQ> <TASK> ? asm_sysvec_hyperv_stimer0+0x1b/0x20 ? _raw_spin_unlock_irqrestore+0x19/0x20 ? remove_wait_queue+0x47/0x50 do_wait+0x19f/0x300 kernel_wait4+0xaf/0x150 ? __pfx_child_wait_callback+0x10/0x10 zap_pid_ns_processes+0x105/0x190 forget_original_parent+0x2e4/0x360 exit_notify+0x4a/0x210 do_exit+0x30b/0x500 ? srso_alias_return_thunk+0x5/0x7f ? wake_up_state+0x10/0x20 ? srso_alias_return_thunk+0x5/0x7f do_group_exit+0x35/0x90 __x64_sys_exit_group+0x18/0x20 x64_sys_call+0xd95/0x1ff0 do_syscall_64+0x56/0x80 ? srso_alias_return_thunk+0x5/0x7f ? handle_mm_fault+0x128/0x290 ? srso_alias_return_thunk+0x5/0x7f ? srso_alias_return_thunk+0x5/0x7f ? exit_to_user_mode_prepare+0x49/0x100 ? srso_alias_return_thunk+0x5/0x7f ? irqentry_exit_to_user_mode+0x19/0x30 ? srso_alias_return_thunk+0x5/0x7f ? irqentry_exit+0x1d/0x30 ? srso_alias_return_thunk+0x5/0x7f ? exc_page_fault+0x80/0x160 entry_SYSCALL_64_after_hwframe+0x73/0xdd RIP: 0033:0x75fce9367f8e Code: Unable to access opcode bytes at 0x75fce9367f64. RSP: 002b:00007ffc80c04b18 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7 RAX: ffffffffffffffda RBX: 000075fce672d1b0 RCX: 000075fce9367f8e RDX: 000075fce93e30c0 RSI: 0000000000000000 RDI: 0000000000000001 RBP: 00007ffc80c04b60 R08: 0000000000000024 R09: 0000000800000000 R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000000001 R13: 000075fce9081a90 R14: 0000000000000000 R15: 000075fce6771d50 </TASK> ``` > > Any chance you can test the patch below? This patch makes sense anyway, > I'll send it later. But I am not sure it can fix your problem. Sure! Will do! Thanks > > Oleg. > > diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c > index dc48fecfa1dc..25f3cf679b35 100644 > --- a/kernel/pid_namespace.c > +++ b/kernel/pid_namespace.c > @@ -218,6 +218,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns) > */ > do { > clear_thread_flag(TIF_SIGPENDING); > + clear_thread_flag(TIF_NOTIFY_SIGNAL); > rc = kernel_wait4(-1, NULL, __WALL, NULL); > } while (rc != -ECHILD); > > > Wei, Fu [1]: https://elixir.bootlin.com/linux/v6.5/source/kernel/rcu/tree_plugin.h#L964 [2]: https://elixir.bootlin.com/linux/v6.5/source/kernel/rcu/tree_plugin.h#L848