Hi,

The kernel version we are using is 5.4.241. We recently hit a soft lockup on the rcu_sched kthread. The kernel stack is as follows:

crash> bt
PID: 10  TASK: ffff8a83cc610000  CPU: 14  COMMAND: "rcu_sched"
 #0 [ffffc90000a64d50] machine_kexec at ffffffff8105714c
 #1 [ffffc90000a64da0] __crash_kexec at ffffffff8112c9bd
 #2 [ffffc90000a64e68] panic at ffffffff819e29d8
 #3 [ffffc90000a64ee8] watchdog_timer_fn.cold.9 at ffffffff819e9a3f
 #4 [ffffc90000a64f18] __hrtimer_run_queues at ffffffff8110b5d0
 #5 [ffffc90000a64f78] hrtimer_interrupt at ffffffff8110bdf0
 #6 [ffffc90000a64fd8] smp_apic_timer_interrupt at ffffffff81c0268a
 #7 [ffffc90000a64ff0] apic_timer_interrupt at ffffffff81c01c0f
--- <IRQ stack> ---
 #8 [ffffc9000006bdd8] apic_timer_interrupt at ffffffff81c01c0f
    [exception RIP: _raw_spin_unlock_irqrestore+14]
 #9 [ffffc9000006be80] force_qs_rnp at ffffffff810fa0b0
#10 [ffffc9000006beb8] rcu_gp_kthread at ffffffff810fd9b6
#11 [ffffc9000006bf10] kthread at ffffffff8109e40c
#12 [ffffc9000006bf50] ret_from_fork at ffffffff81c00255

It seems that a CPU did not report its quiescent state (QS) in time. From the vmcore I can see that CPU 141 is the one that did not report its QS.

The task currently running on CPU 141 is shown below; it looks like CPU 141 has itself detected the RCU stall and is printing the stall report to the serial console:

crash> bt -c 141
PID: 594039  TASK: ffff8a699dab8000  CPU: 141  COMMAND: "stress-ng-zombi"
 #0 [fffffe0001f70e40] crash_nmi_callback at ffffffff8104b54f
 #1 [fffffe0001f70e60] nmi_handle at ffffffff8101cf92
 #2 [fffffe0001f70eb8] default_do_nmi at ffffffff8101d16e
 #3 [fffffe0001f70ed8] do_nmi at ffffffff8101d351
 #4 [fffffe0001f70ef0] end_repeat_nmi at ffffffff81c015f3
    [exception RIP: vprintk_emit+492]
--- <NMI exception stack> ---
 #5 [ffffc90002038db0] vprintk_emit at ffffffff810eb19c
 #6 [ffffc90002038e00] printk at ffffffff819e6544
 #7 [ffffc90002038e60] rcu_check_gp_kthread_starvation at ffffffff819e78bd
 #8 [ffffc90002038e88] rcu_sched_clock_irq.cold.84 at ffffffff819e8021
 #9 [ffffc90002038ed0] update_process_times at ffffffff8110a8e4
#10 [ffffc90002038ee0] tick_sched_handle at ffffffff8111be22
#11 [ffffc90002038ef8] tick_sched_timer at ffffffff8111c147
#12 [ffffc90002038f18] __hrtimer_run_queues at ffffffff8110b5d0
#13 [ffffc90002038f78] hrtimer_interrupt at ffffffff8110bdf0
#14 [ffffc90002038fd8] smp_apic_timer_interrupt at ffffffff81c0268a
#15 [ffffc90002038ff0] apic_timer_interrupt at ffffffff81c01c0f
--- <IRQ stack> ---
#16 [ffffc90004227c98] apic_timer_interrupt at ffffffff81c01c0f
    [exception RIP: __rb_erase_color+34]
#17 [ffffc90004227d78] unlink_file_vma at ffffffff81241e0b
#18 [ffffc90004227da0] free_pgtables at ffffffff8123708e
#19 [ffffc90004227dd8] exit_mmap at ffffffff81243ea1
#20 [ffffc90004227e78] mmput at ffffffff81074fd4
#21 [ffffc90004227e90] do_exit at ffffffff8107d50c
#22 [ffffc90004227f08] do_group_exit at ffffffff8107de6a
#23 [ffffc90004227f30] __x64_sys_exit_group at ffffffff8107dee4
#24 [ffffc90004227f38] do_syscall_64 at ffffffff81002535
#25 [ffffc90004227f50] entry_SYSCALL_64_after_hwframe at ffffffff81c000a4

I think the sequence of events is as follows:

- The rcu_sched kthread needs to acquire the lock of the rcu_node that CPU 141 belongs to in order to force a QS on it.
- However, CPU 141 has itself detected the RCU stall and holds that rcu_node lock while printing the stall report to the serial console.
- CPU 141 does call touch_nmi_watchdog() while printing, since printing serial console logs could otherwise trigger a soft or hard lockup warning on the printing CPU itself.
- However, this does not take other CPUs into account: CPU 14, for example, may be blocked precisely because CPU 141 is holding the rcu_node lock.
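If my reading is correct, the problematic pattern on CPU 141 is roughly the following (pseudocode of my understanding of the stall-report path, not the literal 5.4 code):

	raw_spin_lock_irqsave_rcu_node(rnp, flags);
	touch_nmi_watchdog();	/* protects only this CPU from the watchdog */
	/* slow serial output while rnp->lock is held */
	pr_err("INFO: %s self-detected stall on CPU\n", rcu_state.name);
	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);

Meanwhile rcu_sched is spinning on the same rnp->lock in force_qs_rnp(), and nothing touches the watchdog on its behalf.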
Printing serial console logs is an expensive operation. Could we have CPU 141 only collect the information it needs to print while it is inside the rcu_node lock critical section, and then print the collected log after it has released the rcu_node lock? I am not sure whether there is a better way to solve this problem. A rough sketch of what I have in mind is below.
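To make the idea concrete (a minimal sketch only; struct stall_snapshot, collect_stall_info() and print_stall_info() are made-up names for illustration, not existing kernel APIs; only the *_rcu_node lock helpers are real):

	struct stall_snapshot snap;
	unsigned long flags;

	/* Only gather data under the rcu_node lock; no printk here. */
	raw_spin_lock_irqsave_rcu_node(rnp, flags);
	collect_stall_info(rnp, &snap);
	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);

	/*
	 * The slow serial output happens with the lock already dropped,
	 * so rcu_sched can acquire rnp->lock and force the QS.
	 */
	print_stall_info(&snap);

This way the time spent in vprintk_emit() would no longer extend the rnp->lock hold time.

Regards,
Meng En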