On Mon, 2024-10-28 at 17:22 -0700, Paul E. McKenney wrote:
> The result is that the current leaf rcu_node structure's ->lock is
> acquired only if a stack backtrace might be needed from the current CPU,
> and is held across only that CPU's backtrace.  As a result, if there are

After upgrading our device to kernel-6.11, we encountered a lockup
scenario under stall warning.  I had prepared a patch to submit, but I
noticed that this series already addresses some of the issues, though
it has not been merged into mainline yet.  So I decided to reply to
this series to discuss how to fix it before pushing.

Here is the lockup scenario we encountered:

Device: arm64 with only 8 cores.

One CPU holds rnp->lock in rcu_dump_cpu_stacks() while trying to dump
the other CPUs, but it waits for each corresponding CPU to dump its
backtrace, with a 10-second timeout:

__delay()
__const_udelay()
nmi_trigger_cpumask_backtrace()
arch_trigger_cpumask_backtrace()
trigger_single_cpu_backtrace()
dump_cpu_task()
rcu_dump_cpu_stacks()  <- holding rnp->lock
print_other_cpu_stall()
check_cpu_stall()
rcu_pending()
rcu_sched_clock_irq()
update_process_times()

However, the other 7 CPUs are waiting for rnp->lock on the path to
report a quiescent state:

queued_spin_lock_slowpath()
queued_spin_lock()
do_raw_spin_lock()
__raw_spin_lock_irqsave()
_raw_spin_lock_irqsave()
rcu_report_qs_rdp()
rcu_check_quiescent_state()
rcu_core()
rcu_core_si()
handle_softirqs()
__do_softirq()
____do_softirq()
call_on_irq_stack()

Since the arm64 architecture uses an IPI rather than a true NMI to
implement arch_trigger_cpumask_backtrace(), the interrupt disabling
done by spin_lock_irqsave() is enough to block this IPI request.
Therefore, if the other CPUs start waiting for the lock before
receiving the IPI, a semi-deadlock scenario like the following occurs:

CPU0                      CPU1                      CPU2
-----                     -----                     -----
lock_irqsave(rnp->lock)
                          lock_irqsave(rnp->lock)
                          <can't receive IPI>
<send ipi to CPU 1>
<wait CPU 1 for 10s>
                                                    lock_irqsave(rnp->lock)
                                                    <can't receive IPI>
<send ipi to CPU 2>
<wait CPU 2 for 10s>
...

In our scenario, with 7 CPUs to dump, the lockup takes nearly 70
seconds, preventing subsequent useful logs from being printed and
leading to a watchdog timeout and system reboot.  (The 10-second wait
per CPU comes from the polling loop in nmi_trigger_cpumask_backtrace();
see the excerpt at the end of this mail.)

This series re-acquires the lock around each dump, which significantly
reduces the lock-hold time.  However, since the lock is still held
while dumping a CPU's backtrace, there is still a chance for two CPUs
to wait on each other for 10 seconds, which is still too long.

So I would like to ask: is it necessary to dump the backtrace within
the spinlock section?  If not, especially now that a lockless check is
possible, maybe it can be changed as follows?

-	if (!(data_race(rnp->qsmask) & leaf_node_cpu_bit(rnp, cpu)))
-		continue;
-	raw_spin_lock_irqsave_rcu_node(rnp, flags);
-	if (rnp->qsmask & leaf_node_cpu_bit(rnp, cpu)) {
+	if (data_race(rnp->qsmask) & leaf_node_cpu_bit(rnp, cpu)) {
 		if (cpu_is_offline(cpu))
 			pr_err("Offline CPU %d blocking current GP.\n", cpu);
 		else
 			dump_cpu_task(cpu);
 	}
 }
-	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);

Or should this be considered an arm64 issue, meaning arm64 should
switch to a true NMI, and otherwise should not use
nmi_trigger_cpumask_backtrace() at all?
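
For reference, the 10-second wait per CPU comes from the polling loop
at the end of nmi_trigger_cpumask_backtrace().  The excerpt below is
paraphrased from my reading of lib/nmi_backtrace.c in v6.11 (the
comment is mine, and details may vary by kernel version):

	/*
	 * Tail of nmi_trigger_cpumask_backtrace(): after raise() has
	 * requested backtraces from the CPUs in backtrace_mask, poll
	 * for up to 10 * 1000 * 1ms == 10s until every target CPU has
	 * cleared its bit.  A target CPU that is spinning on
	 * rnp->lock with interrupts disabled never runs the IPI
	 * handler, never clears its bit, and so costs the full 10s.
	 */
	for (i = 0; i < 10 * 1000; i++) {
		if (cpumask_empty(to_cpumask(backtrace_mask)))
			break;
		mdelay(1);
	}

The mdelay(1) here is what shows up as __const_udelay()/__delay() at
the top of the dumping CPU's stack above.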
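
And to make the proposal concrete, here is roughly how
rcu_dump_cpu_stacks() would read with the above diff applied.  This is
an untested sketch: the loop structure follows the mainline function
as I understand it, and it deliberately omits whatever else this
series changes in the function:

static void rcu_dump_cpu_stacks(void)
{
	int cpu;
	struct rcu_node *rnp;

	rcu_for_each_leaf_node(rnp) {
		for_each_leaf_node_possible_cpu(rnp, cpu) {
			/*
			 * Lockless sample of ->qsmask.  A racy read
			 * at worst dumps a CPU that has just reported
			 * its quiescent state, or skips one that has
			 * just started blocking the grace period.  In
			 * exchange, no CPU can be left spinning on
			 * rnp->lock with interrupts disabled while we
			 * wait up to 10s for its backtrace.
			 */
			if (data_race(rnp->qsmask) & leaf_node_cpu_bit(rnp, cpu)) {
				if (cpu_is_offline(cpu))
					pr_err("Offline CPU %d blocking current GP.\n", cpu);
				else
					dump_cpu_task(cpu);
			}
		}
	}
}

To me, an occasional stale dump seems acceptable for a diagnostic
path, but maybe I am missing a reason the dump must stay consistent
with ->qsmask under the lock.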