On Mon, 2024-10-28 at 17:22 -0700, Paul E. McKenney wrote:
> The result is that the current leaf rcu_node structure's ->lock is
> acquired only if a stack backtrace might be needed from the current CPU,
> and is held across only that CPU's backtrace.  As a result, if there are

After upgrading our device to kernel-6.11, we encountered a lockup
scenario under stall warning.  I had prepared a patch to submit, but I
noticed that this series already addresses some of the issues, though
it has not been merged into mainline yet.  So I decided to reply to
this series to discuss how to fix it before pushing.

Here is the lockup scenario we encountered:

Device: arm64 with only 8 cores.

One CPU holds rnp->lock in rcu_dump_cpu_stacks() while trying to dump
the other CPUs, but it waits for each corresponding CPU to dump its
backtrace, with a 10-second timeout:

__delay()
__const_udelay()
nmi_trigger_cpumask_backtrace()
arch_trigger_cpumask_backtrace()
trigger_single_cpu_backtrace()
dump_cpu_task()
rcu_dump_cpu_stacks()  <- holding rnp->lock
print_other_cpu_stall()
check_cpu_stall()
rcu_pending()
rcu_sched_clock_irq()
update_process_times()

However, the other 7 CPUs are waiting for rnp->lock on the path to
report a quiescent state:

queued_spin_lock_slowpath()
queued_spin_lock()
do_raw_spin_lock()
__raw_spin_lock_irqsave()
_raw_spin_lock_irqsave()
rcu_report_qs_rdp()
rcu_check_quiescent_state()
rcu_core()
rcu_core_si()
handle_softirqs()
__do_softirq()
____do_softirq()
call_on_irq_stack()

Since the arm64 architecture uses an IPI rather than a true NMI to
implement arch_trigger_cpumask_backtrace(), the interrupt disabling
done by spin_lock_irqsave() is enough to block this IPI request.
Therefore, if the other CPUs start waiting for the lock before
receiving the IPI, a semi-deadlock scenario like the following occurs:

CPU0                      CPU1                      CPU2
-----                     -----                     -----
lock_irqsave(rnp->lock)
                          lock_irqsave(rnp->lock)
                          <can't receive IPI>
<send ipi to CPU 1>
<wait CPU 1 for 10s>
                                                    lock_irqsave(rnp->lock)
                                                    <can't receive IPI>
<send ipi to CPU 2>
<wait CPU 2 for 10s>
...

In our scenario, with 7 CPUs to dump, the lockup takes nearly 70
seconds, preventing subsequent useful logs from being printed and
leading to a watchdog timeout and system reboot.  (The 10-second wait
per CPU comes from the polling loop in nmi_trigger_cpumask_backtrace();
see the excerpt at the end of this mail.)

This series re-acquires the lock around each dump, which significantly
reduces the lock-hold time.  However, since the lock is still held
while dumping a CPU's backtrace, there is still a chance for two CPUs
to wait on each other for 10 seconds, which is still too long.

So I would like to ask: is it necessary to dump the backtrace within
the spinlock section?  If not, especially now that a lockless check is
possible, maybe it can be changed as follows?

-	if (!(data_race(rnp->qsmask) & leaf_node_cpu_bit(rnp, cpu)))
-		continue;
-	raw_spin_lock_irqsave_rcu_node(rnp, flags);
-	if (rnp->qsmask & leaf_node_cpu_bit(rnp, cpu)) {
+	if (data_race(rnp->qsmask) & leaf_node_cpu_bit(rnp, cpu)) {
 		if (cpu_is_offline(cpu))
 			pr_err("Offline CPU %d blocking current GP.\n", cpu);
 		else
 			dump_cpu_task(cpu);
 	}
 }
-	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);

Or should this be considered an arm64 issue, meaning arm64 should
switch to a true NMI, and otherwise should not use
nmi_trigger_cpumask_backtrace() at all?
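
For reference, the 10-second wait per CPU comes from the polling loop
at the end of nmi_trigger_cpumask_backtrace().  The excerpt below is
paraphrased from my reading of lib/nmi_backtrace.c in v6.11 (the
comment is mine, and details may vary by kernel version):

	/*
	 * Tail of nmi_trigger_cpumask_backtrace(): after raise() has
	 * requested backtraces from the CPUs in backtrace_mask, poll
	 * for up to 10 * 1000 * 1ms == 10s until every target CPU has
	 * cleared its bit.  A target CPU that is spinning on
	 * rnp->lock with interrupts disabled never runs the IPI
	 * handler, never clears its bit, and so costs the full 10s.
	 */
	for (i = 0; i < 10 * 1000; i++) {
		if (cpumask_empty(to_cpumask(backtrace_mask)))
			break;
		mdelay(1);
	}

The mdelay(1) here is what shows up as __const_udelay()/__delay() at
the top of the dumping CPU's stack above.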
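
And to make the proposal concrete, here is roughly how
rcu_dump_cpu_stacks() would read with the above diff applied.  This is
an untested sketch: the loop structure follows the mainline function
as I understand it, and it deliberately omits whatever else this
series changes in the function:

static void rcu_dump_cpu_stacks(void)
{
	int cpu;
	struct rcu_node *rnp;

	rcu_for_each_leaf_node(rnp) {
		for_each_leaf_node_possible_cpu(rnp, cpu) {
			/*
			 * Lockless sample of ->qsmask.  A racy read
			 * at worst dumps a CPU that has just reported
			 * its quiescent state, or skips one that has
			 * just started blocking the grace period.  In
			 * exchange, no CPU can be left spinning on
			 * rnp->lock with interrupts disabled while we
			 * wait up to 10s for its backtrace.
			 */
			if (data_race(rnp->qsmask) & leaf_node_cpu_bit(rnp, cpu)) {
				if (cpu_is_offline(cpu))
					pr_err("Offline CPU %d blocking current GP.\n", cpu);
				else
					dump_cpu_task(cpu);
			}
		}
	}
}

To me, an occasional stale dump seems acceptable for a diagnostic
path, but maybe I am missing a reason the dump must stay consistent
with ->qsmask under the lock.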