On Thu, 2024-10-31 at 14:27 -0700, Doug Anderson wrote:
> Hi,
>
> On Wed, Oct 30, 2024 at 10:03 PM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> >
> > > > > Note that:
> > > > >
> > > > > * Switching to real NMIs is impossible on many existing arm64 CPUs. The hardware support simply isn't there. Pseudo-NMIs should work fine and are (in nearly all cases) just as good as NMIs, but they have a small performance impact. There are also compatibility issues with some pre-existing bootloaders. ...so code can't assume even Pseudo-NMI will work and needs to be able to fall back. Prior to my recent changes, arm64 CPUs wouldn't even do stacktraces in some scenarios. Now at least they fall back to regular IPIs.
> > > >
> > > > But those IPIs are of no help whatsoever when the target CPU is spinning with interrupts disabled, which is the scenario under discussion.
> > >
> > > Right that we can't trace the target CPU spinning with interrupts disabled.
> >
> > You can by invoking sched_show_task(cpu_curr(cpu)) on the CPU requesting the backtrace. The resulting backtrace will not be as good as you would get from using an NMI, but if you don't have NMIs, it is better than nothing.
> >
> > > ...but in the case where NMI isn't available it _still_ makes sense to make arch_trigger_cpumask_backtrace() fall back to IPI because:
> > >
> > > 1. There are many cases where the trigger.*cpu.*backtrace() class of functions is used to trace CPUs that _aren't_ looping with interrupts disabled. We still want to be able to backtrace in that case. For instance, you can see in "panic.c" that there are cases where trigger_all_cpu_backtrace() is called. It's valuable to make cases like this (where there's no indication that a CPU is locked) actually work.
> > >
> > > 2. Even if there's a CPU that's looping with interrupts disabled and we can't trace it because of no NMI, it could still be useful to get backtraces of other CPUs. It's possible that what the other CPUs are doing could give a hint about the CPU that's wedged. This isn't a great hint, but some info is better than no info.
> >
> > And it also makes sense for an IRQ-based backtrace to fall back to something like the aforementioned sched_show_task(cpu_curr(cpu)) if the destination CPU has interrupts disabled.
> >
> > > > Hence my suggestion that arm64, when using IRQs to request stack backtraces, have an additional short timeout (milliseconds) in order to fall back to remote stack tracing. In many cases, this fallback happens automatically, as you can see in dump_cpu_task(). If trigger_single_cpu_backtrace() returns false, the stack is dumped remotely.
> > > >
> > > I think the answer here is that the API needs to change. The whole boolean return value for trigger.*cpu.*backtrace() is mostly ignored by callers. Yes, you've pointed at one of the two places where it's not ignored and a reasonable fallback is tried, but all the other callers just silently do nothing when the function returns false. That led to many places where arm64 devices were simply not getting _any_ stacktrace.
> > >
> > > Perhaps we need some type of API that says "I actually have a fallback, so if you don't think the stacktrace will succeed then it's OK to return false".
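For reference, the automatic fallback mentioned above is roughly the current shape of dump_cpu_task(); this is paraphrased from memory, not a verbatim copy of the kernel source:

	void dump_cpu_task(int cpu)
	{
		/* Preferred path: NMI/pseudo-NMI/IPI-driven backtrace. */
		if (trigger_single_cpu_backtrace(cpu))
			return;

		/* Fallback: dump the remote CPU's current task from this CPU. */
		pr_info("Task dump for CPU %d:\n", cpu);
		sched_show_task(cpu_curr(cpu));
	}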
> > Boolean is fine for trigger_single_cpu_backtrace(), which is directed at a single CPU. And the one call to this function that does not currently check its return value could just call dump_cpu_task() instead.
> >
> > There are only 20 or so uses of all of these functions, so the API can change, give or take the pain involved in changing architecture-specific code.
>
> Right. Falling back to "sched_show_task(cpu_curr(cpu))" or something similar if the IPI doesn't go through is a good idea. Indeed, falling back to that if the NMI doesn't go through is _also_ a good idea, right? ...and we don't want to change all 20 callers to add a fallback. To that end, it feels like someone should change it so that the common code includes the fallback and we get rid of the boolean return value.
>
> > > > * Even if we decided that we absolutely had to disable stacktraces on arm64 CPUs without some type of NMI, that won't fix arm32. arm32 has been using regular IPIs for backtraces like this for many, many years. Maybe folks don't care as much about arm32 anymore, but it feels bad if we totally break it.
> > > >
> > > Because arm32 isn't used much for datacenter workloads, it will not be suffering from this issue. Not enough to have motivated anyone to fix it, anyway. In contrast, in the datacenter, CPUs really can and do have interrupts disabled for many seconds. (No, this is not a good thing, but it is common enough for us to need to avoid disabling other debugging facilities.)
> > >
> > > So it might well be that arm32 will continue to do just fine with IRQ-based stack tracing. Or maybe someday arm32 will also need to deal with this same issue. But no "maybe" for arm64, given its increasing use in the datacenter.
> > >
> > IMO the answer here is that anyone in datacenters should just turn on pseudo NMI (or NMI on newer arm64 CPUs).
> >
> > Suppose datacenters all do this. What part of this issue remains?
>
> I believe that if datacenters enable pseudo NMIs then the issue is "fixed" and thus only remains for anyone else (datacenters or other users of Linux) that doesn't turn on pseudo NMIs.
>
> > > > > IMO waiting 10 seconds for a backtrace is pretty crazy to begin with. What about just changing that globally to 1 second? If not, it doesn't feel like it would be impossibly hard to make an arch-specific callback to choose the time, and that callback could even take into account whether we managed to get an NMI. I'd be happy to review such a patch.
> > > >
> > > > Let's please keep the 10-second timeout for NMI-based backtraces.
> > > > But I take it that you are not opposed to a shorter timeout for the special case of IRQ-based stack backtrace requests?
> > > >
> > > IMO the 10 seconds is still awfully long (HW watchdogs can trigger during that time), but I'm not that upset by it. I'm OK with shorter timeouts for IRQ-based ones, though I'm not sure I'd be OK w/ timeouts measured in milliseconds unless the caller has actually said that it can handle a "false" return, or that it's more important to be quick than to wait for a long time.
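If the fallback does move into the common code as suggested, the waiting side could look something like the sketch below. This is a rough illustration with an invented name (wait_for_backtraces_with_fallback() is not an existing kernel function); it mirrors the mdelay(1) poll loop that nmi_trigger_cpumask_backtrace() already uses:

	static void wait_for_backtraces_with_fallback(cpumask_t *pending,
						      unsigned long timeout_ms)
	{
		unsigned long i;
		int cpu;

		/* Poll until every CPU has printed its own trace or we time out. */
		for (i = 0; i < timeout_ms; i++) {
			if (cpumask_empty(pending))
				return;
			mdelay(1);
		}

		/*
		 * Any CPU still pending gets a remote, lower-quality trace
		 * instead of silently nothing.
		 */
		for_each_cpu(cpu, pending) {
			pr_warn("CPU %d did not respond, dumping remotely:\n", cpu);
			sched_show_task(cpu_curr(cpu));
		}
	}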
> > If you are using NMIs, and the CPU doesn't respond in 10 seconds, something is likely broken. Maybe bad firmware or bad hardware. You are right that watchdogs might trigger, but they are likely already triggering due to the target CPU being so firmly locked up that it is not responding even to NMIs.
>
> Yeah, if NMIs aren't working then things are pretty bad. It still seems like it would be better to let Linux dump more info before a hardware watchdog triggers. For instance, it could perhaps call something akin to "sched_show_task(cpu_curr(cpu))".
>
> I don't feel super strongly about it, but in my mind even if a CPU is unresponsive to NMIs for 1-2 seconds then things are in pretty bad shape, and I'd want to start dumping some more info before the hardware watchdog kicks in and we can't dump anything.
>
> > On the other hand, when you are not using NMIs, then I agree you want a shorter timeout before remotely tracing the stack via sched_show_task(cpu_curr(cpu)). I picked a few milliseconds, but if datacenters are using real NMIs or pseudo NMIs (and thus are not being blocked by normal interrupt disabling), then I must defer to others on the default timeout. Much depends on the workload and configuration.
>
> As I said, I don't feel strongly about this, so if people want to keep the timeout, shorten it, or add a knob, any of those would be fine with me. I personally would object to a timeout that's _too_ short or a timeout that is longer than the current 10 seconds.
>
> -Doug

I hope this fix can be applied to the stable kernels. Modifying an API that is called by many architectures could make backporting difficult. How about this: we keep the original nmi_trigger_cpumask_backtrace() but add an intermediate __nmi_trigger_cpumask_backtrace() that accepts a timeout parameter:

+#define NMI_BACKTRACE_TIMEOUT (10 * 1000)

 void nmi_trigger_cpumask_backtrace(...)
+{
+	__nmi_trigger_cpumask_backtrace(..., NMI_BACKTRACE_TIMEOUT);
+}
+
+void __nmi_trigger_cpumask_backtrace(..., unsigned long timeout)
 {
 	...
-	for (i = 0; i < 10 * 1000; i++) {
+	for (i = 0; i < timeout; i++) {

Then each architecture's arch_trigger_cpumask_backtrace() can pass in its desired timeout, for example:

	/* 1 second if we fell back to IPI */
	timeout = has_nmi() ? NMI_BACKTRACE_TIMEOUT : 1000;
	__nmi_trigger_cpumask_backtrace(..., timeout);

This way, architectures that want a different timeout can set it themselves without having to implement complex logic on their own.
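Spelled out a bit more, the split would look roughly like the following. The "..." parameters above are abbreviations; the prototype below is recalled from lib/nmi_backtrace.c and may not match exactly:

	#define NMI_BACKTRACE_TIMEOUT	(10 * 1000)	/* ms; matches today's 10 s */

	/* New inner function: same body as today, poll loop bounded by "timeout". */
	void __nmi_trigger_cpumask_backtrace(const cpumask_t *mask, int exclude_cpu,
					     void (*raise)(cpumask_t *mask),
					     unsigned long timeout)
	{
		/* ...existing body, with the poll loop counting up to "timeout"... */
	}

	/*
	 * The existing entry point keeps its signature, so current callers
	 * and stable backports are untouched.
	 */
	void nmi_trigger_cpumask_backtrace(const cpumask_t *mask, int exclude_cpu,
					   void (*raise)(cpumask_t *mask))
	{
		__nmi_trigger_cpumask_backtrace(mask, exclude_cpu, raise,
						NMI_BACKTRACE_TIMEOUT);
	}

-Cheng-Jui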