On Thu, 2024-10-31 at 14:27 -0700, Doug Anderson wrote:
> Hi,
>
> On Wed, Oct 30, 2024 at 10:03 PM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> >
> > > > > Note that:
> > > > >
> > > > > * Switching to real NMIs is impossible on many existing arm64 CPUs. The hardware support simply isn't there. Pseudo-NMIs should work fine and are (in nearly all cases) just as good as NMIs, but they have a small performance impact. There are also compatibility issues with some pre-existing bootloaders. ...so code can't assume even Pseudo-NMI will work and needs to be able to fall back. Prior to my recent changes, arm64 CPUs wouldn't even do stacktraces in some scenarios. Now at least they fall back to regular IPIs.
> > > >
> > > > But those IPIs are of no help whatsoever when the target CPU is spinning with interrupts disabled, which is the scenario under discussion.
> > >
> > > Right that we can't trace the target CPU spinning with interrupts disabled.
> >
> > You can by invoking sched_show_task(cpu_curr(cpu)) on the CPU requesting the backtrace. The resulting backtrace will not be as good as you would get from using an NMI, but if you don't have NMIs, it is better than nothing.
> >
> > > ...but in the case where NMI isn't available it _still_ makes sense to make arch_trigger_cpumask_backtrace() fall back to IPI because:
> > >
> > > 1. There are many cases where the trigger.*cpu.*backtrace() class of functions is used to trace CPUs that _aren't_ looping with interrupts disabled. We still want to be able to backtrace in that case. For instance, you can see in "panic.c" that there are cases where trigger_all_cpu_backtrace() is called. It's valuable to make cases like this (where there's no indication that a CPU is locked) actually work.
> > >
> > > 2. Even if there's a CPU that's looping with interrupts disabled and we can't trace it because of no NMI, it could still be useful to get backtraces of other CPUs. It's possible that what the other CPUs are doing could give a hint about the CPU that's wedged. This isn't a great hint, but some info is better than no info.
> >
> > And it also makes sense for an IRQ-based backtrace to fall back to something like the aforementioned sched_show_task(cpu_curr(cpu)) if the destination CPU has interrupts disabled.
> >
> > > > Hence my suggestion that arm64, when using IRQs to request stack backtraces, have an additional short timeout (milliseconds) in order to fall back to remote stack tracing. In many cases, this fallback happens automatically, as you can see in dump_cpu_task(). If trigger_single_cpu_backtrace() returns false, the stack is dumped remotely.
> > > >
> > > I think the answer here is that the API needs to change. The whole boolean return value for trigger.*cpu.*backtrace() is mostly ignored by callers. Yes, you've pointed at one of the two places where it's not ignored and a reasonable fallback is tried, but all the other callers just silently do nothing when the function returns false. That led to many places where arm64 devices were simply not getting _any_ stacktrace.
> > >
> > > Perhaps we need some type of API that says "I actually have a fallback, so if you don't think the stacktrace will succeed then it's OK to return false".
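For reference, the automatic fallback mentioned above is roughly the current shape of dump_cpu_task(); this is paraphrased from memory, not a verbatim copy of the kernel source:

	void dump_cpu_task(int cpu)
	{
		/* Preferred path: NMI/pseudo-NMI/IPI-driven backtrace. */
		if (trigger_single_cpu_backtrace(cpu))
			return;

		/* Fallback: dump the remote CPU's current task from this CPU. */
		pr_info("Task dump for CPU %d:\n", cpu);
		sched_show_task(cpu_curr(cpu));
	}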
> > Boolean is fine for trigger_single_cpu_backtrace(), which is directed at a single CPU. And the one call to this function that does not currently check its return value could just call dump_cpu_task() instead.
> >
> > There are only 20 or so uses of all of these functions, so the API can change, give or take the pain involved in changing architecture-specific code.
>
> Right. Falling back to "sched_show_task(cpu_curr(cpu))" or something similar if the IPI doesn't go through is a good idea. Indeed, falling back to that if the NMI doesn't go through is _also_ a good idea, right? ...and we don't want to change all 20 callers to add a fallback. To that end, it feels like someone should change it so that the common code includes the fallback and we get rid of the boolean return value.
>
> > > > * Even if we decided that we absolutely had to disable stacktraces on arm64 CPUs without some type of NMI, that won't fix arm32. arm32 has been using regular IPIs for backtraces like this for many, many years. Maybe folks don't care as much about arm32 anymore, but it feels bad if we totally break it.
> > > >
> > > Because arm32 isn't used much for datacenter workloads, it will not be suffering from this issue. Not enough to have motivated anyone to fix it, anyway. In contrast, in the datacenter, CPUs really can and do have interrupts disabled for many seconds. (No, this is not a good thing, but it is common enough for us to need to avoid disabling other debugging facilities.)
> > >
> > > So it might well be that arm32 will continue to do just fine with IRQ-based stack tracing. Or maybe someday arm32 will also need to deal with this same issue. But no "maybe" for arm64, given its increasing use in the datacenter.
> > >
> > IMO the answer here is that anyone in datacenters should just turn on pseudo NMI (or NMI on newer arm64 CPUs).
> >
> > Suppose datacenters all do this. What part of this issue remains?
>
> I believe that if datacenters enable pseudo NMIs then the issue is "fixed" and thus only remains for anyone else (datacenters or other users of Linux) that doesn't turn on pseudo NMIs.
>
> > > > > IMO waiting 10 seconds for a backtrace is pretty crazy to begin with. What about just changing that globally to 1 second? If not, it doesn't feel like it would be impossibly hard to make an arch-specific callback to choose the time, and that callback could even take into account whether we managed to get an NMI. I'd be happy to review such a patch.
> > > >
> > > > Let's please keep the 10-second timeout for NMI-based backtraces.
> > > > But I take it that you are not opposed to a shorter timeout for the special case of IRQ-based stack backtrace requests?
> > > >
> > > IMO the 10 seconds is still awfully long (HW watchdogs can trigger during that time), but I'm not that upset by it. I'm OK with shorter timeouts for IRQ-based ones, though I'm not sure I'd be OK w/ timeouts measured in milliseconds unless the caller has actually said that it can handle a "false" return, or that it's more important to be quick than to wait for a long time.
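If the fallback does move into the common code as suggested, the waiting side could look something like the sketch below. This is a rough illustration with an invented name (wait_for_backtraces_with_fallback() is not an existing kernel function); it mirrors the mdelay(1) poll loop that nmi_trigger_cpumask_backtrace() already uses:

	static void wait_for_backtraces_with_fallback(cpumask_t *pending,
						      unsigned long timeout_ms)
	{
		unsigned long i;
		int cpu;

		/* Poll until every CPU has printed its own trace or we time out. */
		for (i = 0; i < timeout_ms; i++) {
			if (cpumask_empty(pending))
				return;
			mdelay(1);
		}

		/*
		 * Any CPU still pending gets a remote, lower-quality trace
		 * instead of silently nothing.
		 */
		for_each_cpu(cpu, pending) {
			pr_warn("CPU %d did not respond, dumping remotely:\n", cpu);
			sched_show_task(cpu_curr(cpu));
		}
	}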
> > If you are using NMIs, and the CPU doesn't respond in 10 seconds, something is likely broken. Maybe bad firmware or bad hardware. You are right that watchdogs might trigger, but they are likely already triggering due to the target CPU being so firmly locked up that it is not responding even to NMIs.
>
> Yeah, if NMIs aren't working then things are pretty bad. It still seems like it would be better to let Linux dump more info before a hardware watchdog triggers. For instance, it could perhaps call something akin to "sched_show_task(cpu_curr(cpu))".
>
> I don't feel super strongly about it, but in my mind even if a CPU is unresponsive to NMIs for 1-2 seconds then things are in pretty bad shape, and I'd want to start dumping some more info before the hardware watchdog kicks in and we can't dump anything.
>
> > On the other hand, when you are not using NMIs, then I agree you want a shorter timeout before remotely tracing the stack via sched_show_task(cpu_curr(cpu)). I picked a few milliseconds, but if datacenters are using real NMIs or pseudo NMIs (and thus are not being blocked by normal interrupt disabling), then I must defer to others on the default timeout. Much depends on the workload and configuration.
>
> As I said, I don't feel strongly about this, so if people want to keep the timeout, shorten it, or add a knob, any of those would be fine with me. I personally would object to a timeout that's _too_ short or a timeout that is longer than the current 10 seconds.
>
> -Doug

I hope this fix can be applied to the stable kernels. Modifying an API that is called by many architectures could make backporting difficult. How about this: we keep the original nmi_trigger_cpumask_backtrace() but add an intermediate __nmi_trigger_cpumask_backtrace() that accepts a timeout parameter:

+#define NMI_BACKTRACE_TIMEOUT (10 * 1000)

 void nmi_trigger_cpumask_backtrace(...)
+{
+	__nmi_trigger_cpumask_backtrace(..., NMI_BACKTRACE_TIMEOUT);
+}
+
+void __nmi_trigger_cpumask_backtrace(..., unsigned long timeout)
 {
 	...
-	for (i = 0; i < 10 * 1000; i++) {
+	for (i = 0; i < timeout; i++) {

Then each architecture's arch_trigger_cpumask_backtrace() can pass in its desired timeout, for example:

	/* 1 second if we fell back to IPI */
	timeout = has_nmi() ? NMI_BACKTRACE_TIMEOUT : 1000;
	__nmi_trigger_cpumask_backtrace(..., timeout);

This way, architectures that want a different timeout can set it themselves without having to implement complex logic on their own.
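Spelled out a bit more, the split would look roughly like the following. The "..." parameters above are abbreviations; the prototype below is recalled from lib/nmi_backtrace.c and may not match exactly:

	#define NMI_BACKTRACE_TIMEOUT	(10 * 1000)	/* ms; matches today's 10 s */

	/* New inner function: same body as today, poll loop bounded by "timeout". */
	void __nmi_trigger_cpumask_backtrace(const cpumask_t *mask, int exclude_cpu,
					     void (*raise)(cpumask_t *mask),
					     unsigned long timeout)
	{
		/* ...existing body, with the poll loop counting up to "timeout"... */
	}

	/*
	 * The existing entry point keeps its signature, so current callers
	 * and stable backports are untouched.
	 */
	void nmi_trigger_cpumask_backtrace(const cpumask_t *mask, int exclude_cpu,
					   void (*raise)(cpumask_t *mask))
	{
		__nmi_trigger_cpumask_backtrace(mask, exclude_cpu, raise,
						NMI_BACKTRACE_TIMEOUT);
	}

-Cheng-Jui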