Hi,

On Mon, May 8, 2023 at 8:52 AM Doug Anderson <dianders@xxxxxxxxxxxx> wrote:
>
> Hmmm, but I don't think you really need "all-to-all" checking to get
> the stacktraces you want, do you? Each CPU can be "watching" exactly
> one other CPU, but then when we actually lock up we could check all of
> them and dump stacks on all the ones that are locked up. I think this
> would be a fairly easy improvement for the buddy system. I'll leave it
> out for now just to keep things simple for the initial landing, but it
> wouldn't be hard to add. Then I think the two SMP systems (buddy vs.
> all-to-all) would be equivalent in terms of functionality?

FWIW, I take back my "this would be fairly easy" comment. :-P ...or, at
least I'll acknowledge that the easy way has some tradeoffs. It wouldn't
be trivially easy to just snoop on the data of the other buddies because
the watching processors aren't necessarily synchronized with each other.

That being said, if someone really wanted to report on other locked CPUs
before doing a panic() and was willing to delay the panic, it probably
wouldn't be too hard to put in a mode where the CPU that detects the
first lockup could do some extra work to look for lockups. Maybe it
could send a normal IPI to other CPUs and see if they respond or maybe
it could take over monitoring all CPUs and wait one extra period.

In any case, I'm not planning on implementing this now, but at least
wanted to document thoughts. ;-)

> With my simplistic solution
> of just allowing the buddy detector to be enabled in parallel with a
> perf-based detector then we wouldn't have this level of coordination,
> but I'll assume that's OK for the initial landing.

I dug into this more as well and I also wanted to note that, at least
for now, I'm not going to include support to turn on both the buddy and
perf lockup detectors in the common core. In order to do this and not
have them stomp on each other then I think we need extra coordination or
two copies of the interrupt count / saved interrupt count and, at least
at this point in time, it doesn't seem worth it for a halfway solution.
From everything I've heard there is a push on many x86 machines to get
off the perf lockup detector anyway to free up the resources. Someone
could look at adding this complexity later.

-Doug
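
For reference, the "take over monitoring all CPUs and wait one extra
period" idea above could look roughly like the userspace sketch below.
This is only an illustration under assumptions, not kernel code: every
name in it (cpu_state, buddy_of(), timer_tick(), buddy_is_locked_up(),
snapshot_all(), stalled_since()) and the fixed NR_CPUS of 4 are
hypothetical. The point it tries to show is that the detecting CPU takes
its own private snapshot of every interrupt count rather than reading
the saved counts owned by the other watchers, since those watchers
aren't synchronized with each other.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4	/* hypothetical CPU count for this sketch */

/* Illustrative stand-in for per-CPU watchdog state (names are made up). */
struct cpu_state {
	unsigned long irq_count;	/* bumped by the CPU's own timer tick */
	unsigned long irq_count_saved;	/* last value seen by its watcher */
};

static struct cpu_state cpus[NR_CPUS];

/* Each CPU watches exactly one other CPU: the next one, wrapping around. */
static unsigned int buddy_of(unsigned int cpu)
{
	assert(cpu < NR_CPUS);
	return (cpu + 1) % NR_CPUS;
}

/* Simulated timer interrupt on a CPU that is still making progress. */
static void timer_tick(unsigned int cpu)
{
	assert(cpu < NR_CPUS);
	cpus[cpu].irq_count++;
}

/*
 * Normal buddy check, run by 'watcher' once per period. Assumes the
 * buddy has ticked at least once before the first check.
 */
static bool buddy_is_locked_up(unsigned int watcher)
{
	struct cpu_state *b = &cpus[buddy_of(watcher)];

	if (b->irq_count == b->irq_count_saved)
		return true;
	b->irq_count_saved = b->irq_count;
	return false;
}

/*
 * Extra mode, step 1: after the first lockup is seen, the detecting CPU
 * records every CPU's count in a private array. It does not touch
 * irq_count_saved, which belongs to each CPU's regular watcher.
 */
static void snapshot_all(unsigned long snap[NR_CPUS])
{
	for (unsigned int cpu = 0; cpu < NR_CPUS; cpu++)
		snap[cpu] = cpus[cpu].irq_count;
}

/*
 * Extra mode, step 2: one period later, return a bitmask of CPUs (other
 * than the detector) whose count has not advanced since the snapshot.
 */
static unsigned int stalled_since(const unsigned long snap[NR_CPUS],
				  unsigned int detector)
{
	unsigned int mask = 0;

	for (unsigned int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (cpu != detector && cpus[cpu].irq_count == snap[cpu])
			mask |= 1u << cpu;
	}
	return mask;
}
```

The cost is the one mentioned above: the panic is delayed by a full
extra period so that the private snapshot has something to compare
against.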