On Sat May 6, 2023 at 2:35 AM AEST, Doug Anderson wrote: > Hi, > > On Thu, May 4, 2023 at 7:36 PM Nicholas Piggin <npiggin@xxxxxxxxx> wrote: > > > > On Fri May 5, 2023 at 8:13 AM AEST, Douglas Anderson wrote: > > > From: Colin Cross <ccross@xxxxxxxxxxx> > > > > > > Implement a hardlockup detector that doesn't doesn't need any extra > > > arch-specific support code to detect lockups. Instead of using > > > something arch-specific we will use the buddy system, where each CPU > > > watches out for another one. Specifically, each CPU will use its > > > softlockup hrtimer to check that the next CPU is processing hrtimer > > > interrupts by verifying that a counter is increasing. > > > > Powerpc's watchdog has an SMP checker, did you see it? > > No, I wasn't aware of it. Interesting, it seems to basically enable > both types of hardlockup detectors together. If that really catches > more lockups, it seems like we could do the same thing for the buddy > system. It doesn't catch more lockups. On powerpc we don't have a reliable periodic NMI hence the SMP checker. But it is preferable that a CPU detects its own lockup because NMI IPIs can result in crashes if they are taken in certain critical sections. > If people want, I don't think it would be very hard to make > the buddy system _not_ exclusive of the perf system. Instead of having > the buddy system implement the "weak" functions I could just call the > buddy functions in the right places directly and leave the "weak" > functions for a more traditional hardlockup detector to implement. > Opinions? > > Maybe after all this lands, the powerpc watchdog could move to use the > common code? As evidenced by this patch series, there's not really a > reason for the SMP detection to be platform specific. The powerpc SMP checker could certainly move to common code if others wanted to use it. > > It's all to > > all rather than buddy which makes it more complicated but arguably > > bit better functionality. > > Can you come up with an example crash where the "all to all" would > work better than the simple buddy system provided by this patch? CPU2 CPU3 spin_lock_irqsave(A) spin_lock_irqsave(B) spin_lock_irqsave(B) spin_lock_irqsave(A) CPU1 will detect the lockup on CPU2, but CPU3's lockup won't be detected so we don't get the trace that can diagnose the bug. Another thing I actually found it useful for is you can easily see if a core (i.e., all threads in the core) or a chip has died. Maybe more useful when doing presilicon and bring up work or firmware hacking, but still useful. Thanks, Nick > It > seems like they would be equivalent, but I could be missing something. > Specifically they both need at least one non-locked-up CPU to detect a > problem. If one or more CPUs is locked up then we'll always detect it. > I suppose maybe you could provide a better error message at lockup > time saying that several CPUs were locked up and that could be > helpful. For now, I'd keep the current buddy system the way it is and > if you want to provide a patch improving things to be "all-to-all" in > the future that would be interesting to review.