Summary for the SPARC maintainers:

The NMI watchdog is firing on Sunfire 280R and Sun Blade2500 systems
with one or both processors in cheetah_xcall_deliver(). This has been
seen under 3.0, 3.2 and 3.3 and seems to be associated with disk I/O.

Full bug log is at: http://bugs.debian.org/648766

On Tue, 2012-04-03 at 20:56 -0400, Kieron Gillespie wrote:
> I have also noticed, that if I am reading the trace correctly that in
> both of my cases, and the original bug submitter's, and a bug posted on
> old.nabble.com's case the crash always seems to happen when one CPU is
> doing cheetah_xcall_deliver, and the other CPU is in the same
> instruction in tl0_irq15. Here is a link to the post.
[...]

tl0_irq15 seems to be part of the NMI watchdog (for detecting that the
kernel has hung), so you should always see that in a backtrace when the
NMI watchdog fires. It's not part of the problem.

cheetah_xcall_deliver() does appear to be relevant to the problem and it
looks like it could loop indefinitely - though presumably only if a
processor is behaving strangely? It appears to periodically enable and
disable interrupts, but then I'm not sure how the PSTATE.IE and PIL
interrupt control fields interact and I don't think this will reset the
NMI watchdog. In any case, it seems like there's a serious problem if
it's looping for a long time, whether or not interrupts remain disabled.

Ben.

--
Ben Hutchings
Larkinson's Law: All laws are basically false.