Glyn Astill wrote:
> From: Kevin D. Kissell <kevink@xxxxxxxxxxxxx>
>> Your description sounds an awful lot like failures I've seen when interrupts get lost or blocked for some reason (could be hardware, the kernel, or some interaction between them). Have you looked to see if "Spurious" interrupts are occurring, or if the rate of serviced timer and I/O interrupts decreases or increases as the system degrades?
>
> No, I haven't checked, but I will. What would I be looking for that would stick out as "spurious"? The type of interrupt, the quantity, or random interrupts appearing and disappearing?

There's a separate counter, and /proc/interrupts report, for spurious interrupts.

>> When the system becomes unresponsive, by any chance does it "wake up" after 10-20 minutes (the time for the Count register to wrap)?
>
> Not that I've noticed. I just see it degrade further and further until it dies over the course of an hour or so.
>
>> If other Qube2s don't exhibit this behavior with a given Linux kernel, but yours does, and yet yours runs NetBSD OK, it suggests that there's a difference in interrupt setup/handling between the two systems that just happens to work around a hardware problem on your board.
>
> I'm sure that's a valid possibility, however I do have two of these machines and I have tried both with the same results.

Ah. I had misunderstood your messages to have stated that you had one Qube2 that exhibited the behavior while others did not. In the actual case, it definitely sounds like a kernel interrupt management problem, either at the level of the interrupt controller support code or some bit of low-level management of the Status.IM interrupt mask. If you can force the kernel to dump the state of the Status and Cause registers, as well as that of whatever outboard interrupt controller is on that thing, that would be good. I used to have a hook in the NMI handler of my Malta kernels for that, which was useful when I was debugging the SMTC interrupt support, which was pretty subtle and nasty. That's also why this failure mode sounds vaguely familiar. ;o)

The interrupt ack/mask/enable machinery has changed and standardized (for the better) since the Qube2 was a current product, and the controller "chip" struct/functions being used may not in fact be entirely correct for the platform, e.g. you may have non-atomic changes to interrupt masks being done that screw up in the presence of nested service.

> I also had a problem back when I tried etch with the 2.6.18 kernel, however in this case I saw no degraded performance at all, however after some hours of activity (anywhere between 2 and 24+) it'd just fall on its ass.

That's not a very scientific description of a failure. I mean, did the Qube2 literally jump off the table? ;o)

Regards,

Kevin K.
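
For anyone wanting to run the /proc/interrupts check discussed above, here is a minimal sketch, not something from the original thread. It takes two snapshots of /proc/interrupts and reports the per-second rate for each line, so a timer or I/O interrupt rate that drops (or a spurious count that climbs) as the system degrades becomes visible. The function names are hypothetical, and the assumption that the spurious counter shows up under a label such as "ERR" should be checked against the output on the actual machine.

```python
import time


def parse_interrupts(text):
    """Parse /proc/interrupts text into {label: count summed over all CPUs}."""
    counts = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # the header row listing the CPUs has no colon
        label, rest = line.split(":", 1)
        total = 0
        for field in rest.split():
            if not field.isdigit():
                break  # counts end where controller/device names begin
            total += int(field)
        counts[label.strip()] = total
    return counts


def rates(before, after, seconds):
    """Interrupts per second for every label present in both snapshots."""
    return {
        label: (after[label] - before[label]) / seconds
        for label in after
        if label in before
    }


def sample(path="/proc/interrupts", interval=5.0):
    """Read the file twice, `interval` seconds apart, and return the rates."""
    with open(path) as f:
        first = parse_interrupts(f.read())
    time.sleep(interval)
    with open(path) as f:
        second = parse_interrupts(f.read())
    return rates(first, second, interval)
```

Calling sample() periodically and logging the result while the machine degrades would show whether the timer and I/O lines slow down or stop, and whether the spurious counter (assumed here to be a line like "ERR") is increasing.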