Re: Qube2 slowly dies

"Kevin D. Kissell" <kevink@xxxxxxxxxxxxx> · Fri, 12 Jun 2009 12:45:26 -0700

Glyn Astill wrote:

    From: Kevin D. Kissell <kevink@xxxxxxxxxxxxx>

    Your description sounds an awful lot like failures I've seen when interrupts get lost or blocked for some reason (could be hardware, the kernel, or some interaction between them).  Have you looked at  to see if "Spurious" interrupts are occurring, or if the rate of serviced timer and I/O interrupts decreases or increases as the system degrades?

No, I haven't checked - but I will. What would I be looking for that would stick out as "spurious"? The type of interrupt, the quantity, or random interrupts appearing and disappearing?

There's a separate counter, and /proc/interrupts report, for spurious
interrupts.
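
To compare the counters as the system degrades, something along these lines could diff two /proc/interrupts snapshots and turn them into per-second rates. This is only a minimal sketch: the sample snapshots, IRQ numbers and device names are invented for illustration, parse_interrupts and rates are hypothetical helper names, and treating the "ERR" line as the spurious-interrupt counter is an assumption about how this particular kernel reports it.

```python
def parse_interrupts(text):
    """Return {label: total count across CPUs} for one /proc/interrupts snapshot."""
    lines = text.splitlines()
    if not lines:
        return {}
    # The header row has one "CPUn" column per CPU; use it to know how many
    # count columns follow each label.
    ncpu = max(1, len(lines[0].split()))
    counts = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "label: counts ..." row
        total, seen = 0, False
        for tok in rest.split()[:ncpu]:
            if not tok.isdigit():
                break  # reached the controller/device name columns
            total += int(tok)
            seen = True
        if seen:
            counts[label.strip()] = total
    return counts


def rates(before, after, seconds):
    """Per-second rate of each counter between two parsed snapshots."""
    return {k: (after[k] - before.get(k, 0)) / seconds for k in after}


# Invented sample data, taken (hypothetically) 60 seconds apart.
SNAP_A = """\
           CPU0
  2:          0            XT-PIC  cascade
  7:     120000              MIPS  timer
 14:       4500            XT-PIC  ide0
ERR:          0
"""

SNAP_B = """\
           CPU0
  2:          0            XT-PIC  cascade
  7:     126000              MIPS  timer
 14:       4800            XT-PIC  ide0
ERR:         42
"""

SAMPLE_RATES = rates(parse_interrupts(SNAP_A), parse_interrupts(SNAP_B), 60)
```

On the real machine the two snapshots would come from reading /proc/interrupts twice, some seconds apart.  A timer rate that falls off, or an error count that keeps climbing, as the box slows down would be the interesting signal.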

    When the system becomes unresponsive, by any chance does it "wake up" after 10-20 minutes (the time for the Count register to wrap)?

Not that I've noticed; I just see it degrade further and further until it dies over the course of an hour or so.

    If other Qube2s don't exhibit this behavior with a given Linux kernel, but yours does, and yet yours runs NetBSD OK, it suggests that there's a difference in interrupt setup/handling between the two systems that just happens to work around a hardware problem on your board.

I'm sure that's a valid possibility; however, I do have two of these machines and I have tried both with the same results.

Ah.  I had misunderstood your messages to have stated that you had one
Qube2 that exhibited the behavior while others did not.  In the actual
case, it definitely sounds like a kernel interrupt management problem,
either at the level of the interrupt controller support code or some
bit of low-level management of the Status.IM interrupt mask.  If you
can force the kernel to dump the state of the Status and Cause
registers, as well as that of whatever outboard interrupt controller is
on that thing, that would be good.  I used to have a hook in the NMI
handler of my Malta kernels for that, which was useful when I was
debugging the SMTC interrupt support, which was pretty subtle and
nasty.  That is also why this failure mode sounds vaguely familiar.  ;o)  The
interrupt ack/mask/enable machinery has changed and standardized (for
the better) since the Qube2 was a current product, and the controller
"chip" struct/functions being used may not in fact be entirely correct
for the platform, e.g. you may have non-atomic changes to interrupt
masks being done that screw up in the presence of nested service.
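
As a rough illustration of what a Status/Cause dump might show in that case, here is a small model in Python.  It is not Qube2 or Linux kernel code: decode, racy_unmask and atomic_unmask are hypothetical names, and the scenario (a nested handler unmasking line 5 while the interrupted code is in the middle of unmasking line 2) is invented.  The only hardware facts it leans on are the usual MIPS layout, with Status.IM and Cause.IP both in bits 15..8 and IE/EXL/ERL in Status bits 0..2.

```python
IM_SHIFT = 8      # Status.IM7..IM0 live in bits 15..8
IP_SHIFT = 8      # Cause.IP7..IP0 live in bits 15..8
LINE_MASK = 0xFF  # eight interrupt lines


def decode(status, cause):
    """Pull out the Status/Cause fields that matter for lost interrupts."""
    im = (status >> IM_SHIFT) & LINE_MASK
    ip = (cause >> IP_SHIFT) & LINE_MASK
    return {
        "IE": status & 1,           # global interrupt enable
        "EXL": (status >> 1) & 1,   # exception level; blocks interrupts while set
        "ERL": (status >> 2) & 1,   # error level; also blocks interrupts
        "IM": im,
        "IP": ip,
        "deliverable": ip & im,                      # pending and unmasked
        "pending_but_masked": ip & ~im & LINE_MASK,  # pending, never serviced
    }


def racy_unmask(status, line, nested_handler):
    """Non-atomic read-modify-write of the mask with a nested interrupt in between."""
    stale = status                  # 1. read the register
    live = nested_handler(stale)    # 2. nested service changes the real register
    assert live is not None         # (live is what the hardware now holds)
    return stale | (1 << (IM_SHIFT + line))  # 3. write back; step 2 is overwritten


def atomic_unmask(status, line, nested_handler):
    """Same update made indivisible: the nested handler can only run before or after."""
    status = nested_handler(status)
    return status | (1 << (IM_SHIFT + line))


def unmask_line_5(status):
    """Invented nested handler that re-enables line 5 before returning."""
    return status | (1 << (IM_SHIFT + 5))


STATUS0 = (0x80 << IM_SHIFT) | 1   # only IM7 set, IE=1 (made-up starting state)
CAUSE_IP5 = 1 << (IP_SHIFT + 5)    # line 5 asserted

RACY = decode(racy_unmask(STATUS0, 2, unmask_line_5), CAUSE_IP5)
ATOMIC = decode(atomic_unmask(STATUS0, 2, unmask_line_5), CAUSE_IP5)
```

In this model the racy path leaves line 5 in "pending_but_masked" while the atomic path leaves it in "deliverable".  A dump from the real box showing a Cause.IP bit set with the matching Status.IM bit clear (or the equivalent in the outboard controller's mask) would point the same way.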

I also had a problem back when I tried etch with the 2.6.18 kernel; in that case I saw no degraded performance at all, but after some hours of activity (anywhere between 2 and 24+) it'd just fall on its ass.

That's not a very scientific description of a failure.  I mean, did the
Qube2 literally jump off the table? ;o)

          Regards,

          Kevin K.