Pa-ckers,
Just for the record, I'd like to draw attention to what seems
like a pretty old bug in our IRQ code that is apparently still
affecting us.
Long story short: while trying to figure out why the recently attached
10-disk bay was killing the Debian "lafayette" autobuilder during raid
resync, I noticed that irqbalance was part of the default Debian
autobuilder setup.
The nastiness of irqbalance has been discussed before, and I
remember having had issues in the past (5+ years ago) on my parisc
machines with that daemon. I couldn't find a pointer to a mailing list
thread, and I don't remember whether I discussed it on IRC or elsewhere.
Anyway, it turned out that disabling irqbalance "fixed" the crash (and
by crash I mean HPMC). IIRC, the general idea is that when irqbalance
reroutes IRQs under heavy interrupt load, a race occurs in which an
interrupt request might end up delivered to the wrong CPU, HPMC'ing
the machine.
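For reference, on Linux the rerouting is normally done by writing a
hex CPU bitmask to /proc/irq/<n>/smp_affinity. A minimal sketch of
that, assuming this standard interface; the helper names
(affinity_mask, pin_irq) are hypothetical, and pinning an IRQ by hand
like this only illustrates the mechanism, it is not a tested fix for
the HPMC:

```python
MAX_CPU = 31  # assumption: single 32-bit mask word, no comma-separated groups


def affinity_mask(cpus):
    """Return the hex bitmask string selecting the given CPU numbers."""
    mask = 0
    for cpu in cpus:
        if not 0 <= cpu <= MAX_CPU:
            raise ValueError("CPU number out of range: %r" % (cpu,))
        mask |= 1 << cpu
    return format(mask, "x")


def pin_irq(irq, cpus, proc_root="/proc/irq"):
    """Write the mask for `cpus` to <proc_root>/<irq>/smp_affinity.

    Needs root on a real system; returns the path that was written.
    """
    path = "%s/%d/smp_affinity" % (proc_root, irq)
    with open(path, "w") as f:
        f.write(affinity_mask(cpus))
    return path
```

E.g. affinity_mask([1, 3]) gives "a", and pin_irq(n, [0]) would route
IRQ n to CPU 0 only. With irqbalance stopped, nothing rewrites these
masks at runtime, which is presumably why the race is no longer hit.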
I have no particular opinion on whether it should be expected that
something as stupid as irqbalance could crash a system, but others
seem to believe it shouldn't (claiming "it works on *real* [read: x86]
hardware").
Now, I'm quite convinced that irqbalance could be one of the (major?)
causes of instability of the parisc autobuilders. AFAIU, they've
decided to disable it on their setup, maybe the situation will improve
there. Still, irqbalance is only the messenger, and I'm wondering
whether that apparent bug in our IRQ code could also be responsible
for other issues we're still having.
It's been a very long time since I last touched that code, and tbh I
never fully mastered it anyway, but I thought it'd be a good thing to
have a record that this bug is still there, and maybe it will ring a
bell for others...
HTH
T-Bone
--
Thibaut Varène
http://www.parisc-linux.org/~varenet/