Re: IP27: Random hard locks after ~16hrs uptime

Joshua Kinard <kumba@xxxxxxxxxx> · Sun, 08 Feb 2015 19:53:41 -0500

On 02/08/2015 07:06, Maciej W. Rozycki wrote:
> On Sat, 7 Feb 2015, Joshua Kinard wrote:
> 
>> I've had my Onyx2 running quite a bit lately doing compile runs, and it seems
>> that after about ~16 hours, there's a random possibility that the machine just
>> completely stops.  No errors printed anywhere, serial becomes completely
>> unresponsive.  I have to issue a 'rst' from the MSC to bring it back up again.
> 
>  If the time spent up is always similar, then one possibility is a counter 
> wraparound or suchlike that is not handled correctly (i.e. the carry from 
> the topmost bit is not taken into account), causing a kernel deadlock.
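
For reference, the kind of bug described here can be sketched in a few lines of
C.  This is only an illustration of the general pattern, not code from the
IP27 port or any driver; the function names are made up, and the 32-bit tick
counter is an assumption.  The wrap-safe form is the same trick the kernel's
time_after() family of macros uses.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Naive check: compares the raw counter values.  If `deadline` was computed
 * just before the counter wrapped and `now` is read just after, this keeps
 * returning false, and a loop polling it spins for a very long time.
 */
bool expired_naive(uint32_t now, uint32_t deadline)
{
    return now >= deadline;
}

/*
 * Wrap-safe check: take the unsigned difference and reinterpret it as
 * signed.  Valid as long as the two values are less than 2^31 ticks apart.
 * (The unsigned-to-signed conversion is implementation-defined in C, but
 * behaves as two's complement on every compiler the kernel supports.)
 */
bool expired_wrapsafe(uint32_t now, uint32_t deadline)
{
    return (int32_t)(now - deadline) >= 0;
}
```

With now = 0x00000010 and deadline = 0xFFFFFFF0, the naive check says "not
yet" even though the deadline passed 32 ticks ago, while the wrap-safe check
reports it as expired.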

I believe I've pinned the problem down to the block I/O layer and points
beneath, such as SCSI core, qla1280, etc.  I am using an out-of-tree patch to
add the BFQ I/O scheduler in, so that may also be a cause to consider.

I had a very similar hardlock on the Octane, too, when I upgraded the RAM to
3.5GB the other day, but going back to 2GB solves the problem there.  (Octane
is, for all intents and purposes, a single-node Origin system w/ graphics
options, HEART instead of HUB, and a much simpler PROM.)  Both use the
same SCSI chip, a QLogic ISP1040B, and thus the same driver, qla1280.o.  The
difference with the Octane is that I can reproduce the hardlock on demand by
untarring a large tarball (a Gentoo stage3, to be exact).  On the Onyx2, which
has 8GB of RAM, the lock seems more random.

I'll have to reconfigure the Octane later on with 3.5GB of RAM again, and test
BFQ, CFQ, and Deadline out to see if the hardlock happens with all three.  I
know BFQ is largely derived from the CFQ code, so if the system remains stable
with Deadline, but not with CFQ or BFQ, then I know the problem is in the
CFQ-derived scheduler code.  Then, if it only happens on BFQ, I'll go pester
their upstream for debugging advice.
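
Switching between the three at runtime goes through
/sys/block/<dev>/queue/scheduler: reading it lists the available schedulers
with the active one in brackets, and writing a name selects it.  A minimal C
sketch of a helper for the test runs follows.  The function names and the
device name are hypothetical, and "bfq" will only be accepted on a kernel
carrying the out-of-tree patch.

```c
#include <stdio.h>
#include <string.h>

/*
 * Copy the bracketed (active) scheduler name out of a line such as
 * "noop deadline [cfq]".  Returns 0 on success, or -1 if there is no
 * bracketed name or it does not fit in `out`.
 */
int active_scheduler(const char *line, char *out, size_t outlen)
{
    const char *open = strchr(line, '[');
    const char *close = open ? strchr(open, ']') : NULL;
    size_t len;

    if (!open || !close)
        return -1;
    len = (size_t)(close - open - 1);
    if (len == 0 || len >= outlen)
        return -1;
    memcpy(out, open + 1, len);
    out[len] = '\0';
    return 0;
}

/*
 * Select a scheduler by writing its name to the sysfs file (needs root).
 * `dev` is a block device name such as "sda" (hypothetical here).
 */
int set_scheduler(const char *dev, const char *name)
{
    char path[128];
    FILE *f;
    int ok;

    snprintf(path, sizeof(path), "/sys/block/%s/queue/scheduler", dev);
    f = fopen(path, "w");
    if (!f)
        return -1;
    ok = fputs(name, f) >= 0;
    if (fclose(f) != 0)     /* the kernel rejects unknown names at write time */
        ok = 0;
    return ok ? 0 : -1;
}
```

The idea would be to call set_scheduler() for each of "deadline", "cfq", and
"bfq" in turn, confirm the change with active_scheduler() on the file's
contents, and then rerun the stage3 untar.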

I thought it might've been filesystem related, but the Octane uses XFS and the
Onyx2 uses Ext4, so that eliminates the filesystem layer from consideration (I
hope).  On the Onyx2, I haven't been able to trigger it on demand so far, but I
may have found a way by running e4defrag on my large /usr partition.  So if I
can pin a cause down on the Octane, I might be able to test for that same cause
on the Onyx2 as well.  Provided it doesn't eat my filesystem...

Currently trying a 3.19-rc7 kernel out to see if the effects are any different.
I also switched to compiling packages in a RAM filesystem for now.

>> It's currently got dual IP31 R14000 node boards (500MHz), and for the most
>> part, runs great (I'll regret the electric bill later...).  Clearly a bug,
>> though, but I am not sure where to start debugging on this platform to find
>> this bug, since I can't trigger it manually.  Even tried an NMI interrupt,
>> since this machine has an NMI handler in the kernel, but all that does is reset
>> the machine.
> 
>  The NMI exception is routed to the same vector reset is, firmware would 
> have to tell them apart (with the use of the CP0.Status.NMI bit) and then 
> call a handler supplied.  Perhaps there's a way to register such a handler 
> with the firmware -- does the kernel do it?  You could then use the 
> handler to examine the kernel state and perhaps dump it somehow.
> 
>  On MIPS processors an NMI or even a reset event does not clobber any 
> registers except from the CP0 ErrorEPC register, where the PC at the time 
> the event happened is stored, some bits in the CP0 Status register (ERL, 
> BEV, etc.), and of course the PC.  So alternatively does the firmware have 
> a way to dump registers on reset or NMI then somehow?
> 
>  For example R4k DECstations dump registers automatically, when the reset 
> button is pressed at a time when the machine operates normally (a power-up 
> reset can be told apart by the state of the CP0.Status.SR bit).
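
To make the bit tests above concrete, here is a rough C sketch of how a
handler at the shared reset/NMI vector could classify the event from a saved
CP0 Status value.  It is not taken from the IP27 PROM or the kernel; the enum
and function name are made up for illustration.  The masks match the ST0_*
definitions in the kernel's asm/mipsregs.h.  Reading the register itself needs
a privileged mfc0, so the sketch takes the value as an argument.

```c
#include <stdint.h>

/* CP0 Status bits, same values as ST0_* in Linux's asm/mipsregs.h. */
#define ST0_ERL 0x00000004u  /* error level, set on reset/NMI */
#define ST0_NMI 0x00080000u  /* vector entered because of an NMI */
#define ST0_SR  0x00100000u  /* soft reset (some CPUs also set it on NMI) */
#define ST0_BEV 0x00400000u  /* bootstrap exception vectors in use */

enum reset_cause { CAUSE_COLD_RESET, CAUSE_SOFT_RESET, CAUSE_NMI };

/*
 * NMI is tested first, so the result is the same whether or not the CPU
 * also sets SR when it takes an NMI.
 */
enum reset_cause classify_reset(uint32_t status)
{
    if (status & ST0_NMI)
        return CAUSE_NMI;
    if (status & ST0_SR)
        return CAUSE_SOFT_RESET;
    return CAUSE_COLD_RESET;
}
```

On the CAUSE_NMI path, ErrorEPC and the general registers should still hold
the interrupted state, which is what a dump routine would want to save before
anything else touches them.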

I only mentioned the NMI bit because IP27 does have an NMI handler in it, and I
can trigger it to dump some useful debugging information under normal
circumstances prior to the hardware reset.  But in this case, the kernel is so
dead at this point that not even the NMI handler is executing.  I suspect it's
either a total hardware lockup at some level, or something gets stuck in the
CPU so thoroughly that the CPU stops processing all interrupts.

Actually, on one use of 'nmi' from the MSC, something didn't get cleared right
in memory, so the PROM crashed while booting and the Onyx2 dropped into the POD
debugger.  I was kinda hoping NMI would put me into the POD debugger without
clearing any memory banks, but in this instance, half of the banks were cleared
before the PROM crashed.  From POD, I can inspect memory addresses (if I know
where to look), but with half the banks cleared, there really wasn't a point by
then.

-- 
Joshua Kinard
Gentoo/MIPS
kumba@xxxxxxxxxx
4096R/D25D95E3 2011-03-28

"The past tempts us, the present confuses us, the future frightens us.  And our
lives slip away, moment by moment, lost in that vast, terrible in-between."

--Emperor Turhan, Centauri Republic