Re: ARCS can't load CONFIG_DEBUG_LOCK_ALLOC kernel

Ralf Baechle <ralf@xxxxxxxxxxxxxx> · Sun, 19 Mar 2017 09:55:04 +0100

On Sun, Mar 19, 2017 at 03:23:39AM -0400, Joshua Kinard wrote:

> The closest I've gotten to extracting info on the state of the machine is to
> set the MSC debug switches to 0x1018 and then issue an immediate reset to have
> it drop into POD dirty-exclusive as soon as possible.  Then running "why"
> sometimes nets me a valid kernel address in EPC that tells me where the POD CPU
> was last at.  Downside, I have four CPUs and MSC POD locks up if I try
> switching to any of the other CPUs.  So I can't get a register dump off of the
> other three.

Have you tried sending an NMI from the MSC?  The PoD debugger is actually
a fairly handy tool in such cases.

> 2A 000: Done initializing klconfig.
> 2A 000: Discovering NUMAlink connectivity .........             DONE
> 2A 000: Found 2 objects (2 hubs, 0 routers) in 511413 usec
> 1B 000: Testing/Initializing memory ...............             DONE
> 2A 000: Waiting for peers to complete discovery....             Reading link 0
> (addr 0x92000000
> 2A 000: 00000004) failed
> 1B 000: CPU B switching to UALIAS
> 1B 000: CPU B now running out of UALIAS
> 2A 000: Reading link 0 (addr 0x9200000000000004) failed
> 1B 000: Skipping secondary cache diags
> 1B 000: CPU B switching stack into UALIAS and invalidating D-cache
> 1B 000: CPU B switching into node 0 cached RAM
> 1B 000: CPU B running cached
> 2A 000: Reading link 0 (addr 0x9200000000000004) failed
> 2A 000: Reading link 0 (addr 0x9200000000000004) failed

I thought that kind of message indicated a hardware issue.
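
As an aid for reading the addresses in the log above, here is a minimal
sketch of how a 64-bit MIPS virtual address splits into its segment and,
for xkphys, its cache coherency attribute (CCA).  It assumes the generic
MIPS64 layout (bits 63:62 select the segment, bits 61:59 of an xkphys
address hold the CCA) and ignores the 32-bit compatibility segments; the
names `SEGMENTS` and `decode` are made up for illustration, and any
platform-specific meaning of the lower bits is not decoded.

```python
# Hypothetical helper, not taken from the kernel or the PROM.
# Assumes the generic MIPS64 segment layout; compat segments are ignored.
SEGMENTS = {0: "xuseg", 1: "xsseg", 2: "xkphys", 3: "xkseg"}

def decode(addr):
    """Return (segment name, CCA or None) for a 64-bit virtual address."""
    segment = SEGMENTS[(addr >> 62) & 0x3]   # bits 63:62 pick the segment
    # Only xkphys (unmapped) addresses carry a CCA, in bits 61:59.
    cca = (addr >> 59) & 0x7 if segment == "xkphys" else None
    return segment, cca

for a in (0x9200000000000004, 0xc00000001fc473dc):
    print(hex(a), decode(a))
```

Under those assumptions the failing link read is an xkphys access with
CCA 2 (uncached), while the EPC falls in the mapped xkseg range.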

> Then it gets a general exception and drops to POD Dex:
> 1B 000: Local Slave : Waiting for my NASID ...
> 1B 000: CPU B switching to UALIAS
> 1B 000: CPU B running in UALIAS
> 1B 000: CPU B Flushing and invalidating caches
> 1B 000: CPU B switching to node 0 cached RAM
> 1B 000: CPU B running cached
> 1A 000:
> 1A 000: *** General Exception on node 0
> 1A 000: *** EPC: 0xc00000001fc473dc (0xc00000001fc473dc)
> 1A 000: *** Press ENTER to continue.
> 1A 000: POD MSC Dex>
> 
> If this is a hardware lock up, that might explain why kgdb isn't useful at that
> point.  POD lets me dump the CRBs and PI error spool, but I'm not sure how
> useful that information is w/o SGI's internal documents.

I still haven't forgotten everything (I hope), so maybe you could post that
information anyway, on the small chance there might be something useful in
there?

  Ralf