On Sun, Mar 19, 2017 at 03:23:39AM -0400, Joshua Kinard wrote:

> The closest I've gotten to extracting info on the state of the machine is to
> set the MSC debug switches to 0x1018 and then issue an immediate reset to have
> it drop into POD dirty-exclusive as soon as possible. Then running "why"
> sometimes nets me a valid kernel address in EPC that tells me where the POD CPU
> was last at. Downside, I have four CPUs and MSC POD locks up if I try
> switching to any of the other CPUs. So I can't get a register dump off of the
> other three.

Have you tried to send an NMI from the MSC? The PoD debugger is actually a
fairly handy tool in such cases.

> 2A 000: Done initializing klconfig.
> 2A 000: Discovering NUMAlink connectivity ......... DONE
> 2A 000: Found 2 objects (2 hubs, 0 routers) in 511413 usec
> 1B 000: Testing/Initializing memory ............... DONE
> 2A 000: Waiting for peers to complete discovery.... Reading link 0
> (addr 0x92000000
> 2A 000: 00000004) failed
> 1B 000: CPU B switching to UALIAS
> 1B 000: CPU B now running out of UALIAS
> 2A 000: Reading link 0 (addr 0x9200000000000004) failed
> 1B 000: Skipping secondary cache diags
> 1B 000: CPU B switching stack into UALIAS and invalidating D-cache
> 1B 000: CPU B switching into node 0 cached RAM
> 1B 000: CPU B running cached
> 2A 000: Reading link 0 (addr 0x9200000000000004) failed
> 2A 000: Reading link 0 (addr 0x9200000000000004) failed

I thought that kind of message indicated a hardware issue.

> Then it gets a general exception and drops to POD Dex:
> 1B 000: Local Slave : Waiting for my NASID ...
> 1B 000: CPU B switching to UALIAS
> 1B 000: CPU B running in UALIAS
> 1B 000: CPU B Flushing and invalidating caches
> 1B 000: CPU B switching to node 0 cached RAM
> 1B 000: CPU B running cached
> 1A 000:
> 1A 000: *** General Exception on node 0
> 1A 000: *** EPC: 0xc00000001fc473dc (0xc00000001fc473dc)
> 1A 000: *** Press ENTER to continue.
> 1A 000: POD MSC Dex>
>
> If this is a hardware lock up, that might explain why kgdb isn't useful at that
> point. POD lets me dump the CRBs and PI error spool, but I'm not sure how
> useful that information is w/o SGI's internal documents.

I still haven't forgotten everything (I hope), so maybe you could post that
information anyway, on the small chance there might be something useful in
there?

  Ralf