Re: ARCS can't load CONFIG_DEBUG_LOCK_ALLOC kernel

Joshua Kinard <kumba@xxxxxxxxxx> · Sun, 19 Mar 2017 03:23:39 -0400

On 03/18/2017 19:42, Joshua Kinard wrote:
> 
> Futzing around with the load address on IP27 doesn't work the same as on
> Octane.  IP27 has a much smaller window of FreeMemory available versus the
> Octane, based on this dump I got out of arcload:
> 
> ARCS Memory Map
> 0x0 - 0x1000 (ExceptionBlock)
> 0x1000 - 0x2000 (SystemParameterBlock)
> 0x19000 - 0x12f0000 (FreeMemory)
> 0x12f0000 - 0x12ff000 (LoadedProgram)
> 0x12ff000 - 0x1300000 (FreeMemory)
> 0x1300000 - 0x1400000 (FirmwareTemporary)
> 0x1400000 - 0x1500000 (FreeMemory)
> 0x1500000 - 0x1800000 (FirmwareTemporary)
> 0x1800000 - 0x1a00000 (FirmwareTemporary)
> 0x1a00000 - 0x1b00000 (FirmwarePermanent)
> 0x1c00000 - 0x1e00000 (FreeMemory)
> 0x1c01000 - 0x1f66000 (FirmwareTemporary)
> 0x1f80000 - 0x1fa0000 (FirmwareTemporary)
> 
> Going by that, I was finally able to strip a kernel down small enough to
> contain both CONFIG_DEBUG_LOCK_ALLOC and the absolute bare minimum
> functionality to boot to login on IP27, and I have about ~3.5KB to spare.  The
> only thing I've seen thus far after several reboots is a single spinlock lockup
> in generic code, but that was on a kernel using my patches, and I couldn't
> reproduce it a second time.  So I'm switching to as pure of a mainline kernel
> as I can to see if I can trip things up there.
> 
> Also trying to get kgdb to work, but something isn't right with it.  Seems like
> the kgdboc= boot parameter isn't being parsed/honored, so I have to force it
> manually by writing to /sys/module/kgdboc/parameters/kgdboc before the SysRq-g
> option becomes available.  I am hoping there's nothing special I need to do to
> IOC3 to get a debugger attached and working, but we'll see.  The kdb frontend
> appears to be out of the question, as it adds ~6-7KB of extra code.
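
To put numbers on the FreeMemory windows in the quoted map, a short script can
parse the dump and size each region.  This is only a sketch: the regular
expression and the function names are illustrative, and the map text is copied
from the arcload dump above.  It takes each line at face value and does not try
to reconcile the overlapping entries near 0x1c00000.

```python
import re

# ARCS memory map as printed by arcload (copied from the quoted message).
ARCS_MAP = """\
0x0 - 0x1000 (ExceptionBlock)
0x1000 - 0x2000 (SystemParameterBlock)
0x19000 - 0x12f0000 (FreeMemory)
0x12f0000 - 0x12ff000 (LoadedProgram)
0x12ff000 - 0x1300000 (FreeMemory)
0x1300000 - 0x1400000 (FirmwareTemporary)
0x1400000 - 0x1500000 (FreeMemory)
0x1500000 - 0x1800000 (FirmwareTemporary)
0x1800000 - 0x1a00000 (FirmwareTemporary)
0x1a00000 - 0x1b00000 (FirmwarePermanent)
0x1c00000 - 0x1e00000 (FreeMemory)
0x1c01000 - 0x1f66000 (FirmwareTemporary)
0x1f80000 - 0x1fa0000 (FirmwareTemporary)
"""

# One map line: "<start> - <end> (<type>)", addresses in hex.
LINE_RE = re.compile(r"(0x[0-9a-fA-F]+) - (0x[0-9a-fA-F]+) \((\w+)\)")


def parse_map(text):
    """Return a list of (start, end, type) tuples, one per map line."""
    return [
        (int(m.group(1), 16), int(m.group(2), 16), m.group(3))
        for m in LINE_RE.finditer(text)
    ]


def free_regions(regions):
    """Keep only the (start, end) pairs marked FreeMemory."""
    return [(start, end) for start, end, kind in regions if kind == "FreeMemory"]


if __name__ == "__main__":
    for start, end in free_regions(parse_map(ARCS_MAP)):
        size_kib = (end - start) / 1024
        print(f"{start:#010x} - {end:#010x}: {size_kib:.0f} KiB")
```

The largest window is the first one, 0x19000 - 0x12f0000, which is 0x12d7000
bytes; that is the region the kernel image has to fit into.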

It looks like kgdb won't work with the IOC3 metadriver, but the existing IOC3
code in ioc3-eth.c that handles serial will work.  I was able to get gdb on my
Octane to connect to it, though one has to use ~4800 baud to make it reliable
(could be the 30ft cat5 cable I'm using that dislikes 9600 baud).
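
The manual workaround for the ignored kgdboc= parameter can be scripted.  The
sketch below writes the console spec to the sysfs parameter and then triggers
SysRq-g through /proc/sysrq-trigger.  The device name "ttyS0" in the usage note
is an assumption (the actual IOC3 serial device name is not given here), the
helper names are made up for illustration, and both writes need root on the
target machine.

```python
from pathlib import Path

# sysfs path from the message above; sysrq-trigger is the standard procfs hook.
KGDBOC_PARAM = Path("/sys/module/kgdboc/parameters/kgdboc")
SYSRQ_TRIGGER = Path("/proc/sysrq-trigger")


def kgdboc_value(device, baud):
    """Build the "<device>,<baud>" string that kgdboc expects."""
    return f"{device},{baud}"


def force_kgdboc(device, baud, param=KGDBOC_PARAM):
    """Configure kgdboc at runtime by writing to its module parameter."""
    param.write_text(kgdboc_value(device, baud))


def break_into_debugger(trigger=SYSRQ_TRIGGER):
    """Equivalent of SysRq-g: stop the kernel and wait for gdb to attach."""
    trigger.write_text("g")
```

On the target this would be used as force_kgdboc("ttyS0", 4800) followed by
break_into_debugger(), with 4800 baud matching the rate that proved reliable
over the long cable.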

It looks like whatever this deadlock issue is, it locks the kernel pretty hard.
Even after stopping with SysRq-g and then continuing via gdb, I cannot break
into the debugger at all once the deadlock happens.  Triggering an NMI via the
MSC doesn't help either: nothing gets dumped out of the kernel before the PROM
resets.

The closest I've gotten to extracting info on the state of the machine is to
set the MSC debug switches to 0x1018 and then issue an immediate reset to have
it drop into POD dirty-exclusive as soon as possible.  Then running "why"
sometimes nets me a valid kernel address in EPC that tells me where the POD CPU
last was.  The downside is that I have four CPUs, and MSC POD locks up if I try
switching to any of the others, so I can't get a register dump off of the other
three.
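
When "why" does return a usable EPC, it can be matched against the kernel's
System.map to find the enclosing function.  The following is a minimal sketch
of that lookup; the sample symbols and addresses are invented for illustration
and do not come from the IP27 kernel in question.

```python
import bisect

# Hypothetical System.map excerpt; real files use the same
# "<hex address> <type> <name>" layout.
SAMPLE_MAP = """\
a800000000020000 T func_a
a800000000020100 T func_b
a800000000020400 t func_c
"""


def load_symbols(lines):
    """Parse System.map lines into a sorted list of (address, name)."""
    symbols = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        addr, _kind, name = parts
        symbols.append((int(addr, 16), name))
    symbols.sort()
    return symbols


def resolve(symbols, epc):
    """Return (name, offset) of the last symbol at or below epc, or None."""
    addrs = [addr for addr, _name in symbols]
    i = bisect.bisect_right(addrs, epc) - 1
    if i < 0:
        return None  # epc is below the first known symbol
    addr, name = symbols[i]
    return name, epc - addr


if __name__ == "__main__":
    syms = load_symbols(SAMPLE_MAP.splitlines())
    name, offset = resolve(syms, 0xA800000000020110)
    print(f"{name}+{offset:#x}")
```

With a real kernel, load_symbols() would be fed the lines of the System.map
that matches the booted image.  The result only means something if the EPC
actually falls inside kernel text.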

Another interesting note: sometimes when this deadlock happens, a soft reset
doesn't work.  It seems like one of the HUBs is locked up, because the PROM is
unable to communicate with it:

2A 000: Done initializing klconfig.
2A 000: Discovering NUMAlink connectivity .........             DONE
2A 000: Found 2 objects (2 hubs, 0 routers) in 511413 usec
1B 000: Testing/Initializing memory ...............             DONE
2A 000: Waiting for peers to complete discovery....             Reading link 0 (addr 0x9200000000000004) failed
1B 000: CPU B switching to UALIAS
1B 000: CPU B now running out of UALIAS
2A 000: Reading link 0 (addr 0x9200000000000004) failed
1B 000: Skipping secondary cache diags
1B 000: CPU B switching stack into UALIAS and invalidating D-cache
1B 000: CPU B switching into node 0 cached RAM
1B 000: CPU B running cached
2A 000: Reading link 0 (addr 0x9200000000000004) failed
2A 000: Reading link 0 (addr 0x9200000000000004) failed

Then it gets a general exception and drops to POD Dex:
1B 000: Local Slave : Waiting for my NASID ...
1B 000: CPU B switching to UALIAS
1B 000: CPU B running in UALIAS
1B 000: CPU B Flushing and invalidating caches
1B 000: CPU B switching to node 0 cached RAM
1B 000: CPU B running cached
1A 000:
1A 000: *** General Exception on node 0
1A 000: *** EPC: 0xc00000001fc473dc (0xc00000001fc473dc)
1A 000: *** Press ENTER to continue.
1A 000: POD MSC Dex>

If this is a hardware lockup, that might explain why kgdb isn't useful at that
point.  POD lets me dump the CRBs and PI error spool, but I'm not sure how
useful that information is without SGI's internal documents.

-- 
Joshua Kinard
Gentoo/MIPS
kumba@xxxxxxxxxx
6144R/F5C6C943 2015-04-27
177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943

"The past tempts us, the present confuses us, the future frightens us.  And our
lives slip away, moment by moment, lost in that vast, terrible in-between."

--Emperor Turhan, Centauri Republic