On 03/18/2017 19:42, Joshua Kinard wrote:
>
> Futzing around with the load address on IP27 doesn't work the same as on
> Octane. IP27 has a much smaller window of FreeMemory available versus the
> Octane, based on this dump I got out of arcload:
>
> ARCS Memory Map
>     0x0 - 0x1000 (ExceptionBlock)
>     0x1000 - 0x2000 (SystemParameterBlock)
>     0x19000 - 0x12f0000 (FreeMemory)
>     0x12f0000 - 0x12ff000 (LoadedProgram)
>     0x12ff000 - 0x1300000 (FreeMemory)
>     0x1300000 - 0x1400000 (FirmwareTemporary)
>     0x1400000 - 0x1500000 (FreeMemory)
>     0x1500000 - 0x1800000 (FirmwareTemporary)
>     0x1800000 - 0x1a00000 (FirmwareTemporary)
>     0x1a00000 - 0x1b00000 (FirmwarePermanent)
>     0x1c00000 - 0x1e00000 (FreeMemory)
>     0x1c01000 - 0x1f66000 (FirmwareTemporary)
>     0x1f80000 - 0x1fa0000 (FirmwareTemporary)
>
> Going by that, I was finally able to strip a kernel down small enough to
> contain both CONFIG_DEBUG_LOCK_ALLOC and the absolute bare minimum
> functionality to boot to login on IP27, and I have about ~3.5KB to spare. The
> only thing I've seen thus far after several reboots is a single spinlock lockup
> in generic code, but that was on a kernel using my patches, and I couldn't
> reproduce it a second time. So I'm switching to as pure of a mainline kernel
> as I can to see if I can trip things up there.
>
> Also trying to get kgdb to work, but something isn't right with it. Seems like
> the kgdboc= boot parameter isn't being parsed/honored, so I have to force it
> manually by writing to /sys/module/kgdboc/parameters/kgdboc before the SysRq-g
> option becomes available. I am hoping there's nothing special I need to do to
> IOC3 to get a debugger attached and working, but we'll see. The kdb frontend
> appears to be out of the question, as it adds ~6-7KB of extra code.

It looks like kgdb won't work with the IOC3 metadriver, but the existing IOC3
code in ioc3-eth.c that handles serial will work.
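
For anyone who wants to repeat the manual kgdboc workaround mentioned in the
quoted text, here is a rough sketch of the two steps in Python. The sysfs and
procfs paths are the standard ones; the console name "ttyS0" and the helper
names arm_kgdboc() and break_into_debugger() are assumptions made for
illustration, not something taken from the IP27 setup itself, so adjust them
to whatever the serial port is called on your kernel.

```python
from pathlib import Path

# Standard locations on a kernel built with kgdboc and magic SysRq support.
KGDBOC_PARAM = Path("/sys/module/kgdboc/parameters/kgdboc")
SYSRQ_TRIGGER = Path("/proc/sysrq-trigger")


def arm_kgdboc(console="ttyS0", baud=4800, param_path=KGDBOC_PARAM):
    """Point kgdboc at a serial console at runtime; return the string written.

    Assumption: the serial port appears as ttyS0. The baud rate defaults to
    4800 only because that is what proved reliable in this thread.
    """
    spec = f"{console},{baud}"
    Path(param_path).write_text(spec + "\n")
    return spec


def break_into_debugger(trigger_path=SYSRQ_TRIGGER):
    """Software equivalent of SysRq-g: stop the kernel and wait for gdb."""
    Path(trigger_path).write_text("g\n")
```

Run as root, calling arm_kgdboc() and then break_into_debugger() should leave
the kernel waiting for gdb to attach over the serial line. Both functions take
the target path as a parameter so they can be tried against scratch files
first.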
I was able to get gdb on my Octane to connect to it, though one has to use
~4800 baud to make it reliable (could be the 30ft cat5 cable I'm using that
dislikes 9600 baud).

It looks like whatever this deadlock is, it locks the kernel up pretty hard:
even after stopping with SysRq-g and then continuing via gdb, once the
deadlock happens I cannot break into the debugger at all. Even triggering an
NMI via the MSC dumps nothing out of the kernel before the PROM resets.

The closest I've gotten to extracting info on the state of the machine is to
set the MSC debug switches to 0x1018 and then issue an immediate reset to have
it drop into POD dirty-exclusive as soon as possible. Then running "why"
sometimes nets me a valid kernel address in EPC that tells me where the POD
CPU last was. The downside is that I have four CPUs, and MSC POD locks up if I
try switching to any of the others, so I can't get a register dump off the
other three.

Another interesting note: sometimes when this deadlock happens, a soft reset
doesn't work. It seems like one of the HUBs is locked up, because the PROM is
unable to communicate with it:

2A 000: Done initializing klconfig.
2A 000: Discovering NUMAlink connectivity ......... DONE
2A 000: Found 2 objects (2 hubs, 0 routers) in 511413 usec
1B 000: Testing/Initializing memory ............... DONE
2A 000: Waiting for peers to complete discovery.... Reading link 0 (addr 0x92000000
2A 000: 00000004) failed
1B 000: CPU B switching to UALIAS
1B 000: CPU B now running out of UALIAS
2A 000: Reading link 0 (addr 0x9200000000000004) failed
1B 000: Skipping secondary cache diags
1B 000: CPU B switching stack into UALIAS and invalidating D-cache
1B 000: CPU B switching into node 0 cached RAM
1B 000: CPU B running cached
2A 000: Reading link 0 (addr 0x9200000000000004) failed
2A 000: Reading link 0 (addr 0x9200000000000004) failed

Then it gets a general exception and drops to POD Dex:

1B 000: Local Slave : Waiting for my NASID ...
1B 000: CPU B switching to UALIAS
1B 000: CPU B running in UALIAS
1B 000: CPU B Flushing and invalidating caches
1B 000: CPU B switching to node 0 cached RAM
1B 000: CPU B running cached
1A 000:
1A 000: *** General Exception on node 0
1A 000: *** EPC: 0xc00000001fc473dc (0xc00000001fc473dc)
1A 000: *** Press ENTER to continue.
1A 000: POD MSC Dex>

If this is a hardware lock up, that might explain why kgdb isn't useful at
that point. POD lets me dump the CRBs and PI error spool, but I'm not sure how
useful that information is w/o SGI's internal documents.

-- 
Joshua Kinard
Gentoo/MIPS
kumba@xxxxxxxxxx
6144R/F5C6C943 2015-04-27
177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943

"The past tempts us, the present confuses us, the future frightens us. And our
lives slip away, moment by moment, lost in that vast, terrible in-between."

--Emperor Turhan, Centauri Republic