On 03/19/2017 04:55, Ralf Baechle wrote:
> On Sun, Mar 19, 2017 at 03:23:39AM -0400, Joshua Kinard wrote:
>
>> The closest I've gotten to extracting info on the state of the machine is to
>> set the MSC debug switches to 0x1018 and then issue an immediate reset to have
>> it drop into POD dirty-exclusive as soon as possible. Then running "why"
>> sometimes nets me a valid kernel address in EPC that tells me where the POD CPU
>> was last at. Downside, I have four CPUs and MSC POD locks up if I try
>> switching to any of the other CPUs. So I can't get a register dump off of the
>> other three.
>
> Have you tried to send an NMI from the MSC? The PoD debugger is actually
> a fairly handy tool in such cases.

Oddly enough, sending NMIs from POD lets me switch which CPU I can run "why"
on. The POD "cpu" command doesn't appear to work right; it just locks up that
console instead. The NMIs let me grab the address in EPC from all four CPUs
after a lockup, and that's how I've been doing my debugging all weekend.

>> 2A 000: Done initializing klconfig.
>> 2A 000: Discovering NUMAlink connectivity ......... DONE
>> 2A 000: Found 2 objects (2 hubs, 0 routers) in 511413 usec
>> 1B 000: Testing/Initializing memory ............... DONE
>> 2A 000: Waiting for peers to complete discovery.... Reading link 0
>> (addr 0x92000000
>> 2A 000: 00000004) failed
>> 1B 000: CPU B switching to UALIAS
>> 1B 000: CPU B now running out of UALIAS
>> 2A 000: Reading link 0 (addr 0x9200000000000004) failed
>> 1B 000: Skipping secondary cache diags
>> 1B 000: CPU B switching stack into UALIAS and invalidating D-cache
>> 1B 000: CPU B switching into node 0 cached RAM
>> 1B 000: CPU B running cached
>> 2A 000: Reading link 0 (addr 0x9200000000000004) failed
>> 2A 000: Reading link 0 (addr 0x9200000000000004) failed
>
> I thought that kind of messages was indicating a hardware issue.

As far as I can tell, it doesn't appear to be. A cold boot usually resolves
this issue. It happens randomly, and it didn't happen at all on Sunday.

>> Then it gets a general exception and drops to POD Dex:
>> 1B 000: Local Slave : Waiting for my NASID ...
>> 1B 000: CPU B switching to UALIAS
>> 1B 000: CPU B running in UALIAS
>> 1B 000: CPU B Flushing and invalidating caches
>> 1B 000: CPU B switching to node 0 cached RAM
>> 1B 000: CPU B running cached
>> 1A 000:
>> 1A 000: *** General Exception on node 0
>> 1A 000: *** EPC: 0xc00000001fc473dc (0xc00000001fc473dc)
>> 1A 000: *** Press ENTER to continue.
>> 1A 000: POD MSC Dex>
>>
>> If this is a hardware lock up, that might explain why kgdb isn't useful at that
>> point. POD lets me dump the CRBs and PI error spool, but I'm not sure how
>> useful that information is w/o SGI's internal documents.
>
> I still haven't forgotten everything (I hope) so maybe you could post that
> information anyway just to use the small chance there might be something
> useful in there?

I hope you haven't forgotten everything. How many people left actually still
care about these platforms? :)

That said, the CRBs didn't contain anything useful. They were all initialized
to zero, except CRB D, which held only 0x00000000000000ff in all 15 of its
registers.

Here's what I spent all of Sunday and a little of Monday night doing:

1. Boot into userland, allow md to finish any resyncs, delete the dummy file
   "FOO" being created, and sync.

2. Re-run this dd command:

   dd if=/dev/urandom of=/usr/space/bonnie++/FOO bs=1M count=24000 status=progress

If that command actually completes, it takes roughly 35-40 minutes.
In a couple of runs it actually finished, but in most cases the system locked
up anywhere from 5 seconds after launching dd to 25 minutes later. I even made
a video capture of my two serial windows and the ssh console at the moment the
system locks up, which I can put on YouTube if anyone is interested.

Once the system locked up, I:

1.  Switched to the MSC.
2.  Set debug bits 0x1018 to force POD Dex mode.
3.  Reset the system.
4.  Ran "why" on the first CPU to come up and recorded the value of EPC.
5.  Issued an NMI to the other CPU on that node to force it into POD.
6.  Ran "why" and recorded the value of EPC.
7.  Switched to the IOC3 console.
8.  Issued an NMI to the other node to force it into POD on the MSC.
9.  Ran "why" and recorded EPC.
10. Repeated for the last CPU.
11. Cleared the debug bits, reset the system, and repeated all over again.

I don't know why issuing NMIs works to switch the active CPU in POD around.
POD's own "cpu" command just locks the console up when you try to switch CPUs.
I'm not sure if that's a bug in the POD software in my PROM image (which is
6.156, the latest SGI made, AFAIK).

I've also determined that EPC is the only register that reliably holds data
left over from the lockup; the PROM scrambles the remaining registers. I
verified this by periodically recompiling my kernel with different options or
a different compiler version so that the function addresses changed, and EPC
would still point at the addresses of very specific functions when the system
locked up.

Using the EPC values, I looked each one up in GDB for every cycle and kept a
list of where each of the four CPUs was when the machine stopped. Going by
that, I've compiled a non-exhaustive list of the most common functions, one or
more of which might be where the underlying problem lies:

mem_serial_in (./arch/mips/include/asm/io.h:428)
ioc3_mdio_read (drivers/net/ethernet/sgi/ioc3-eth.c:478)
arch_local_irq_save (arch/mips/lib/mips-atomic.c:66)
arch_local_irq_restore (arch/mips/lib/mips-atomic.c:109)
do_idle (kernel/sched/idle.c:154)
hub_rt_read (arch/mips/sgi-ip27/ip27-timer.c:145)
spin_dump (kernel/locking/spinlock_debug.c:58)

The ones that stand out the most to me are arch_local_irq_restore and
spin_dump. This is what GDB said about spin_dump:

(gdb) l *0xa800000000096ba0
0xa800000000096ba0 is in spin_dump (kernel/locking/spinlock_debug.c:56).
51
52      static void spin_dump(raw_spinlock_t *lock, const char *msg)
53      {
54              struct task_struct *owner = NULL;
55
56              if (lock->owner && lock->owner != SPINLOCK_OWNER_INIT)
57                      owner = lock->owner;
58              printk(KERN_EMERG "BUG: spinlock %s on CPU#%d, %s/%d\n",
59                      msg, raw_smp_processor_id(),
60                      current->comm, task_pid_nr(current));

That tells me the system was about to report a spinlock lockup somewhere
before it halted. I tried setting a kgdb breakpoint on line 56, but the
machine halts before it can actually hit the breakpoint: in one of my runs,
one of the CPUs was literally in the middle of executing "nmi_handler" from
genex.S, and I only saw that when I had an active breakpoint set.

So I think my suspicion that this is a spinlock lockup is on the right track,
and it's severe enough that it halts the entire machine. The problem is that
we're locking up so fast that the existing kernel infrastructure can't report
//what// spinlock is deadlocking. I can't rely on leftover data in the CPUs'
own registers, and I don't know if it's possible to dump kernel memory from
POD mode in a way that can be decoded into something useful to trace down what
happened.
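For reference, if I'm remembering the 4.x-era spinlock debug code correctly,
the path that lands in spin_dump() during a lockup is the CONFIG_DEBUG_SPINLOCK
slow path in kernel/locking/spinlock_debug.c. Paraphrased from memory and
trimmed, so treat this as a sketch of what my tree should contain rather than
a verbatim copy:

/*
 * Sketch of the CONFIG_DEBUG_SPINLOCK lock path, paraphrased from
 * kernel/locking/spinlock_debug.c of this era and trimmed for brevity.
 * spin_dump(), arch_spin_trylock(), __delay(), etc. all live elsewhere
 * in the kernel; this is only here to show how spin_dump() is reached.
 */
static void __spin_lock_debug(raw_spinlock_t *lock)
{
        u64 i;
        u64 loops = loops_per_jiffy * HZ;

        /* Spin for roughly one second's worth of trylock attempts. */
        for (i = 0; i < loops; i++) {
                if (arch_spin_trylock(&lock->raw_lock))
                        return;
                __delay(1);
        }

        /* Still not acquired: report a suspected lockup ... */
        spin_dump(lock, "lockup suspected");
#ifdef CONFIG_SMP
        trigger_all_cpu_backtrace();
#endif

        /* ... then fall back to a plain spin so the system can continue. */
        arch_spin_lock(&lock->raw_lock);
}

void do_raw_spin_lock(raw_spinlock_t *lock)
{
        debug_spin_lock_before(lock);
        if (unlikely(!arch_spin_trylock(&lock->raw_lock)))
                __spin_lock_debug(lock);
        debug_spin_lock_after(lock);
}

In other words, if EPC is sitting inside spin_dump(), that CPU has already
burned through on the order of loops_per_jiffy * HZ failed trylock attempts,
so the actual deadlock happened a while before the machine froze hard enough
to stop producing console output.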
---

The other standout function is arch_local_irq_restore. This is what that looks
like in GDB:

0xa800000000325148 is in arch_local_irq_restore (arch/mips/lib/mips-atomic.c:109).
104             "       .set    pop                             \n"
105             : [flags] "=r" (__tmp1)
106             : "0" (flags)
107             : "memory");
108
109             preempt_enable();
110     }
111     EXPORT_SYMBOL(arch_local_irq_restore);
112
113     #endif /* !CONFIG_CPU_MIPSR2 && !CONFIG_CPU_MIPSR6 */

Or the asm dump:

a800000000325130 <arch_local_irq_restore>:
a800000000325130:       40016000        mfc0    at,$12
a800000000325134:       30840001        andi    a0,a0,0x1
a800000000325138:       3421001f        ori     at,at,0x1f
a80000000032513c:       3821001f        xori    at,at,0x1f
a800000000325140:       00812025        or      a0,a0,at
a800000000325144:       40846000        mtc0    a0,$12
a800000000325148:       03e00008        jr      ra
a80000000032514c:       00000000        nop

Each time the machine has locked up with EPC pointing at
arch_local_irq_restore, it has always been at the address of that "jr ra"
instruction, regardless of which CPU I pulled the EPC value from. But I don't
know if that instruction is responsible for the lockup (I doubt it).

I am also noticing that sometimes when a lockup happens, one or more of the
other CPUs is in the middle of a read[bwlq]/write[bwlq] operation. In a test I
just ran before writing this e-mail, two of the CPUs were in the middle of
read/write tasks (this is from a kernel with my patchset on top, so this
function is from the IOC3 metadriver, but the same problem has been seen in
mem_serial_in from a vanilla lmo git tree):

CPU 2B:

0xa8000000003a6138 is in ioc3_serial_in (./arch/mips/include/asm/io.h:428).
423                                                             \
424     __BUILD_MEMORY_PFX(__raw_, bwlq, type)                  \
425     __BUILD_MEMORY_PFX(, bwlq, type)                        \
426     __BUILD_MEMORY_PFX(__mem_, bwlq, type)                  \
427
428     BUILDIO_MEM(b, u8)
429     BUILDIO_MEM(w, u16)
430     BUILDIO_MEM(l, u32)
431     BUILDIO_MEM(q, u64)
432

a8000000003a6120 <ioc3_serial_in>:
a8000000003a6120:       908200b1        lbu     v0,177(a0)
a8000000003a6124:       dc830020        ld      v1,32(a0)
a8000000003a6128:       00452804        sllv    a1,a1,v0
a8000000003a612c:       38a50003        xori    a1,a1,0x3
a8000000003a6130:       0065282d        daddu   a1,v1,a1
a8000000003a6134:       38a50003        xori    a1,a1,0x3
a8000000003a6138:       90a20000        lbu     v0,0(a1)
a8000000003a613c:       03e00008        jr      ra
a8000000003a6140:       304200ff        andi    v0,v0,0xff
a8000000003a6144:       00000000        nop

CPU 2A:

0xa8000000003e6410 is in qla1280_queuecommand (./arch/mips/include/asm/io.h:429).
424     __BUILD_MEMORY_PFX(__raw_, bwlq, type)                  \
425     __BUILD_MEMORY_PFX(, bwlq, type)                        \
426     __BUILD_MEMORY_PFX(__mem_, bwlq, type)                  \
427
428     BUILDIO_MEM(b, u8)
429     BUILDIO_MEM(w, u16)
430     BUILDIO_MEM(l, u32)
431     BUILDIO_MEM(q, u64)
432
433     #define __BUILD_IOPORT_PFX(bus, bwlq, type)             \

a8000000003e6360 <qla1280_queuecommand>:
a8000000003e6360:       67bdff90        daddiu  sp,sp,-112
a8000000003e6364:       ffbf0068        sd      ra,104(sp)
...
a8000000003e6404:       97c41728        lhu     a0,5928(s8)
a8000000003e6408:       64630078        daddiu  v1,v1,120
a8000000003e640c:       38630002        xori    v1,v1,0x2
a8000000003e6410:       94650000        lhu     a1,0(v1)
a8000000003e6414:       30a5ffff        andi    a1,a1,0xffff
a8000000003e6418:       0085182a        slt     v1,a0,a1
a8000000003e641c:       14600002        bnez    v1,a8000000003e6428 <qla1280_queuecommand+0xc8>

---

I've also tried dumping the HUB error state after reset by setting the debug
bits to 0x1000, and that says this:

Erecting partition fences ................ DONE
nasid 0
Reading peer hub nasid: 0x9200000021600000
xbow_update: updating slave nasid 1 on link 10
Update config for routers connected to hubs
Update config for hubs and hubless routers
CPU A flushing cache

Hardware Error State at System Reset
+ Errors on node Nasid 0x0 (0)
+  IP27 in /hw/module/1/slot/n2
+  HUB signalled following errors.
+  HUB error interrupt register: 0x200000
+   21: CPU A received uncorrectable error during uncached load
End Hardware Error State (at System Reset)

But I'm not 100% sure I can trust that, since I believe I've seen it after a
clean reboot from userland as well. Both vanilla lmo git and my patched tree
throw a panic if the kernel gets a HUB error interrupt. I have noticed that
sometimes, when the system panics, it halts the machine before any panic info
can be written out to the console, but over multiple runs there is still a
chance that at least one of the panics would have output something. Given the
number of times I've crashed this machine lately, I should have gotten
*something* out of it by now, so I'm discounting a HUB error interrupt as
what's locking things up. Just to be sure, I also replaced the panic() call
with a basic printk(), and still didn't get any output on the console.

I'll also have to enable the heavy diagnostics mode and let the machine check
itself, and maybe run bist from POD, to be really sure it's not a hardware
issue. I've got spare node boards in a closet that I can swap the CPU PIMMs
onto if I suspect something's wrong with one of the HUBs.

--
Joshua Kinard
Gentoo/MIPS
kumba@xxxxxxxxxx
6144R/F5C6C943 2015-04-27
177C 1972 1FB8 F254 BAD0  3E72 5C63 F4E3 F5C6 C943

"The past tempts us, the present confuses us, the future frightens us. And our
lives slip away, moment by moment, lost in that vast, terrible in-between."

--Emperor Turhan, Centauri Republic