Some more digging I did over the weekend on the IP27 lockup issue led
to some new observations. First off, I ended up changing the IRIX-style
HUB_S/HUB_L macros to use __raw_readq/__raw_writeq, since I know those
versions disable local interrupts, and I was running out of ideas to
try. That appears to have cleaned up the register state when I crash
the machine, and I appear to get usable data out of several registers
now (but I have to send an NMI to the CPU first, so it's some neat
hardware trick).

This leads to only one CPU appearing to be affected, either by a data
bus error or by getting "stuck". Stuck meaning that, in a few instances,
three CPUs were last in an idle loop and one was at the bottom of
arch_local_irq_restore from arch/mips/lib/mips-atomic.c. In other
instances, two or more CPUs were stuck in arch_local_irq_restore.

In the stuck case, a constant theme I keep noticing is that one of the
registers holds the address 0x9200000001030100, which is PI_RT_COUNT in
I/O space (the HUB's realtime counter) on node 1. It looks like some
kind of clash where a previous timer interrupt is leaving behind stale
data in the registers, and the resumed state of the CPU tries to use
that data and halts.

E.g., in one crash using a mainline kernel with minor patching to make
it work, this is the POD "why" and register dump off CPU 2A and 2B
(node 1, or 2nd node board):

2A 000: POD MSC Dex> why
2A 000: EPC    : 0xa800000000326fa8 (0xa800000000326fa8)
2A 000: ERREPC : 0xc00000001fc5ad5c (0xc00000001fc5ad5c)
2A 000: CACERR : 0x0000000000012108
2A 000: Status : 0x0000000024407c80
2A 000: BadVA  : 0xc0000000009c55c8 (0xc0000000009c55c8)
2A 000: RA     : 0xc00000001fc13500 (0xc00000001fc13500)
2A 000: SP     : 0xa800000000103650
2A 000: A0     : 0x0000000024400080
2A 000: Cause  : 0x0000000000008000 (INT:8-------)
2A 000: Reason : 249 (NMI while executing in PROM.)
2A 000: POD mode was called from: 0xc00000001fc02508 (0xc00000001fc02508)
2A 000: POD MSC Dex> pr
2A 000: r00/r0: 0x0000000000000000 r01/at: 0x0000000000000000
2A 000: r02/v0: 0x0000000000000000 r03/v1: 0x0000000000000001
2A 000: r04/a0: 0x0000000024400080 r05/a1: 0xffff920000000103
2A 000: r06/a2: 0x000000000000000a r07/a3: 0x0000000000000000
2A 000: r08/a4: 0x0000000000000000 r09/a5: 0x0000000000000020
2A 000: r10/a6: 0xa8000000001036e0 r11/a7: 0x0000000000000000
2A 000: r12/t0: 0x000000000000002a r13/t1: 0x000000000000004c
2A 000: r14/t2: 0x0000000000000068 r15/t3: 0x00000140a4436020
2A 000: r16/s0: 0x0000000000dd7210 r17/s1: 0x000000000000008c
2A 000: r18/s2: 0xffffffffffffffff r19/s3: 0xc00000001fc74d70
2A 000: r20/s4: 0x00000000000f4240 r21/s5: 0xa800000000103a0f
2A 000: r22/s6: 0x0000000000000000 r23/s7: 0x00000000000000fc
2A 000: r24/t8: 0x0000000000000001 r25/t9: 0x0000000000000001
2A 000: r26/k0: 0x9200000001220050 r27/k1: 0x000000000000001e
2A 000: r28/gp: 0xc00000001fce5028 r29/sp: 0xa800000000103650
2A 000: r30/fp: 0x0000000000000000 r31/ra: 0xc00000001fc13500

2B 000: POD MSC Dex> why
2B 000: EPC    : 0xa800000000326fa8 (0xa800000000326fa8)
2B 000: ERREPC : 0xc00000001fc5ad4c (0xc00000001fc5ad4c)
2B 000: CACERR : 0x0000000008080c10
2B 000: Status : 0x0000000024407c80
2B 000: BadVA  : 0xc0000000007f8768 (0xc0000000007f8768)
2B 000: RA     : 0xc00000001fc398d8 (0xc00000001fc398d8)
2B 000: SP     : 0xa8000000001033e0
2B 000: A0     : 0x0000000024400080
2B 000: Cause  : 0x0000000000008000 (INT:8-------)
2B 000: Reason : 249 (NMI while executing in PROM.)
2B 000: POD mode was called from: 0xc00000001fc02508 (0xc00000001fc02508)
2B 000: r00/r0: 0x0000000000000000 r01/at: 0x0000000000000000
2B 000: r02/v0: 0x00000000000000fe r03/v1: 0x0000000000000001
2B 000: r04/a0: 0x0000000024400080 r05/a1: 0x9200000001030100
2B 000: r06/a2: 0x000000000000000a r07/a3: 0x0000000000000000
2B 000: r08/a4: 0x0000000000000002 r09/a5: 0xa800000000103578
2B 000: r10/a6: 0x0000000000000060 r11/a7: 0xa8000000001035d8
2B 000: r12/t0: 0x00000000000f4240 r13/t1: 0x0000000000000002
2B 000: r14/t2: 0x0000000000000001 r15/t3: 0x0000000000000001
2B 000: r16/s0: 0x9200000001a20000 r17/s1: 0x00000000000000e3
2B 000: r18/s2: 0x00000000000000eb r19/s3: 0x00000000011ad00c
2B 000: r20/s4: 0x00000000000000eb r21/s5: 0x0000000000000000
2B 000: r22/s6: 0x0000000000000000 r23/s7: 0x00000000000000fc
2B 000: r24/t8: 0x0000000000000004 r25/t9: 0x0000000000000001
2B 000: r26/k0: 0x9200000001220058 r27/k1: 0x000000000000001e
2B 000: r28/gp: 0xc00000001fce5028 r29/sp: 0xa8000000001033e0
2B 000: r30/fp: 0x0000000000000000 r31/ra: 0xc00000001fc398d8

The address 0xa800000000326fa8 in both CPUs' EPC registers is in
arch_local_irq_restore, line 109:

a800000000326f90 <arch_local_irq_restore>:
    a800000000326f90: 40016000 mfc0 at,$12
    a800000000326f94: 30840001 andi a0,a0,0x1
    a800000000326f98: 3421001f ori at,at,0x1f
    a800000000326f9c: 3821001f xori at,at,0x1f
    a800000000326fa0: 00812025 or a0,a0,at
    a800000000326fa4: 40846000 mtc0 a0,$12
--> a800000000326fa8: 03e00008 jr ra
    a800000000326fac: 00000000 nop

This bit of assembly only uses registers a0 and at. In the above
register dump, a0 appears to hold some older value of $12 (CP0_STATUS),
and at is zeroed out (possibly from the NMI restart). Additionally, k0
is pointing at the I/O address for an LED soldered on the node board for
each CPU, and for CPU 2B, s0 holds an I/O address in the HUB's MD
register space, specifically the MSC UART (so both are probably junk
data from my use of MSC in POD Dex mode).
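To make the a0/at reasoning easier to check by hand, here is a minimal user-space sketch in C of what the six instructions ahead of the jr compute. It is only a model under stated assumptions: the name model_irq_restore and its parameters are made up for illustration, the real routine works on CP0_STATUS through mfc0/mtc0, and nothing here touches hardware.

```c
#include <stdint.h>

/*
 * Hypothetical user-space model of the disassembly above; the function
 * name and parameters are illustrative only. 'status' stands in for the
 * value mfc0 reads from $12, 'flags' for the incoming a0 argument.
 */
static uint32_t model_irq_restore(uint32_t status, uint32_t flags)
{
	uint32_t a0 = flags & 0x1;	/* andi a0,a0,0x1  : keep only bit 0 */
	uint32_t at = status | 0x1f;	/* ori  at,at,0x1f : set bits 4:0 */

	at ^= 0x1f;			/* xori at,at,0x1f : bits 4:0 now clear */
	a0 |= at;			/* or   a0,a0,at */
	return a0;			/* value handed to mtc0 a0,$12 */
}
```

With this model, model_irq_restore(0x24407c9f, 1) gives 0x24407c81 and model_irq_restore(0x24407c9f, 0) gives 0x24407c80. The a0 value in both dumps, 0x0000000024400080, has bits 4:0 clear, so it is at least consistent with being the output of this sequence for a flags value with bit 0 clear, though that alone doesn't prove where the value came from.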
What's interesting to me in this specific instance is that register a1
on CPU 2B holds PI_RT_COUNT (0x9200000001030100), but on CPU 2A, the
same value looks like it's been overwritten by shifting right 16 bits,
with bits 63:48 filled in with 1's. That register isn't used by
arch_local_irq_restore, but it may have been used by a previous state of
the CPU, which is why I suspect the registers are getting accidentally
clobbered.

The constant theme is seeing PI_RT_COUNT's address popping up in
registers that look like they contain data from other CPU states. That
suggests to me that it's some kind of race that may involve changing the
CPU's interrupt state as well as the IP27 timer code, because I know
that on IP27, we have one counter but two compare registers, and the
timer interrupt is always firing.

However, in hub_rt_read, all reads from PI_RT_COUNT appear to be locked
to the node of the first CPU (CPU 0) via REMOTE_HUB_L:

static u64 hub_rt_read(struct clocksource *cs)
{
	return REMOTE_HUB_L(cputonasid(0), PI_RT_COUNT);
}

I wonder if this clashes at all with rt_next_event, in which both CPUs
can access PI_RT_COUNT locally with LOCAL_HUB_L:

static int rt_next_event(unsigned long delta, struct clock_event_device *evt)
{
	unsigned int cpu = smp_processor_id();
	int slice = cputoslice(cpu);
	unsigned long cnt;

	cnt = LOCAL_HUB_L(PI_RT_COUNT);
	cnt += delta;
	LOCAL_HUB_S(PI_RT_COMPARE_A + PI_COUNT_OFFSET * slice, cnt);

	return LOCAL_HUB_L(PI_RT_COUNT) >= cnt ? -ETIME : 0;
}

If you take this as a possible contention point and then hammer the
system with lots of I/O activity, such as writing out a large file to
disk while maybe also transmitting data over serial or ethernet, you get
to a point where you somehow deadlock the HUB on a specific nodeboard.
That stops the system dead, because Linux doesn't currently set up
detailed HUB link error reporting, so we don't get signaled about one
HUB losing contact with the other HUB.

Does that sound at all plausible?
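As a sanity check on the two a1 values discussed above, this small C sketch reproduces the CPU 2A value from the CPU 2B one. It assumes nothing about how the value was actually produced on the machine; it only shows that shifting 0x9200000001030100 right by 16 and filling bits 63:48 with ones is the same thing as an arithmetic (sign-extending) right shift, because bit 63 of the address is set. The helper name sra64 is made up for this example.

```c
#include <stdint.h>

/*
 * Hypothetical helper: arithmetic shift right of a 64-bit value, written
 * with unsigned operations so the result does not depend on how the
 * compiler treats >> on negative signed integers.
 * Only valid for shift values 0..63.
 */
static uint64_t sra64(uint64_t value, unsigned int shift)
{
	uint64_t result = value >> shift;	/* logical shift, zero fill */

	/* If bit 63 was set, fill the vacated high bits with ones. */
	if (shift != 0 && (value >> 63) != 0)
		result |= ~UINT64_C(0) << (64 - shift);
	return result;
}
```

sra64(UINT64_C(0x9200000001030100), 16) yields 0xffff920000000103, which matches a1 on CPU 2A. On MIPS64 that is the result a dsra by 16 would give for this input, but whether such an instruction was actually involved is only a guess.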
-- 
Joshua Kinard
Gentoo/MIPS
kumba@xxxxxxxxxx
6144R/F5C6C943 2015-04-27
177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943

"The past tempts us, the present confuses us, the future frightens us.
And our lives slip away, moment by moment, lost in that vast, terrible
in-between."

--Emperor Turhan, Centauri Republic