Re: IP30: SMP, Almost there?

Joshua Kinard <kumba@xxxxxxxxxx> · Mon, 18 May 2015 08:01:07 -0400

On 05/18/2015 01:39, Joshua Kinard wrote:
> So I've gotten the second CPU in Octane to "tick" again...somehow.  I am
> certain someone's cat went missing in the process...
> 
> Anyways, it's booting into an initramfs and dying almost immediately with
> errors from do_page_fault:
> 
> [   15.631359] do_page_fault(): sending SIGSEGV to init for invalid write
> access to 0000000000000338
> [   15.631395] epc = 0000000000478474 in busybox[400000+110000]
> [   15.631408] ra  = 000000000047843c in busybox[400000+110000]
> 
> Segmentation fau[   17.399304] Instruction bus error, epc == 000000000041c000,
> ra == 000000000041c5c8
> lt
> [   17.442702] Kernel panic - not syncing: Attempted to kill init!
> exitcode=0x0000000a
> [   17.442702]
> [   17.470272] ---[ end Kernel panic - not syncing: Attempted to kill init!
> exitcode=0x0000000a
> 
> 
> So after some digging around, I found this thread from way back in 2006 that
> seems almost identical:
> http://www.linux-mips.org/archives/linux-mips/2006-09/msg00169.html
> 
> However, none of the stuff regarding flush_icache_range seems to be around nor
> relevant anymore.  But I did comment out one of the #if 0's in
> arch/mips/mm/fault.c and got this output:
> [   16.755572] Cpu0[init:1:0000000000520378:1:a800000020360bfc]
> [   16.772869] Cpu0[init:1:000000007ff45fb0:1:a80000002001cec4]
> [   16.790102] Cpu0[init:1:0000000000400160:0:0000000000400160]
> [   16.807563] Cpu0[init:1:000000000041c000:0:000000000041c000]
> [   16.825027] Cpu0[init:1:0000000000521ff8:1:0000000000402380]
> [   16.842141] Cpu0[init:1:0000000000522010:1:00000000004023d8]
> [   16.859289] Cpu0[init:1:0000000000422a6c:0:0000000000422a6c]
> [   16.876768] Cpu0[init:1:000000000051fffc:0:0000000000400320]
> [   16.893915] Cpu0[init:1:00000000004ddaf4:0:00000000004ddaf4]
> [   16.911389] Cpu0[init:1:000000000094d008:1:000000000040519c]
> [   16.928527] Cpu0[init:1:00000000004e7d9b:0:0000000000404aec]
> [   16.946000] Cpu0[init:1:0000000000503cde:0:0000000000428994]
> [   16.963441] Cpu0[init:1:000000000047f2d4:0:000000000047f2d4]
> [   16.980945] Cpu0[init:1:00000000004f76e8:0:000000000047f380]
> [   16.998410] Cpu0[init:1:000000000094eff8:1:00000000004051a0]
> [   17.015596] Cpu0[init:1:000000007ff449c8:0:a80000002001d668]
> [   17.032716] Cpu0[init:1:000000007ff449d0:1:a800000020360a48]
> [   17.050655] Cpu0[init:1:000000000094fff8:1:00000000004051a0]
> [   17.068127] Cpu0[init:1:0000000000950ff8:1:00000000004051a0]
> [   17.085615] Cpu0[init:1:0000000000952ff8:1:00000000004051a0]
> [   17.102741] Cpu0[init:1:0000000000951000:1:0000000000472fc8]
> [   17.121391] Cpu0[init:1:0000000000953ff8:1:00000000004051a0]
> [   17.138756] Cpu0[init:1:0000000000954ff8:1:00000000004051a0]
> [   17.156542] Cpu0[init:1:000000007ff44de8:1:0000000000403398]
> [   15.613954] Cpu1[init:75:000000000040c1a0:0:000000000040c1a0]
> [   15.614065] Cpu1[init:75:000000007ff44de8:1:0000000000403398]
> [   15.631203] Cpu1[init:75:0000000000413b58:0:0000000000413b58]
> [   15.631276] Cpu1[init:75:000000000047843c:0:000000000047843c]
> [   15.631336] Cpu1[init:75:0000000000000338:1:0000000000478474]
> 
> The invalid address (I believe what is effectively a NULL) of
> 0x0000000000000338 is pretty consistent with the netboot.  Sometimes I get a
> panic in a mutex*slowpath function (I forget which one).  But it's way more
> predictable with this netboot than with the disks inserted.
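
For anyone reading the trace above, here is a minimal sketch of a decoder for
those debug lines. It assumes the fields are cpu, comm, pid, faulting address,
write flag and epc, in that order, which is how I read the printk in
arch/mips/mm/fault.c. The struct and function names are made up for
illustration.

```c
#include <stdio.h>
#include <string.h>

/* One decoded "CpuN[comm:pid:address:write:epc]" debug line.
 * Field order is an assumption based on the fault.c printk. */
struct fault_rec {
	int cpu;                  /* CPU that took the fault */
	char comm[16];            /* task name, e.g. "init" */
	int pid;                  /* task pid */
	unsigned long long addr;  /* faulting virtual address */
	int write;                /* 1 = write access, 0 = read/ifetch */
	unsigned long long epc;   /* program counter at the fault */
};

/* Returns 0 on success, -1 if the line does not match the format.
 * Any dmesg timestamp in front of "Cpu" is skipped. */
static int parse_fault_line(const char *line, struct fault_rec *r)
{
	const char *p = strstr(line, "Cpu");
	int n;

	if (!p)
		return -1;
	n = sscanf(p, "Cpu%d[%15[^:]:%d:%llx:%d:%llx",
		   &r->cpu, r->comm, &r->pid, &r->addr, &r->write, &r->epc);
	return n == 6 ? 0 : -1;
}
```

Fed the last Cpu1 line, this gives a write to 0x338 at epc 0x478474, which
matches the address and epc in the do_page_fault message.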

Apparently, setting cca=5 on the kernel command line improves things.  The
netboot can now load busybox ash and move around, but booting the real userland
is still very problematic (the XFS root filesystem pretty much blows up on
mount).

What is the relationship between the cache-coherency algorithm and SMP?  IP30
hardware is supposed to be cache-coherent.  A value of '5' sets the processors
to "cacheable coherent exclusive on write" (per the R10K manual), but I am not
sure why things are still flaky even with that setting.
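
For reference, a small sketch of how the CCA values map to names.  Only the
entry for 5 is quoted from the manual above; the other entries are from my
memory of the R10K manual's table and should be checked against it.  The
helper names are hypothetical, and the bit position assumes the usual MIPS
layout with the 3-bit C field at bits 5:3 of EntryLo.

```c
#include <stdio.h>

/* Name for an R10000 cache coherency attribute (CCA) value.
 * Only value 5 is confirmed above; the rest are recalled from
 * the R10K manual and are an assumption here. */
static const char *r10k_cca_name(unsigned int cca)
{
	switch (cca) {
	case 2: return "uncached";
	case 3: return "cacheable noncoherent";
	case 4: return "cacheable coherent exclusive";
	case 5: return "cacheable coherent exclusive on write";
	case 7: return "uncached accelerated";
	default: return "reserved";
	}
}

/* Extract the CCA from an EntryLo value, assuming the standard
 * MIPS layout (C field in bits 5:3). */
static unsigned int entrylo_cca(unsigned long entrylo)
{
	return (unsigned int)((entrylo >> 3) & 0x7);
}
```

So a page mapped with cca=5 should show 5 in the C field of its EntryLo
values; dumping the TLB and running the entries through something like
entrylo_cca() would be one way to confirm the setting actually took effect.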

--J