On 06/01/2015 01:08, Joshua Kinard wrote:
> On 05/23/2015 23:17, Joshua Kinard wrote:
>> On 05/18/2015 01:39, Joshua Kinard wrote:
>>> So I've gotten the second CPU in Octane to "tick" again...somehow. I am
>>> certain someone's cat went missing in the process...
>>
>> So, yeah, the problem appears to be specific to the R14000 CPU module. I
>> swapped in an R12K dual CPU module, and after a little bit of tinkering to
>> revert a few hacks and clean up the code, it boots into SMP, mounts the
>> userland, and has successfully sync'ed a Gentoo Portage tree w/o annihilating
>> the XFS filesystem or the MD RAID5 array. Even compiled a few C files.
>>
> [snip]
>>
>> I even got the IRQs to be fanned out across both CPUs. Well, primarily the
>> qla1280 drivers. They randomly hop between both CPUs, but no ill effects so far.
>>
>> But if I boot that *same* working kernel on an R14000 dual module, I get handed
>> an IBE as soon as the userland mounts. The only documented differences that I
>> can find on the R14000 is that it supports DDR memory, being able to do memory
>> operations on the rising edge and falling edge of each clock. Not sure if that
>> matters to the kernel at all, but I know of nothing else that describes the
>> R14K's internals, such as if there's some new bit in CP0 config,
>> branch-diagnostic, status, etc, that might explain why these IBE's are happening.
>>
>> Guess I need to hunt down my old dual R10K module next and verify that works
>> fine...
>>
>> Also, is there a way to hardcode the cca=5 setting for IP30? Maybe it needs to
>> be a hidden Kconfig item?. I tried setting cpu->writecombine in cpu-probe.c,
>> but no dice there. If I boot an SMP kernel on dual R12K's w/o cca=5, I'll get
>> one or two pretty-specific oopses. The one I did grab complains about bad
>> spinlock magic in the core tty driver somewhere. I can transcribe that oops
>> later on if interested.
>
> So far, the problem looks to have been blindly assigning all 64 HEART IRQs to
> 'handle_level_irq', including the SMP IPI IRQs. I fixed that by assigning the
> four IPI IRQs and four unused debug IRQs to 'handle_percpu_irq'. So far, no
> bus errors, even on R14000. Also successfully tested 16KB PAGE_SIZE and no bus
> errors. Next, 64KB PAGE_SIZE w/ CONFIG_TRANSPARENT_HUGEPAGE, which was pretty
> good at triggering bus errors.
>
>

</jinx>

CONFIG_TRANSPARENT_HUGEPAGE + HUGETLBFS is still not quite right on R14K CPUs.
I can very easily trigger bus errors with that config by running 'sync', 'ls',
or 'swapon'/'swapoff' in rapid succession in a minimal bash shell
(init=/bin/bash). But this is reproducible even with a single R14K module, so
it has to be a different problem from the SMP one.

At least 16KB and 64KB PAGE_SIZE seem to work well enough now. Progress!

Also, is there a clear-cut explanation of the difference between
read[bwlq]/write[bwlq] and the raw/__raw/____raw variants? Which are safe to
use in machine-specific code (like the SMP or IRQ setup code) versus elsewhere?
Any warnings, gotchas, etc. that one has to be aware of?

--J
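
P.S. For anyone following along, here is a minimal user-space sketch of the
flow-handler split described in the quoted part above: the four IPI IRQs and
the four unused debug IRQs get the per-CPU flow, and everything else stays on
the level flow. It is only a model, not the actual IP30 code. The IRQ numbers
(HYP_DEBUG_FIRST, HYP_IPI_FIRST) and the function names are hypothetical
placeholders. In the kernel, the result would presumably pick between
handle_percpu_irq and handle_level_irq when registering each IRQ.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Illustrative model only. HEART has 64 IRQs (per the mail above), but the
 * positions of the IPI and debug IRQs below are hypothetical placeholders.
 */
#define HEART_NUM_IRQS   64
#define HYP_DEBUG_FIRST  42   /* hypothetical: first of 4 unused debug IRQs */
#define HYP_IPI_FIRST    46   /* hypothetical: first of 4 SMP IPI IRQs */
#define HYP_GROUP_SIZE   4

/* Stand-ins for the two kernel flow handlers named in the mail. */
enum flow_kind {
    FLOW_LEVEL,    /* models handle_level_irq */
    FLOW_PERCPU    /* models handle_percpu_irq */
};

static bool in_range(unsigned irq, unsigned first, unsigned count)
{
    return irq >= first && irq < first + count;
}

/* Choose the flow handler for one HEART IRQ instead of blindly using level. */
enum flow_kind heart_pick_flow(unsigned hwirq)
{
    assert(hwirq < HEART_NUM_IRQS);

    if (in_range(hwirq, HYP_IPI_FIRST, HYP_GROUP_SIZE) ||
        in_range(hwirq, HYP_DEBUG_FIRST, HYP_GROUP_SIZE))
        return FLOW_PERCPU;

    return FLOW_LEVEL;
}

/* Count how many IRQs end up on the per-CPU flow (a sanity check). */
unsigned heart_count_percpu(void)
{
    unsigned n = 0;

    for (unsigned i = 0; i < HEART_NUM_IRQS; i++)
        if (heart_pick_flow(i) == FLOW_PERCPU)
            n++;

    return n;
}
```

With these placeholder ranges, heart_count_percpu() accounts for exactly the
eight IRQs mentioned above (four IPI plus four debug), and the remaining 56
stay on the level flow.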