Re: IP30: SMP, Almost there!

Joshua Kinard <kumba@xxxxxxxxxx> · Mon, 01 Jun 2015 02:00:10 -0400

On 06/01/2015 01:08, Joshua Kinard wrote:
> On 05/23/2015 23:17, Joshua Kinard wrote:
>> On 05/18/2015 01:39, Joshua Kinard wrote:
>>> So I've gotten the second CPU in Octane to "tick" again...somehow.  I am
>>> certain someone's cat went missing in the process...
>>
>> So, yeah, the problem appears to be specific to the R14000 CPU module.  I
>> swapped in an R12K dual CPU module, and after a little bit of tinkering to
>> revert a few hacks and clean up the code, it boots into SMP, mounts the
>> userland, and has successfully sync'ed a Gentoo Portage tree w/o annihilating
>> the XFS filesystem or the MD RAID5 array.  Even compiled a few C files.
>>
> [snip]
>>
>> I even got the IRQs to be fanned out across both CPUs.  Well, primarily the
>> qla1280 drivers.  They randomly hop between both CPUs, but no ill effects so far.
>>
>> But if I boot that *same* working kernel on an R14000 dual module, I get handed
>> an IBE as soon as the userland mounts.  The only documented difference that I
>> can find on the R14000 is that it supports DDR memory, being able to do memory
>> operations on the rising edge and falling edge of each clock.  Not sure if that
>> matters to the kernel at all, but I know of nothing else that describes the
>> R14K's internals, such as if there's some new bit in CP0 config,
>> branch-diagnostic, status, etc, that might explain why these IBE's are happening.
>>
>> Guess I need to hunt down my old dual R10K module next and verify that works
>> fine...
>>
>> Also, is there a way to hardcode the cca=5 setting for IP30?  Maybe it needs to
>> be a hidden Kconfig item?  I tried setting cpu->writecombine in cpu-probe.c,
>> but no dice there.  If I boot an SMP kernel on dual R12K's w/o cca=5, I'll get
>> one or two pretty-specific oopses.  The one I did grab complains about bad
>> spinlock magic in the core tty driver somewhere.  I can transcribe that oops
>> later on if interested.
> 
> So far, the problem looks to have been blindly assigning all 64 HEART IRQs to
> 'handle_level_irq', including the SMP IPI IRQs.  I fixed that by assigning the
> four IPI IRQs and four unused debug IRQs to 'handle_percpu_irq'.  So far, no
> bus errors, even on R14000.  Also successfully tested 16KB PAGE_SIZE and no bus
> errors.  Next, 64KB PAGE_SIZE w/ CONFIG_TRANSPARENT_HUGEPAGE, which was pretty
> good at triggering bus errors.
> 
> </jinx>
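
To make the handler split above concrete, here is a minimal userspace sketch
of the selection logic.  It is not the actual IP30 code: the position of the
per-CPU IRQs inside the 64 HEART IRQs (EX_PERCPU_FIRST) is a made-up
placeholder, and pick_flow()/count_flow() are hypothetical helpers.  In the
kernel, the result would decide whether an IRQ is registered with
'handle_level_irq' or 'handle_percpu_irq'.

```c
#include <assert.h>

#define HEART_NUM_IRQS   64

/*
 * ASSUMPTION: placeholder location for the eight per-CPU IRQs (four IPI
 * plus four unused debug).  The real bit numbers come from the HEART
 * headers/documentation, not from this sketch.
 */
#define EX_PERCPU_FIRST  42
#define EX_PERCPU_COUNT  8

enum flow {
	FLOW_LEVEL,	/* would map to handle_level_irq  */
	FLOW_PERCPU,	/* would map to handle_percpu_irq */
};

/* Pick the flow handler type for one HEART IRQ number. */
enum flow pick_flow(unsigned int irq)
{
	assert(irq < HEART_NUM_IRQS);

	if (irq >= EX_PERCPU_FIRST &&
	    irq < EX_PERCPU_FIRST + EX_PERCPU_COUNT)
		return FLOW_PERCPU;

	return FLOW_LEVEL;
}

/* Count how many of the 64 IRQs end up with the given flow type. */
unsigned int count_flow(enum flow want)
{
	unsigned int irq, n = 0;

	for (irq = 0; irq < HEART_NUM_IRQS; irq++)
		if (pick_flow(irq) == want)
			n++;

	return n;
}
```

With these placeholder numbers, count_flow(FLOW_PERCPU) covers the eight
IPI/debug IRQs and everything else stays level-triggered, instead of all 64
being treated the same way.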

CONFIG_TRANSPARENT_HUGEPAGE + HUGETLBFS is still not quite right on R14K CPUs.
I can very easily trip up bus errors with that config by running 'sync', 'ls',
or 'swapon'/'swapoff' in rapid succession in a minimal bash shell
(init=/bin/bash).  But this was doable even with a single R14K module, so it
has to be a different problem.

At least 16KB and 64KB PAGE_SIZE seem to work well enough now.  Progress!

Also, is there a clear-cut explanation of the difference between
read[bwlq]/write[bwlq] and the raw/__raw/____raw variants?  Which are safe to
use in machine-specific code (like the SMP or IRQ setup code) versus elsewhere?
Any warnings, gotchas, etc. one has to be aware of?

--J