Re: IP30: SMP, Almost there?

Joshua Kinard <kumba@xxxxxxxxxx> · Sat, 23 May 2015 18:57:50 -0400

On 05/22/2015 12:38, Ralf Baechle wrote:
> On Thu, May 21, 2015 at 02:00:09AM -0400, Joshua Kinard wrote:
> 
>> Where I am lost is, though, why would I get an IBE on a 'beqz' instruction?
>> It's a valid instruction from MIPS-I ('beqz' is just 'beq' w/ $0 as rt).  The
>> R10K Manual states this:
>>
>> """
>> A Bus Error exception occurs when a processor block read, upgrade, or
>> double/single/partial-word read request receives an external ERR completion
>> response, or a processor double/single/partial-word read request receives an
>> external ACK completion response where the associated external
>> double/single/partial-word data response contains an uncorrectable error. This
>> exception is not maskable.
>> """
>>
>> My guess is there's still something not kosher with icache flushing somewhere.
>>  I can reboot this kernel multiple times and not always get the same IBE.  Most
> 
> Not flushing, or improperly flushing, the I-cache will result in stale
> instructions getting executed.  An IBE error otoh is the result of a bus
> error being signalled for the CPU's attempt to load instructions from
> memory.  With the exception of a few special cases, I-cache flushing doesn't
> happen when executing kernel code, but only for userland.  It's also
> somewhat unlikely for improper I-cache flushing to result in an IBE error.

Well, the IBEs are happening in userland, loading init, on CPU1.  I hacked
together a basic bus error handler based on IP27's and, using that, instead of
seeing four IBEs in a row, I can get CPU1 to stall and dump whatever debug
data I want.  Downside is, I've only got the Odyssey early console available,
so I have to take pictures of the debug text or oops data, then manually type
it into a text file.
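
For reference, here is a rough sketch of the kind of decoding such a debug dump
can do.  It is not the actual IP30 handler.  The field positions (ExcCode in
Cause bits 6:2, BD in bit 31, IBE = 6, DBE = 7) and the 'beq' encoding
(opcode 4, rt in bits 20:16) are architectural, but the helper names
(cause_exccode, faulting_pc, is_beqz, format_buserr) are made up for
illustration.  Since bus errors are imprecise, the address computed here is
only what the exception gets attributed to, not necessarily the real culprit.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* MIPS CP0 Cause register fields (architecturally defined). */
#define CAUSE_EXCCODE_SHIFT 2
#define CAUSE_EXCCODE_MASK  0x1fu
#define CAUSE_BD            (1u << 31)  /* taken in a branch delay slot */

#define EXCCODE_IBE 6   /* bus error on instruction fetch */
#define EXCCODE_DBE 7   /* bus error on data load/store */

/* Extract the 5-bit exception code from a saved Cause value. */
static unsigned cause_exccode(uint32_t cause)
{
    return (cause >> CAUSE_EXCCODE_SHIFT) & CAUSE_EXCCODE_MASK;
}

/*
 * Address the exception is attributed to.  With BD set, EPC points at
 * the branch and the attributed instruction sits in its delay slot.
 */
static uint64_t faulting_pc(uint64_t epc, uint32_t cause)
{
    return (cause & CAUSE_BD) ? epc + 4 : epc;
}

/* True if the instruction word is 'beq rs, $0, off', i.e. 'beqz rs, off'. */
static int is_beqz(uint32_t insn)
{
    unsigned opcode = insn >> 26;
    unsigned rt = (insn >> 16) & 0x1f;
    return opcode == 0x04 && rt == 0;
}

/* Branch target: address of the delay slot plus the signed offset * 4. */
static uint64_t branch_target(uint64_t pc, uint32_t insn)
{
    int64_t off = (int16_t)(insn & 0xffff);
    return pc + 4 + (uint64_t)(off * 4);
}

/* Hypothetical one-line summary, short enough to photograph off a console. */
static int format_buserr(char *buf, size_t len, unsigned cpu,
                         uint32_t cause, uint64_t epc)
{
    assert(buf != NULL && len > 0);
    unsigned code = cause_exccode(cause);
    const char *kind = code == EXCCODE_IBE ? "IBE" :
                       code == EXCCODE_DBE ? "DBE" : "other";
    return snprintf(buf, len, "CPU%u %s cause=%08x epc=%016llx pc=%016llx",
                    cpu, kind, (unsigned)cause,
                    (unsigned long long)epc,
                    (unsigned long long)faulting_pc(epc, cause));
}
```

In a real handler the Cause and EPC values would come from the saved register
frame, and the instruction word would be read from the attributed address,
which may itself fault if the fetch path is what is broken.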

Further experimenting with a dual R12K module suggests that whatever the
problem is, it's got something to do with the R14K.  I'm having better success
with the R12K dual module thus far.  More on that later...

> A huge problem tracking down the cause of a bus error is that they're
> getting signalled by an external agent, that is, they are not generated by
> the CPU itself, and there may be a significant delay until the CPU
> actually takes the exception.  In my experience the EPC is practically
> always worthless in tracking down the cause of the bus error.  Details
> depend on circumstances, as usual.

I thought that agent might be HEART, but the HEART_CAUSE register reads
0x00000000 when an IBE happens, which suggests HEART saw no errors on its end.

How does one probe the SysAD bus?  The R10K documentation has some breakdown of
the bit format of SysAD messages.  Is there a memory address somewhere that can
be used to read data off the bus or even talk to it to get error information
(like, does it have a CAUSE register or something)?

Otherwise, figuring out what's wrong with the R14K is going to take a long time...

--J