Re: IP30: SMP, Almost there?

On Fri, May 22, 2015 at 01:01:01PM +0100, Maciej W. Rozycki wrote:

> > Where I am lost is, though, why would I get an IBE on a 'beqz' instruction?
> 
>  A bus error is an external event, a signal asserted to the CPU by bus 
> logic on a failed read cycle.  Whether you get a Data or Instruction Bus 
> Error exception (DBE vs IBE) merely depends on whether it was a data read 
> or an instruction fetch cycle.  The class of the error is only resolved by 
> the CPU internally as obviously any external logic does not know the 
> reason the CPU put the read cycle on the bus for that failed.  Note that 
> the read cycle might well be a part of a cache fill.
> 
>  As it is a failure of a read that causes a bus error, it does not matter 
> whether the instruction that was supposed to be fetched is valid or not.  
> It has never been successfully fetched let alone decoded.  For an invalid 
> instruction that has been fetched and decoded you'd get a Reserved 
> Instruction exception instead.
> 
>  A typical reason for a bus error is a bus timeout, where no target on the 
> bus responded to a cycle, a parity error of data presented on the bus or 
> an uncorrected (multi-bit) memory access ECC error, driven by the memory 
> controller in parallel to data presented.
> 
>  NB bus errors on write cycles, such as a bus timeout or an ECC error on a 
> partial memory update (e.g. an uncached byte write), are asynchronous and 
> normally do not cause a Bus Error exception.  A hardware interrupt is 
> typically issued instead.
> 
> > My guess is there's still something not kosher with icache flushing somewhere.
> 
>  That would be odd.  Even if the state of the cache was inconsistent, I'd 
> expect a Cache Error exception at worst, and rubbish returned typically, 
> rather than a Bus Error exception.
> 
> > Anyone got ideas?  Is there some way to dump the contents of the icache and/or
> > dcache for debugging?
> 
>  I'd rather expect an uncorrected ECC error being the cause here, maybe 
> you need to clean the contacts of your memory modules.  From user 
> documentation, such as a maintenance manual that should be available for 
> your system, you might be able to infer which memory module the physical 
> address of 0x200ff12c corresponds to and start by cleaning that module 
> first.  Try to strip the system as much as possible and e.g. run with a 
> single known-good memory module only (or whatever number of modules is the 
> minimum).  Run any extra system diagnostics if provided by the firmware.
> 
>  It's interesting to note in the log you provided:
> 
> > [     1.169048] Instruction bus error, epc == 00000000004289ac, ra == 000000000047d054
> > [     1.183979] Instruction bus error, epc == 00000000004289ac, ra == 000000000047d054
> > [     1.195707] Instruction bus error, epc == 000000000040448c, ra == 0000000000404440
> > [     1.206829] Instruction bus error, epc == a8000000200ff12c, ra == a800000020104fec
> 
> that the error always happens in the same 4th word (address ending with 
> 0xc) of a 16-byte span.  Which may indeed mean there's an issue with a 
> particular memory module that supplies data for this word (assuming your 
> system has a 128-bit memory controller data bus with 64-bit DRAM modules 
> arranged in pairs and individually supplying data for each half of the bus 
> or suchlike).
> 
>  Then checking and possibly tightening the power supply connection might 
> be a good idea too.  Other connections may be worth checking, e.g. the CPU 
> daughtercard(s) if applicable.  Also any problems with overheating like a 
> loose heatsink, a blocked ventilation shaft and suchlike.  I'd definitely 
> double-check memory first though.
> 
>  If that did not help, then I'd start suspecting your system is faulty. :(
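
To make the quoted observation about the faulting addresses concrete,
here is a minimal C sketch that computes which 32-bit word of a
16-byte span an address falls into.  The helper name word_in_span() is
made up for illustration; the addresses to feed it are the EPC values
from the log above.

```c
#include <stdint.h>

/*
 * Index (0..3) of the 32-bit word that 'addr' selects within its
 * naturally aligned 16-byte span.  Hypothetical helper, not kernel code.
 */
static inline unsigned word_in_span(uint64_t addr)
{
    return (unsigned)(addr & 0xf) >> 2;
}
```

All three distinct EPC values in the log (0x4289ac, 0x40448c and
0xa8000000200ff12c) have a low nibble of 0xc, so the helper returns 3
for each of them, i.e. the 4th word.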

He might run IRIX on it for testing.  Also I think one of the BSDs has
support.

Octane is a close relative of the IP27, which does ECC on anything and
everything, with all addresses fully decoded.  So if software does
something stupid, hardware will notice quickly, though not necessarily
in very obvious ways.

Some of IP27's reactions are a bit unobvious though.  First, the uncached
address space (CCA 2) works differently than one might think.  IP27 uses
the R10000's uncached attribute feature, which subdivides the CPU's
uncached XKPHYS address space into four address spaces with the highest
address byte being 0x90, 0x92, 0x94 or 0x96.  The classic uncached
memory access happens with UC=3, that is, the top address byte being
0x96.

Do not use that.  EVER.  It entirely bypasses the CPU's cache coherency
logic.  Due to all the consistency checking between the directory
caches and other involved agents, the memory controller might detect the
inconsistency between cache and memory and send, guess what, a bus
error.

For I/O purposes UC attribute value 1 is used, that is, top byte 0x92.
UC values 0 and 2 allow direct manipulation of the directory caches
and atomic operations without the need to read the line into the CPU.
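
As a rough illustration of how those top bytes come about, here is a
minimal C sketch of the address encoding.  It assumes the R10000
layout of an XKPHYS virtual address: VA[63:62] = 2 selects XKPHYS,
VA[61:59] holds the CCA, VA[58:57] holds the uncached attribute when
CCA is 2, and the physical address is 40 bits wide.  All macro and
function names are made up for this example; they are not the names
the kernel headers use.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed R10000 XKPHYS field layout; names are illustrative only. */
#define XKPHYS_REGION     UINT64_C(2)  /* VA[63:62] = 0b10 selects XKPHYS */
#define XKPHYS_CCA_SHIFT  59           /* VA[61:59] = cache coherency attribute */
#define XKPHYS_UC_SHIFT   57           /* VA[58:57] = uncached attribute */
#define XKPHYS_PA_MASK    ((UINT64_C(1) << 40) - 1)  /* 40-bit physical address */
#define CCA_UNCACHED      2u

/* Build an XKPHYS virtual address from CCA, UC attribute and physical address. */
static inline uint64_t xkphys_addr(unsigned cca, unsigned uc, uint64_t paddr)
{
    assert(cca <= 7 && uc <= 3);
    /* The UC attribute is only meaningful for uncached accesses (CCA 2). */
    assert(cca == CCA_UNCACHED || uc == 0);
    return (XKPHYS_REGION << 62) |
           ((uint64_t)cca << XKPHYS_CCA_SHIFT) |
           ((uint64_t)uc << XKPHYS_UC_SHIFT) |
           (paddr & XKPHYS_PA_MASK);
}

/* Highest address byte, e.g. 0x90, 0x92, 0x94 or 0x96 for CCA 2. */
static inline unsigned xkphys_top_byte(uint64_t vaddr)
{
    return (unsigned)(vaddr >> 56);
}

/* Extract the CCA field from an XKPHYS address. */
static inline unsigned xkphys_cca(uint64_t vaddr)
{
    return (unsigned)(vaddr >> XKPHYS_CCA_SHIFT) & 7u;
}

/* Extract the physical address from an XKPHYS address. */
static inline uint64_t xkphys_paddr(uint64_t vaddr)
{
    return vaddr & XKPHYS_PA_MASK;
}
```

With CCA 2, UC values 0 to 3 yield the top bytes 0x90, 0x92, 0x94 and
0x96.  Under the same assumptions, the last EPC in the log,
0xa8000000200ff12c, decodes as a cached CCA 5 address with physical
address 0x200ff12c.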

So that's what IP27 does.  Not sure how much of this behaviour its
little brother IP30 has copied.

  Ralf




