On Wed, Apr 29, 2009 at 10:33:49AM +0200, Ralf Baechle wrote:
> On Wed, Apr 29, 2009 at 10:25:56AM +0200, Manuel Lauss wrote:
>
> > > > (gdb) disass 0x8042f0f8
> > > > Dump of assembler code for function futex_init:
> > > > 0x8042f0dc <futex_init+0>:      lw      v1,20(gp)
> > > > 0x8042f0e0 <futex_init+4>:      addiu   v1,v1,1
> > > > 0x8042f0e4 <futex_init+8>:      sw      v1,20(gp)
> > > > 0x8042f0e8 <futex_init+12>:     lw      v0,24(gp)
> > > > 0x8042f0ec <futex_init+16>:     andi    v0,v0,0x4
> > > > 0x8042f0f0 <futex_init+20>:     bnez    v0,0x8042f114 <futex_init+56>
> > > > 0x8042f0f4 <futex_init+24>:     li      a0,-14
> > > > 0x8042f0f8 <futex_init+28>:     ll      a0,0(v0)
> > >
> > > So this is in futex_atomic_cmpxchg_inatomic which has been inlined into
> > > futex_init.  The epc is pointing to this LL instruction which is a
> > > legitimate MIPS32 instruction, so a reserved instruction exception does
> > > not make sense.  However, a NULL pointer has intentionally been passed
> > > as the argument here, so this LL instruction will take a TLB exception
> > > and do_page_fault() will change the EPC to point to the fixup
> > > handler, which in the sources is these lines:
> > >
> > >     "       .section .fixup,\"ax\"                          \n"
> > >     "4:     li      %0, %5                                  \n"
> > >     "       j       3b                                      \n"
> > >     "       .previous                                       \n"
> > >     "       .section __ex_table,\"a\"                       \n"
> > >     "       "__UA_ADDR "\t1b, 4b                            \n"
> > >     "       "__UA_ADDR "\t2b, 4b                            \n"
> > >     "       .previous                                       \n"
> > >
> > > That's how it normally should function.  If however in the exception
> > > handler something goes wrong while c0_status.exl is still set, the c0_epc
> > > register won't be updated for the 2nd exception, which is that reserved
> > > instruction exception.  This sort of bug can be ugly to chase, I'm afraid.
> >
> > Thanks for this info!  In other words, this oops is actually the result of
> > another earlier problem, which trashes something used by the tlb fault
> > handler?  (I've also seen this oops as a "kernel unaligned access" with epc
> > at the 'll'.  Also, isn't it a problem that a0 is -14 instead of zero?).
>
> No - it will be overwritten either after the load succeeded or in the
> fixup handler.  The load of the -14 value, which is from __access_(),
> happens to be in a branch delay slot of a branch which will never be
> executed - but that's as far as gcc knows how to optimize the access_ok()
> invocation away.
>
> When did this issue start?  I wonder if it was when you removed the Alchemy
> hazard barriers?

No; it started shortly after 2.6.30 was opened and I added TSC2007 support
to my board.  I don't see it on the Db1200, only on systems at work.  I
suspect it's parts of the board code which trigger it; I just can't figure
out which (i.e. it's just like any other board code, with lots of platform
device and resource structs spread over a few files).

I've been running kernels with the hazard barriers removed since before
2.6.28 came out and never ran into problems before.  I don't think they're
responsible.  But I'll revert them and keep looking for the real reason ;-)

Thanks!
        Manuel Lauss
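
For illustration only, the __ex_table lookup discussed in this thread can be
modelled in user space roughly as below.  This is a simplified sketch, not the
kernel's code: struct ex_entry and find_fixup() are hypothetical names, and a
plain linear scan is assumed here where the kernel searches a sorted table.
Each entry pairs the address of an instruction that may fault (labels 1b and
2b in the quoted asm) with the address to resume at (label 4b).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one __ex_table entry (not the kernel's struct). */
struct ex_entry {
	uintptr_t insn;   /* address of an instruction that may fault */
	uintptr_t fixup;  /* address to continue at if it does fault  */
};

/*
 * Return the fixup address registered for the faulting EPC, or 0 if the
 * table has no entry for it.  A fault handler modelled on this would set
 * EPC to the returned address when it is non-zero, and treat the fault as
 * a genuine kernel bug (oops) otherwise.
 */
uintptr_t find_fixup(const struct ex_entry *tbl, size_t n, uintptr_t epc)
{
	for (size_t i = 0; i < n; i++) {
		if (tbl[i].insn == epc)
			return tbl[i].fixup;
	}
	return 0;
}
```

With a table holding an entry for the 'll' at 0x8042f0f8, a fault at that
address yields the fixup address, while a fault at any unlisted address
yields 0.  The fixup address itself is not shown in the disassembly above, so
any value used to try this out is made up.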