Re: oops in futex_init()

Manuel Lauss <mano@xxxxxxxxxxxxxxxxxxxxxxx> · Wed, 29 Apr 2009 13:40:42 +0200

On Wed, Apr 29, 2009 at 10:33:49AM +0200, Ralf Baechle wrote:
> On Wed, Apr 29, 2009 at 10:25:56AM +0200, Manuel Lauss wrote:
> 
> > > > (gdb) disass 0x8042f0f8
> > > > Dump of assembler code for function futex_init:
> > > > 0x8042f0dc <futex_init+0>:      lw      v1,20(gp)
> > > > 0x8042f0e0 <futex_init+4>:      addiu   v1,v1,1
> > > > 0x8042f0e4 <futex_init+8>:      sw      v1,20(gp)
> > > > 0x8042f0e8 <futex_init+12>:     lw      v0,24(gp)
> > > > 0x8042f0ec <futex_init+16>:     andi    v0,v0,0x4
> > > > 0x8042f0f0 <futex_init+20>:     bnez    v0,0x8042f114 <futex_init+56>
> > > > 0x8042f0f4 <futex_init+24>:     li      a0,-14
> > > > 0x8042f0f8 <futex_init+28>:     ll      a0,0(v0)
> > > 
> > > So this is in futex_atomic_cmpxchg_inatomic which has been inlined into
> > > futex_init.  The epc is pointing to this LL instruction which is a
> > > legitimate MIPS32 instruction, so a reserved instruction exception does
> > > not make sense.  However, a NULL pointer has intentionally been passed
> > > as the argument here, so this LL instruction will take a TLB exception and
> > > do_page_fault() will change the EPC to point to the fixup
> > > handler, which in the sources is these lines:
> > > 
> > >                 "       .section .fixup,\"ax\"                          \n"
> > >                 "4:     li      %0, %5                                  \n"
> > >                 "       j       3b                                      \n"
> > >                 "       .previous                                       \n"
> > >                 "       .section __ex_table,\"a\"                       \n"
> > >                 "       "__UA_ADDR "\t1b, 4b                            \n"
> > >                 "       "__UA_ADDR "\t2b, 4b                            \n"
> > >                 "       .previous                                       \n"
> > > 
> > > That's how it normally should function.  If however in the exception
> > > handler something goes wrong while c0_status.exl is still set the c0_epc
> > > register won't be updated for the 2nd exception which is that reserved
> > > instruction exception.  This sort of bug can be ugly to chase, I'm afraid.
> > 
> > Thanks for this info! In other words, this oops is actually the result of
> > another earlier problem, which trashes something used by the tlb fault
> > handler? (I've also seen this oops as a "kernel unaligned access" with epc
> > at the 'll'.  Also, isn't it a problem that a0 is -14 instead of zero?).
> 
> No - it will be overwritten either after the load succeeded or in the
> fixup handler.  The load of the -14 value, which is from __access_(), happens
> to be in a branch delay slot of a branch which will never be executed - but
> that's as far as gcc knows how to optimize the access_ok() invocation
> away.
> 
> When did this issue start?  I wonder if it was when you removed the Alchemy
> hazard barriers?

No; it started shortly after 2.6.30 was opened and I added TSC2007 support
to my board.  I don't see it on the Db1200, only on systems at work.
I suspect it's parts of the board code which trigger it; I just can't figure
out which (i.e. it's just like any other board code, with lots of platform
device and resource structs spread over a few files).

I've been running kernels with the hazard barriers removed since before
2.6.28 came out and never ran into problems.  I don't think they're
responsible. But I'll revert them and keep looking for the real reason ;-)

Thanks!
	Manuel Lauss
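
For anyone following the thread, here is a rough user-space sketch of the
fixup lookup Ralf describes above: when a load such as the LL faults, the
fault handler searches the exception table for the faulting address and, if
it finds an entry, rewrites the EPC to the fixup label. This is only a model
under stated assumptions. The names ex_entry, find_fixup and handle_fault are
hypothetical and are not the kernel's real types or functions, and the real
table is built by the linker from the __ex_table section rather than passed
around by hand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one __ex_table entry; not the kernel's struct. */
struct ex_entry {
    uintptr_t insn;   /* address of an instruction that may fault (1b, 2b) */
    uintptr_t fixup;  /* address to resume at instead (4b) */
};

/* Return the fixup address registered for pc, or 0 if there is none. */
static uintptr_t find_fixup(const struct ex_entry *tbl, size_t n, uintptr_t pc)
{
    assert(tbl != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (tbl[i].insn == pc)
            return tbl[i].fixup;
    }
    return 0;
}

/*
 * Model of the fault path: if the faulting EPC is covered by the table,
 * redirect it to the fixup code and return 0.  Otherwise leave the EPC
 * alone and return -1, which is where a real kernel would oops.
 */
static int handle_fault(const struct ex_entry *tbl, size_t n, uintptr_t *epc)
{
    uintptr_t fixup = find_fixup(tbl, n, *epc);

    if (fixup == 0)
        return -1;
    *epc = fixup;
    return 0;
}
```

In this model a fault at the LL address from the disassembly (0x8042f0f8)
would be redirected as long as the table holds an entry for it. A fault at
any address without an entry falls through to the error path. The sketch does
not model the c0_status.exl case, where the EPC is not updated for a second
exception.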