Re: IP30: SMP Help

Joshua Kinard <kumba@xxxxxxxxxx> · Tue, 18 Nov 2014 07:37:48 -0500

On 11/18/2014 05:05, Maciej W. Rozycki wrote:
> On Tue, 18 Nov 2014, Joshua Kinard wrote:
> 
>> What is wrong with these stack addresses?  This is the result of disabling CPU1
>> in the PROM and booting an SMP kernel.  It's like both the low 32-bits and high
>> 32-bits of the data in the CPU registers are getting merged together somehow
>> when they're added to the stack.
>>
>> I can't think of anything in Octane's code doing this.  Has anyone seen
>> something like this before?  This is likely the cause of the SIGSEGV/SIGBUS
>> signals I keep getting.
>>
>> CPU: 0 PID: 54 Comm: grep Not tainted 3.18.0-rc4 #194
>> task: a800000059b80000 ti: a8000000595c4000 task.ti: a8000000595c4000
>> $ 0   : 0000000000000000 ffffffff9004fce0 ffffffffffffffff ffffffffffffffff
>> $ 4   : 0000000077d809a0 0000000000000000 ffffffffffffffff ffffffffffffffff
>> $ 8   : 0000000000440f14 0000000000439bdc 0000000000000058 0000000000000000
>> $12   : 0000000000000000 a800000059b88aa0 a8000000200d22d0 001c450400000018
>> $16   : 0000000077d809a0 0000000077e3f000 0000000000000000 0000000000000000
>> $20   : 0000000077d803b4 0000000000000001 0000000077d82604 000000007ff808f8
>> $24   : 0018460400000000 0000000077c7aa90
>> $28   : 0000000077d88e10 000000007ff807c0 000000007ff807c0 0000000077c68bdc
>> Hi    : 0000000000061170
>> Lo    : 00000000000205d0
>> epc   : 0000000077c7ab00 0x77c7ab00
>>     Not tainted
>> ra    : 0000000077c68bdc 0x77c68bdc
>> Status: 8004fcf3    KX SX UX KERNEL EXL IE
>> Cause : 00000018
>> PrId  : 00000f24 (R14000)
>> Process grep (pid: 54, threadinfo=a8000000595c4000, task=a800000059b80000, tls=0000000077e46490)
>> Stack : 0000000000000000 77d88e1077d809a0 77d88e1077d809a0 77e3f00077d809a0
>>         7ff807e877c68bdc 0000000000000000 0000000000000009 77d88e1000000001
>>         0000000000000000 0000000200000000 7ff808700041e7a4 0000000000000000
>>         0000003d202fbf00 0043f0d07ff80830 0000003d00000003 00000004004134f4
>>         77d88e1000000000 0000000300000004 0000000200000000 0043f0d000000001
>>         0000000200000003 0000000477c34698 0043000000424fd0 000000027ff808f8
>>         77d88e1077c29488 000000007ff80ae4 0000000200000000 7ff80fc600430000
>>         00424fd000000002 7ff808b077c34748 000000027ff80fc6 0043000000424fd0
>>         77d88e107ff808f8 77d8012800403788 0000000077d5638c 0000000077e4028c
>>         000000007ff808e8 0000000000000000 0043f0d077e101dc 0000000100000000
>>         ...
>> Call Trace:
>>  (Bad stack address)
>>
>> Code: 30420040  5040000a  82020046  <03c0e821>  8f998750  00002821  8fbf0024  02002021  8fbe0020
> 
>  Is `grep' a 64-bit (n64 or n32) process?  If no then, there is nothing 
> wrong here, 32-bit (o32) processes will store registers on the stack as 
> 32-bit quantities.  I doubt that has anything to do with SIGSEGV/SIGBUS.
> 
>  There is definitely something wrong here though, the contents of 
> registers include pointers to the kernel-only XKPHYS memory segment ($13 
> and $14) that shouldn't have leaked from the kernel, so it looks to me 
> like the user context isn't handled correctly.  Of course any attempt to 
> dereference these pointers will cause an exception and in the response the 
> process will be treated with an appropriate signal, and, usually, killed.
> 
>   Maciej

This is an o32 userland.  So that means, given 64-bit wide registers, o32 is
going to stuff two 32-bit quantities into them?  I have an n32 chroot on a
different partition, but I haven't tried it w/ CONFIG_SMP yet.
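
One way to read the dump in light of Maciej's point, as a sketch only: the
64-bit kernel prints the stack in 64-bit words, so each printed word would
cover two adjacent 32-bit slots written by the o32 process. The helper below
is hypothetical (not kernel code) and assumes a big-endian machine, as IP30
is, so the upper half of each word is the slot at the lower address.

```c
#include <stdint.h>

/* One 64-bit word of the printed stack dump, viewed as the two
 * consecutive 32-bit slots an o32 process would have stored. */
struct o32_pair {
    uint32_t first;   /* slot at the lower address (upper half, big-endian) */
    uint32_t second;  /* slot at the next address (lower half, big-endian) */
};

/* Hypothetical helper: split a dump word into its two 32-bit slots. */
static struct o32_pair split_dump_word(uint64_t word)
{
    struct o32_pair p;
    p.first  = (uint32_t)(word >> 32);
    p.second = (uint32_t)(word & 0xffffffffu);
    return p;
}
```

Applied to the second dump word, 77d88e1077d809a0, this gives 0x77d88e10 and
0x77d809a0, which match the low 32 bits of $28 and $4 in the register dump.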

Of the two XKPHYS addresses, a8000000200d22d0 points directly at SyS_munmap.
A couple of other crashes pointed at compat_SyS_fcntl64, as well as a few other
addresses in XKPHYS that I couldn't find a specific function for in System.map.
Seems to be random leakage.
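
To spot this kind of leakage in a register dump, a rough sketch of an XKPHYS
check follows. It relies on the MIPS64 address layout, where bits 63:62 equal
to binary 10 select XKPHYS and bits 61:59 hold the cache coherency attribute.
The function names are made up for illustration and are not kernel APIs.

```c
#include <stdint.h>

/* XKPHYS is the segment whose top two address bits are binary 10. */
static int is_xkphys(uint64_t addr)
{
    return (addr >> 62) == 2;
}

/* Cache coherency attribute encoded in bits 61:59 of an XKPHYS address. */
static unsigned xkphys_cca(uint64_t addr)
{
    return (unsigned)((addr >> 59) & 0x7);
}

/* Remaining low bits: the physical address the XKPHYS pointer refers to. */
static uint64_t xkphys_offset(uint64_t addr)
{
    return addr & ((UINT64_C(1) << 59) - 1);
}
```

With the dump above, is_xkphys() is true for $13 (a800000059b88aa0) and $14
(a8000000200d22d0) and false for the user pointers such as 0000000077d809a0;
for $14, xkphys_offset() yields 0x200d22d0.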

I thought it might be improper use of spinlocks (w/ & w/o irqsave/irqrestore)
in the IRQ code, but I commented out all spinlocks in the core IP30 code, then
after still triggering fatal crashes, commented out all of the spinlocks in
IOC3 (evil driver) and Impact (video driver).  For a while, I couldn't crash
the kernel until I uncommented Impact's spinlocks, but that looks like a fluke:
after commenting them out again, it still crashed.  I'll
probably swap in the Odyssey board in the next day or so and see if that
exhibits similar problems, just to rule out the framebuffer code/drivers.  Then
pull more memory.  Then my hair...

--J