Re: aarch64 Kernel Panic Asynchronous SError Interrupt on large file IO

Robin Murphy <robin.murphy@xxxxxxx> · Sun, 6 Oct 2019 00:45:23 +0100

On 2019-08-19 11:43 am, Will Deacon wrote:
On Mon, Aug 19, 2019 at 11:07:14AM +0100, Catalin Marinas wrote:
On Sat, Aug 17, 2019 at 03:12:41PM +0200, Philipp Richter wrote:
I added "memtest=4" to the kernel cmdline and I'm getting very quickly
an "Internal error: synchronous external abort" panic.
[...]
[    0.000000] early_memtest: # of tests: 4
[    0.000000]   0x0000000000200000 - 0x0000000002080000 pattern aaaaaaaaaaaaaaaa
[    0.000000]   0x0000000003a95000 - 0x00000000f8400000 pattern aaaaaaaaaaaaaaaa
[    0.000000] Internal error: synchronous external abort: 96000210 [#1] SMP

At least it's a synchronous error ;).
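
For reference, the syndrome value 96000210 in the panic line can be split into its architectural fields. The following is a minimal sketch, not kernel code: the helper name decode_esr is hypothetical, and the field positions are those of the Armv8-A ESR_ELx layout (EC in bits [31:26], IL in bit 25, ISS in bits [24:0], DFSC in ISS bits [5:0]).

```python
def decode_esr(esr: int) -> dict:
    """Split a 32-bit ESR_ELx value into its top-level fields.

    decode_esr is a hypothetical helper for illustration only.
    """
    return {
        "ec": (esr >> 26) & 0x3F,   # exception class, bits [31:26]
        "il": (esr >> 25) & 0x1,    # instruction length bit, bit 25
        "iss": esr & 0x1FFFFFF,     # instruction specific syndrome, bits [24:0]
        "dfsc": esr & 0x3F,         # data fault status code, ISS bits [5:0]
    }


fields = decode_esr(0x96000210)
for name, value in fields.items():
    print(f"{name} = {value:#x}")
```

EC 0x25 is a data abort taken without a change in exception level, and DFSC 0x10 is a synchronous external abort that is not on a translation table walk, which matches the "synchronous external abort" wording of the panic.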

[    0.000000] pc : early_memtest+0x16c/0x23c
[...]
[    0.000000] Code: d2800002 d2800001 eb0400bf 54000309 (f9400080)

decodecode says:

    0:   d2800002        mov     x2, #0x0                        // #0
    4:   d2800001        mov     x1, #0x0                        // #0
    8:   eb0400bf        cmp     x5, x4
    c:   54000309        b.ls    0x6c  // b.plast
   10:*  f9400080        ldr     x0, [x4]                <-- trapping instruction

I guess that's the read of *p in memtest(). Writing *p probably
generates asynchronous errors, if you haven't seen them yet.

Is my board completely broken ? :(

One possibility is that you don't have any memory where you think there
is, so the mapping just doesn't translate to any valid physical
location.

Can you add some printk(addr) in do_sea() to see if it always faults on
the same address?

Alternatively, just run it a few more times and see if the register dump
changes. Currently we've got:

[    0.000000] x5 : ffff8000f8400000 x4 : ffff800008400000
[    0.000000] x3 : 0000000008400000 x2 : 0000000000000000
[    0.000000] x1 : 0000000000000000 x0 : aaaaaaaaaaaaaaaa

so I'd guess that x3 is the faulting pa. The faulting (linear) VAs in the
original report were 0xffff800009c74aa8 and 0xffff800009c08390, which is
still a way way off from this one :/
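
The relationship between x4 and x3 above can be checked with a little arithmetic. This is only a sketch under two assumptions: that the linear map starts at 0xffff800000000000 (inferred from x4 minus x3 in the register dump, not read from the kernel config), and that RAM starts at physical address 0x0. The names PAGE_OFFSET, PHYS_OFFSET and linear_va_to_pa are used here for illustration.

```python
# Assumed values, inferred from the register dump (x4 - x3), not from a config.
PAGE_OFFSET = 0xFFFF_8000_0000_0000  # assumed start of the linear map
PHYS_OFFSET = 0x0                    # assumed start of RAM


def linear_va_to_pa(va: int) -> int:
    """Translate a linear-map virtual address to a physical address."""
    if va < PAGE_OFFSET:
        raise ValueError(f"{va:#x} is below the assumed linear map")
    return va - PAGE_OFFSET + PHYS_OFFSET


# x4 from the memtest dump, then the two VAs from the original report.
for va in (0xFFFF800008400000, 0xFFFF800009C74AA8, 0xFFFF800009C08390):
    print(f"va {va:#018x} -> pa {linear_va_to_pa(va):#x}")
```

Under those assumptions x4 translates to 0x8400000, the value held in x3, and the two earlier addresses translate to 0x9c74aa8 and 0x9c08390.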

Looking at the TRM for the rk3328, there's 4GB of RAM starting at pa 0x0,
so maybe some of it has been configured as secure or the memory controller
hasn't been properly initialised?

FWIW I've noticed my RK3399 board doing this too, now that I've started
using it in anger. I'm using a hacky firmware comprising upstream U-Boot
munged with the Rockchip miniloader and downstream Trusted Firmware
binaries, and it looks like that mismatch is the root of this problem.
Booting a different image based on the BSP U-Boot shows that that's
passing a memory node with the range 0x8400000-0x9600000 entirely carved
out, so this is presumably claimed by the secure firmware/TEE and set to
abort Non-Secure accesses.
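
As a rough illustration of how that carve-out lines up with the memtest fault quoted earlier, here is a small check. It is a sketch that assumes "0x8400000-0x9600000" means a start and an exclusive end address; if the notation is really base and size, the bounds would differ. The constant and function names are hypothetical.

```python
# Assumed interpretation: start address and exclusive end address.
CARVEOUT_START = 0x8400000
CARVEOUT_END = 0x9600000


def in_carveout(pa: int) -> bool:
    """Return True if pa falls inside the assumed carved-out range."""
    return CARVEOUT_START <= pa < CARVEOUT_END


memtest_fault_pa = 0x8400000    # x3 in the quoted register dump
memtest_region_start = 0x3A95000  # start of the region being tested

print(in_carveout(memtest_fault_pa))      # first byte of the range
print(in_carveout(memtest_region_start))  # well below the range
```

With that reading, the memtest read traps on exactly the first address of the carved-out range, after scanning up from 0x3a95000.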

Robin.

_______________________________________________
Linux-rockchip mailing list
Linux-rockchip@xxxxxxxxxxxxxxxxxxx
http://lists.infradead.org/mailman/listinfo/linux-rockchip