Re: Unstable Kernel behavior on an ARM based board

On Tue, Mar 05, 2019 at 08:11:22PM +0500, Embedded Engineer wrote:
> On Tue, Mar 5, 2019 at 7:58 PM Russell King - ARM Linux admin
> <linux@xxxxxxxxxxxxxxx> wrote:
> >
> > Should've been pool->allocation.  Sorry about that.
> 
> No problems, here are the new logs:
> 
> https://pastebin.com/dfey3LwB

Thanks - the patch I posted substantially increases the amount of checking
that is done... so not surprisingly we find new forms of corruption:

tegra-ehci 7d004000.usb: pool_alloc_page ehci_qh, 0xac050240 (corrupted)
00000000: a0 02 00 00 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  ....kkkkkkkkkkkk
00000010: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000020: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000030: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000040: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000050: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................

and that corruption occurred _right_ after we allocated the page, memset
the entire page to 0xa7, and wrote the "next" pointers.

Again, similar scenario to the above:

tegra-ehci 7d004000.usb: pool_alloc_page ehci_qtd, 0xac0510c0 (corrupted)
00000000: 20 01 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7   ...............
00000010: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000020: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000030: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000040: e0 01 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000050: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................

which is again right after the page is allocated and initialised.

If we look at the ci_hw_qh case, which is the one originally identified:

tegra-udc 7d000000.usb: pool_alloc_page ci_hw_qh, 0xac056080 (corrupted)
00000000: c0 00 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000010: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000020: 80 00 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000030: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................

Again, we have just allocated the coherent DMA page, memset() it and
written the offsets to it, and it is already corrupted.  Tegra124 does
not appear to be dma-coherent, so these allocations will be for normal,
uncached memory.  That means these accesses won't be loading entire
cachelines at a time from memory; instead the CPU will be reading the
data byte by byte as we print the hex values.
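
For anyone following along, the initialise-then-verify sequence described
above can be sketched in user space roughly as below.  This is not the
debugging patch from this thread and not the kernel's dma_pool code: the
names init_page() and check_page(), the 0x60 block size mentioned in the
note afterwards, and the use of an ordinary cached buffer are all
assumptions made for illustration.  Only the 0xa7 fill value and the idea
of a "next" offset at the start of each block are taken from the dumps.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define POISON 0xa7	/* fill byte seen in the dumps above */

/*
 * Hypothetical helper: fill the page with POISON, then store each
 * block's "next" offset in the first four bytes of that block.
 */
static void init_page(uint8_t *page, size_t alloc, size_t blk)
{
	assert(blk >= sizeof(uint32_t));

	memset(page, POISON, alloc);
	for (size_t off = 0; off + blk <= alloc; off += blk) {
		uint32_t next = (uint32_t)(off + blk);

		memcpy(page + off, &next, sizeof(next));
	}
}

/*
 * Hypothetical helper: return the offset of the first byte that differs
 * from what init_page() wrote, or -1 if every full block is intact.
 */
static long check_page(const uint8_t *page, size_t alloc, size_t blk)
{
	for (size_t off = 0; off + blk <= alloc; off += blk) {
		uint32_t next;

		memcpy(&next, page + off, sizeof(next));
		if (next != off + blk)
			return (long)off;

		for (size_t i = sizeof(next); i < blk; i++)
			if (page[off + i] != POISON)
				return (long)(off + i);
	}
	return -1;
}
```

Calling init_page(page, 4096, 0x60) followed immediately by
check_page(page, 4096, 0x60) should return -1 on working memory; any other
return value is the offset of the first unexpected byte.  Run against
ordinary cached memory this only demonstrates the logic, it does not
exercise the uncached path discussed here.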

The window for this corruption occurring is now very small.

Right now, I don't have anything further to add beyond what I've
already suggested as causes - this is *definitely* memory corruption,
either by something else writing to memory, by the CPU's writes not
being properly stored in RAM, or by the CPU not being able to reliably
read data back from RAM.

I wonder whether any of the memory testers run with normal, uncached
memory.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up


