On Tue, Mar 05, 2019 at 08:11:22PM +0500, Embedded Engineer wrote:
> On Tue, Mar 5, 2019 at 7:58 PM Russell King - ARM Linux admin
> <linux@xxxxxxxxxxxxxxx> wrote:
> >
> > Should've been pool->allocation.  Sorry about that.
>
> No problems, here are the new logs:
>
> https://pastebin.com/dfey3LwB

Thanks - the patch I posted substantially increases the amount of checking
that is done... so not surprisingly we find new forms of corruption:

tegra-ehci 7d004000.usb: pool_alloc_page ehci_qh, 0xac050240 (corrupted)
00000000: a0 02 00 00 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  ....kkkkkkkkkkkk
00000010: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000020: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000030: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000040: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000050: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................

and that corruption occurred _right_ after we allocated the page, memset
the entire page to 0xa7, and wrote the "next" pointers.

Again, similar scenario to the above:

tegra-ehci 7d004000.usb: pool_alloc_page ehci_qtd, 0xac0510c0 (corrupted)
00000000: 20 01 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7   ...............
00000010: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000020: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000030: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000040: e0 01 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000050: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................

which is again right after the page is allocated and initialised.

If we look at the ci_hw_qh case, which is the one originally identified:

tegra-udc 7d000000.usb: pool_alloc_page ci_hw_qh, 0xac056080 (corrupted)
00000000: c0 00 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000010: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000020: 80 00 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000030: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................

Again, we have just allocated the coherent DMA page, memset() it and
written the offsets to it, and it is already corrupted.

Tegra124 does not appear to be dma-coherent, so these allocations will
be for normal, uncached memory.  That means the cache won't be loading
entire cachelines at a time from memory for these accesses, but will be
reading them byte by byte as we print the hex values.  The window for
this corruption occurring is now very small.

Right now, I don't have anything further to add beyond what I've already
suggested as causes - this is *definitely* memory corruption, caused
either by something else writing to memory, by the CPU's writes not
being properly stored in RAM, or by the CPU not being able to reliably
read data back from RAM.

I wonder whether any of the memory testers run with normal, uncached
memory.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up