On 15-09-22 13:42:03, Peter Chen wrote:
> On Tue, Sep 22, 2015 at 11:52:52AM +0530, maitysanchayan@xxxxxxxxx wrote:
> > On 15-09-22 07:36:01, Peter Chen wrote:
> > > On Mon, Sep 21, 2015 at 06:56:34PM +0530, maitysanchayan@xxxxxxxxx wrote:
> > > > On 15-09-21 14:50:18, Peter Chen wrote:
> > > > > On Fri, Sep 18, 2015 at 04:01:50PM +0530, maitysanchayan@xxxxxxxxx wrote:
> > > > > > On 15-09-18 13:39:11, Peter Chen wrote:
> > > > > > > On Wed, Sep 16, 2015 at 02:48:50PM +0530, maitysanchayan@xxxxxxxxx wrote:
> > > > > > > > On 15-09-16 15:54:21, Peter Chen wrote:
> > > > > > > > > On Wed, Sep 16, 2015 at 02:18:49PM +0530, maitysanchayan@xxxxxxxxx wrote:
> > > > > > > > > > Hello Peter,
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Enable CONFIG_DEBUG_LIST, it has below position if you
> > > > > > > > > > > run make menuconfig
> > > > > > > > > > > Kernel hacking --->
> > > > > > > > > > > [*] Debug linked list manipulation
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Sorry for the delay. When I enabled this config the first time my test
> > > > > > > > > > application ran for 24 hours or so and I did not get any stack traces.
> > > > > > > > > >
> > > > > > > > > > I restarted the test again and finally got the trace below. You were
> > > > > > > > > > spot on, its a list corruption issue. I modified the trace a bit after
> > > > > > > > > > copying to remove the sprinkled debug messages throughout the trace
> > > > > > > > > > from my test application.
> > > > > > > > > >
> > > > > > > > > > [ 622.204134] WARNING: CPU: 0 PID: 0 at lib/list_debug.c:59 __list_del_entry+0xc4/0xe8()
> > > > > > > > > > [ 622.212870] list_del corruption. prev->next should be 8db63600, but was 36008db6
> > > > > > > > > >
> > > > > > > > > You see the higher 16 bits were swapped with lower 16 bits, and the
> > > > > > > > > virtual memory address should begin from 0x8xxxxxxxx, right?
> > > > > > > >
> > > > > > > > Yes, I saw that but beats me how this happens.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Check with Vybrid errata to see if all ARM/memory system have applied.
> > > > > > > >
> > > > > > > > What do you mean by "all ARM/memory system have applied" ? I checked with the Vybrid errata
> > > > > > > > and I do not see anything related.
> > > > > > > >
> > > > > > >
> > > > > > > Just system level errata, like ARM Cortex A5, memory (L1/L2 Cache), etc.
> > > > > > >
> > > > > > > Would you please do more tests to see if the error pattern is always
> > > > > > > the same?
> > > > > >
> > > > > > I got more or less the same logs as below the last five times I tried today
> > > > > > and this time I got the crashes quickly enough somehow. Did not have to wait
> > > > > > for more than half an hour.
> > > > > >
> > > > > > > And print the address to store prev-next.
> > > > > >
> > > > > > Isn't that what's given by list_del corruption info?
> > > > >
> > > > > It only prints the content of prev->next, not without the address of
> > > > > prev->next, I just want to make sure this address is dword aligned.
> > > >
> > > > Ok.
> > > >
> > > > > >
> > > > > > [ 476.880749] list_del corruption. prev->next should be 8daf74c0, but was 74c08daf
> > > > > >
> > > > > > >
> > > > > > > Interesting that atleast one more person Felipe Tonello sees the same issue.
> > > > > >
> > > > > > Felipe mentions a DMA issue, I saw a DMA error message from ci_hdrc once in the
> > > > > > last five times I tried but mistakenly I did not take that one down. The message
> > > > > > was something along the lines "ci_hdrc: ci_hdrc bad dma alloc" or similar.
> > > > >
> > > > > Make sure you really see dma_pool_alloc fail or not, it may not the same
> > > > > problem
> > > >
> > > > That message was exactly
> > > >
> > > > [ 1186.114496] ci_hdrc ci_hdrc.0: dma_pool_free ci_hw_td, (null)/8d3c1e6c (bad dma)
> > > >
> > >
> > > Does above message occur just close to linked list corruption?
> > > Or it is during the correct transfer process?
> >
> > Just before the NULL pointer dereference.
> >
> > [ 1185.863281] WARNING: CPU: 0 PID: 240 at lib/list_debug.c:59 __list_del_entry+0xc4/0xe8()
> > [ 1185.871377] list_del corruption. prev->next should be 8ac25b80, but was 8ac25440
> > [ 1185.878776] Modules linked in:
> > [ 1185.881849] CPU: 0 PID: 240 Comm: lxpanel Tainted: G W 4.1.5-00004-g326879d #327
> > [ 1185.890373] Hardware name: Freescale Vybrid VF5xx/VF6xx (Device Tree)
> > [ 1185.896818] Backtrace:
> > [ 1185.899295] [<80012b78>] (dump_backtrace) from [<80012d98>] (show_stack+0x18/0x1c)
> > [ 1185.906870] r7:802a5ff4 r6:0000003b r5:00000009 r4:00000000
> > [ 1185.912607] [<80012d80>] (show_stack) from [<80590990>] (dump_stack+0x24/0x28)
> > [ 1185.919850] [<8059096c>] (dump_stack) from [<80023e24>] (warn_slowpath_common+0x88/0xb4)
> > [ 1185.927953] [<80023d9c>] (warn_slowpath_common) from [<80023e88>] (warn_slowpath_fmt+0x38/0x40)
> > [ 1185.936654] r8:8daeda3c r7:8e02f6e8 r6:8daeda00 r5:00000008 r4:80704238
> > [ 1185.943444] [<80023e54>] (warn_slowpath_fmt) from [<802a5ff4>] (__list_del_entry+0xc4/0xe8)
> > [ 1185.951798] r3:8ac25b80 r2:80704238
> > [ 1185.955399] r4:8daeda3c
> > [ 1185.957957] [<802a5f30>] (__list_del_entry) from [<803a0740>] (udc_irq+0x3d8/0xcdc)
> > [ 1185.965628] [<803a0368>] (udc_irq) from [<8039d6bc>] (ci_irq+0x58/0x11c)
> > [ 1185.972331] r10:807ddefe r9:8e0bd480 r8:00000027 r7:00000000 r6:00000000 r5:807bd6c4
> > [ 1185.980245] r4:8e02f010
> > [ 1185.982805] [<8039d664>] (ci_irq) from [<8004d670>] (handle_irq_event_percpu+0x80/0x148)
> > [ 1185.990894] r5:807bd6c4 r4:8e2d0880
> > [ 1185.994512] [<8004d5f0>] (handle_irq_event_percpu) from [<8004d768>] (handle_irq_event+0x30/0x40)
> > [ 1186.003382] r10:00000023 r9:00000020 r8:8e006000 r7:00000000 r6:00000000 r5:807bd6c4
> > [ 1186.011296] r4:8e0bd480
> > [ 1186.013856] [<8004d738>] (handle_irq_event) from [<8004fce0>] (handle_fasteoi_irq+0xa4/0x16c)
> > [ 1186.022379] r5:807bd6c4 r4:8e0bd480
> > [ 1186.025997] [<8004fc3c>] (handle_fasteoi_irq) from [<8004cdd8>] (generic_handle_irq+0x34/0x44)
> > [ 1186.034606] r5:00000027 r4:00000027
> > [ 1186.038224] [<8004cda4>] (generic_handle_irq) from [<8004d03c>] (__handle_domain_irq+0x5c/0xb0)
> > [ 1186.046921] r5:00000027 r4:807bd4fc
> > [ 1186.050536] [<8004cfe0>] (__handle_domain_irq) from [<80009364>] (gic_handle_irq+0x2c/0x5c)
> > [ 1186.058888] r9:00000020 r8:10c5387d r7:90002100 r6:8d28bfb0 r5:807ac364 r4:9000210c
> > [ 1186.066727] [<80009338>] (gic_handle_irq) from [<80013ac8>] (__irq_usr+0x48/0x60)
> > [ 1186.074217] Exception stack(0x8d28bfb0 to 0x8d28bff8)
> > [ 1186.079282] bfa0: 00000033 00000000 01c85a60 00000000
> > [ 1186.087470] bfc0: 01d8a980 00000018 00000020 01c85648 00000003 00000020 00000023 00000033
> > [ 1186.095660] bfe0: 00000001 7e930880 76849fac 7684a088 80070010 ffffffff
> > [ 1186.102280] r7:10c5387d r6:ffffffff r5:80070010 r4:7684a088
> > [ 1186.108000] ---[ end trace f2242ccc35feca1b ]---
> > [ 1186.114496] ci_hdrc ci_hdrc.0: dma_pool_free ci_hw_td, (null)/8d3c1e6c (bad dma)
> > [ 1186.122150] Unable to handle kernel NULL pointer dereference at virtual address 00000000
> >
>
> I suspect it is memory (including cache) unstable problem, not
> the controller problem or software problem.

But in that case it should have been reproducible with a Linux setup as
well. And I cannot reproduce it on a Linux setup. I have the same scripts
running the same tests since morning but this time using the USB Client
Ethernet connection to my Linux machine and nothing yet.

I also tried disabling the PL310 errata Stefan referred.
No change in above behaviour. I will keep digging.

- Sanchayan.
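
To tell the two corruption patterns in this thread apart, one can test whether the bad prev->next value is simply the expected pointer with its upper and lower 16 bits exchanged. The following is only a minimal Python sketch for checking log lines by hand. The helper names are hypothetical, and the value pairs are copied from the list_del corruption messages quoted above.

```python
def swap_halfwords(value: int) -> int:
    """Return the 32-bit value with its upper and lower 16 bits exchanged."""
    value &= 0xFFFFFFFF
    return ((value << 16) | (value >> 16)) & 0xFFFFFFFF


def is_halfword_swapped(expected: int, observed: int) -> bool:
    """True if observed is expected with its 16-bit halves swapped."""
    return observed != expected and observed == swap_halfwords(expected)


# (expected, observed) pairs taken from the "list_del corruption" lines
# in this thread; hypothetical container name, real values.
REPORTS = [
    (0x8DB63600, 0x36008DB6),
    (0x8DAF74C0, 0x74C08DAF),
    (0x8AC25B80, 0x8AC25440),
]

for expected, observed in REPORTS:
    if is_halfword_swapped(expected, observed):
        kind = "halfword swap"
    else:
        kind = "other corruption"
    print(f"{expected:08x} -> {observed:08x}: {kind}")
```

The first two reports match the halfword-swap pattern. The third (8ac25b80 vs 8ac25440) does not, so it may be a different failure or a stale pointer rather than the same swap.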