Re: USB client crash on Vybrid with USB gadget RNDIS connection

Peter Chen <peter.chen@xxxxxxxxxxxxx> · Tue, 22 Sep 2015 13:42:03 +0800

On Tue, Sep 22, 2015 at 11:52:52AM +0530, maitysanchayan@xxxxxxxxx wrote:
> On 15-09-22 07:36:01, Peter Chen wrote:
> > On Mon, Sep 21, 2015 at 06:56:34PM +0530, maitysanchayan@xxxxxxxxx wrote:
> > > On 15-09-21 14:50:18, Peter Chen wrote:
> > > > On Fri, Sep 18, 2015 at 04:01:50PM +0530, maitysanchayan@xxxxxxxxx wrote:
> > > > > On 15-09-18 13:39:11, Peter Chen wrote:
> > > > > > On Wed, Sep 16, 2015 at 02:48:50PM +0530, maitysanchayan@xxxxxxxxx wrote:
> > > > > > > On 15-09-16 15:54:21, Peter Chen wrote:
> > > > > > > > On Wed, Sep 16, 2015 at 02:18:49PM +0530, maitysanchayan@xxxxxxxxx wrote:
> > > > > > > > > Hello Peter,
> > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > Enable CONFIG_DEBUG_LIST, it has below position if you
> > > > > > > > > > run make menuconfig
> > > > > > > > > > Kernel hacking  --->
> > > > > > > > > > [*] Debug linked list manipulation  
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Sorry for the delay. When I enabled this config the first time my test
> > > > > > > > > application ran for 24 hours or so and I did not get any stack traces.
> > > > > > > > > 
> > > > > > > > > I restarted the test again and finally got the trace below. You were
> > > > > > > > > > spot on, it's a list corruption issue. I modified the trace a bit after
> > > > > > > > > copying to remove the sprinkled debug messages throughout the trace
> > > > > > > > > from my test application.
> > > > > > > > > 
> > > > > > > > > [  622.204134] WARNING: CPU: 0 PID: 0 at lib/list_debug.c:59 __list_del_entry+0xc4/0xe8()
> > > > > > > > > [  622.212870] list_del corruption. prev->next should be 8db63600, but was 36008db6
> > > > > > > > 
> > > > > > > > You see the higher 16 bits were swapped with lower 16 bits, and the
> > > > > > > > > virtual memory address should begin from 0x8xxxxxxx, right?
> > > > > > > 
> > > > > > > Yes, I saw that but beats me how this happens.
> > > > > > > 
> > > > > > > > 
> > > > > > > > Check with Vybrid errata to see if all ARM/memory system have applied.
> > > > > > > 
> > > > > > > What do you mean by "all ARM/memory system have applied" ? I checked with the Vybrid errata
> > > > > > > and I do not see anything related.
> > > > > > > 
> > > > > > 
> > > > > > Just system level errata, like ARM Cortex A5, memory (L1/L2 Cache), etc.
> > > > > > 
> > > > > > Would you please do more tests to see if the error pattern is always
> > > > > > the same?
> > > > > 
> > > > > I got more or less the same logs as below the last five times I tried today
> > > > > and this time I got the crashes quickly enough somehow. Did not have to wait
> > > > > for more than half an hour.
> > > > > 
> > > > > > > And print the address at which prev->next is stored.
> > > > > 
> > > > > Isn't that what's given by list_del corruption info?
> > > > 
> > > > > It only prints the content of prev->next, not the address of
> > > > > prev->next. I just want to make sure this address is dword aligned.
> > > 
> > > Ok.
> > > 
> > > > 
> > > > [  476.880749] list_del corruption. prev->next should be 8daf74c0, but was 74c08daf
> > > > 
> > > > > 
> > > > > > Interesting that at least one more person, Felipe Tonello, sees the same issue.
> > > > > 
> > > > > Felipe mentions a DMA issue, I saw a DMA error message from ci_hdrc once in the
> > > > > last five times I tried but mistakenly I did not take that one down. The message
> > > > > was something along the lines "ci_hdrc: ci_hdrc bad dma alloc" or similar.
> > > > 
> > > > > Check whether dma_pool_alloc really fails or not; it may not be the
> > > > > same problem.
> > > 
> > > That message was exactly
> > > 
> > > [ 1186.114496] ci_hdrc ci_hdrc.0: dma_pool_free ci_hw_td,   (null)/8d3c1e6c (bad dma)
> > > 
> > 
> > > Does the above message occur close to the linked list corruption,
> > > or during a normal transfer?
> 
> Just before the NULL pointer dereference.
> 
> [ 1185.863281] WARNING: CPU: 0 PID: 240 at lib/list_debug.c:59 __list_del_entry+0xc4/0xe8()
> [ 1185.871377] list_del corruption. prev->next should be 8ac25b80, but was 8ac25440
> [ 1185.878776] Modules linked in:
> [ 1185.881849] CPU: 0 PID: 240 Comm: lxpanel Tainted: G        W       4.1.5-00004-g326879d #327
> [ 1185.890373] Hardware name: Freescale Vybrid VF5xx/VF6xx (Device Tree)
> [ 1185.896818] Backtrace:
> [ 1185.899295] [<80012b78>] (dump_backtrace) from [<80012d98>] (show_stack+0x18/0x1c)
> [ 1185.906870]  r7:802a5ff4 r6:0000003b r5:00000009 r4:00000000
> [ 1185.912607] [<80012d80>] (show_stack) from [<80590990>] (dump_stack+0x24/0x28)
> [ 1185.919850] [<8059096c>] (dump_stack) from [<80023e24>] (warn_slowpath_common+0x88/0xb4)
> [ 1185.927953] [<80023d9c>] (warn_slowpath_common) from [<80023e88>] (warn_slowpath_fmt+0x38/0x40)
> [ 1185.936654]  r8:8daeda3c r7:8e02f6e8 r6:8daeda00 r5:00000008 r4:80704238
> [ 1185.943444] [<80023e54>] (warn_slowpath_fmt) from [<802a5ff4>] (__list_del_entry+0xc4/0xe8)
> [ 1185.951798]  r3:8ac25b80 r2:80704238
> [ 1185.955399]  r4:8daeda3c
> [ 1185.957957] [<802a5f30>] (__list_del_entry) from [<803a0740>] (udc_irq+0x3d8/0xcdc)
> [ 1185.965628] [<803a0368>] (udc_irq) from [<8039d6bc>] (ci_irq+0x58/0x11c)
> [ 1185.972331]  r10:807ddefe r9:8e0bd480 r8:00000027 r7:00000000 r6:00000000 r5:807bd6c4
> [ 1185.980245]  r4:8e02f010
> [ 1185.982805] [<8039d664>] (ci_irq) from [<8004d670>] (handle_irq_event_percpu+0x80/0x148)
> [ 1185.990894]  r5:807bd6c4 r4:8e2d0880
> [ 1185.994512] [<8004d5f0>] (handle_irq_event_percpu) from [<8004d768>] (handle_irq_event+0x30/0x40)
> [ 1186.003382]  r10:00000023 r9:00000020 r8:8e006000 r7:00000000 r6:00000000 r5:807bd6c4
> [ 1186.011296]  r4:8e0bd480
> [ 1186.013856] [<8004d738>] (handle_irq_event) from [<8004fce0>] (handle_fasteoi_irq+0xa4/0x16c)
> [ 1186.022379]  r5:807bd6c4 r4:8e0bd480
> [ 1186.025997] [<8004fc3c>] (handle_fasteoi_irq) from [<8004cdd8>] (generic_handle_irq+0x34/0x44)
> [ 1186.034606]  r5:00000027 r4:00000027
> [ 1186.038224] [<8004cda4>] (generic_handle_irq) from [<8004d03c>] (__handle_domain_irq+0x5c/0xb0)
> [ 1186.046921]  r5:00000027 r4:807bd4fc
> [ 1186.050536] [<8004cfe0>] (__handle_domain_irq) from [<80009364>] (gic_handle_irq+0x2c/0x5c)
> [ 1186.058888]  r9:00000020 r8:10c5387d r7:90002100 r6:8d28bfb0 r5:807ac364 r4:9000210c
> [ 1186.066727] [<80009338>] (gic_handle_irq) from [<80013ac8>] (__irq_usr+0x48/0x60)
> [ 1186.074217] Exception stack(0x8d28bfb0 to 0x8d28bff8)
> [ 1186.079282] bfa0:                                     00000033 00000000 01c85a60 00000000
> [ 1186.087470] bfc0: 01d8a980 00000018 00000020 01c85648 00000003 00000020 00000023 00000033
> [ 1186.095660] bfe0: 00000001 7e930880 76849fac 7684a088 80070010 ffffffff
> [ 1186.102280]  r7:10c5387d r6:ffffffff r5:80070010 r4:7684a088
> [ 1186.108000] ---[ end trace f2242ccc35feca1b ]---
> [ 1186.114496] ci_hdrc ci_hdrc.0: dma_pool_free ci_hw_td,   (null)/8d3c1e6c (bad dma)
> [ 1186.122150] Unable to handle kernel NULL pointer dereference at virtual address 00000000
> 

I suspect this is a memory (including cache) stability problem, not
a controller problem or a software problem.
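
For reference, the pattern in the earlier traces can be checked mechanically.
The following is only a minimal userspace sketch, not kernel code: the helper
names (swap_halfwords, is_halfword_swap, is_word_aligned) are made up for
illustration, and it assumes "dword aligned" means 4-byte aligned here. The
only real data in it are the three expected/found pairs copied from the
list_del corruption messages quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Exchange the upper and lower 16-bit halves of a 32-bit word. */
uint32_t swap_halfwords(uint32_t v)
{
	return (v << 16) | (v >> 16);
}

/* True if 'found' equals 'expected' with its 16-bit halves exchanged. */
bool is_halfword_swap(uint32_t expected, uint32_t found)
{
	return found == swap_halfwords(expected);
}

/* True if addr is 4-byte aligned (assumed meaning of "dword aligned"). */
bool is_word_aligned(uint32_t addr)
{
	return (addr & 0x3u) == 0;
}

/* Check the three pairs reported by the list_del corruption warnings. */
void check_logged_values(void)
{
	/* [  622.212870] should be 8db63600, but was 36008db6 */
	assert(is_halfword_swap(0x8db63600u, 0x36008db6u));
	/* [  476.880749] should be 8daf74c0, but was 74c08daf */
	assert(is_halfword_swap(0x8daf74c0u, 0x74c08dafu));
	/* [ 1185.871377] should be 8ac25b80, but was 8ac25440: not a swap */
	assert(!is_halfword_swap(0x8ac25b80u, 0x8ac25440u));
}
```

The first two pairs are exact halfword swaps; the last one is not, so the
corruption does not always follow the same pattern. The address of prev->next
itself is not in any of the logs, so is_word_aligned() is only there for when
that address gets printed.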

-- 

Best Regards,
Peter Chen