Re: random + 8250-omap + edma: dma_ccerr_handler "did not care"

Peter Ujfalusi <peter.ujfalusi@xxxxxx> · Thu, 19 May 2016 13:19:22 +0300

On 05/19/16 05:42, Matthijs van Duin wrote:
> I already had occasional random failures rebooting my bbb, but it
> happened rarely and I hadn't investigated yet.
> 
> While debugging another issue I turned off the "quiet" option and as a
> side-effect I discovered the cause of the failures:
> 
> random: nonblocking pool is initialized
> irq 187: nobody cared (try booting with the "irqpoll" option)
> CPU: 0 PID: $varies Comm: $varies Not tainted 4.6.0-bone3-dd1 #2
> Hardware name: Generic AM33XX (Flattened Device Tree)
> [<c010b059>] (unwind_backtrace) from [<c0109945>] (show_stack+0x11/0x14)
> [<c0109945>] (show_stack) from [<c014802f>] (__report_bad_irq+0x23/0x84)
> [<c014802f>] (__report_bad_irq) from [<c01482a1>] (note_interrupt+0x1c5/0x200)
> [<c01482a1>] (note_interrupt) from [<c0146a13>] (handle_irq_event_percpu+0xfb/0x154)
> [<c0146a13>] (handle_irq_event_percpu) from [<c0146a8d>] (handle_irq_event+0x21/0x2c)
> [<c0146a8d>] (handle_irq_event) from [<c01486d1>] (handle_level_irq+0x61/0xac)
> [<c01486d1>] (handle_level_irq) from [<c014633d>] (generic_handle_irq+0x1d/0x28)
> [<c014633d>] (generic_handle_irq) from [<c01464ff>] (__handle_domain_irq+0x3b/0x80)
> [<c01464ff>] (__handle_domain_irq) from [<c044284d>] (__irq_svc+0x4d/0x74)
> [<c044284d>] (__irq_svc) from [<c0121d52>] (__do_softirq+0x66/0x1c8)
> [<c0121d52>] (__do_softirq) from [<c0122229>] (irq_exit+0x95/0xbc)
> [<c0122229>] (irq_exit) from [<c0146503>] (__handle_domain_irq+0x3f/0x80)
> [<c0146503>] (__handle_domain_irq) from [<c044284d>] (__irq_svc+0x4d/0x74)
> [<c044284d>] (__irq_svc) from [<c0145656>] (console_unlock+0x26e/0x410)
> [<c0145656>] (console_unlock) from [<c01459b5>] (vprintk_emit+0x1bd/0x310)
>   (rest of traceback varies)
> handlers:
> [<c02c5e6d>] dma_ccerr_handler
> Disabling IRQ #187

I have not seen this happening on my boards, but when the DMA was enabled for
omap8250 UART, it opened up a can of worms...
Can you try to apply this patch:
https://lkml.org/lkml/2016/5/10/315

It has been observed in the past that when UART DMA is enabled we receive
DMA events even if there is nothing pending for DMA. If we do not ask eDMA to
recheck the status, we can end up receiving a flood of interrupts and the
kernel will disable the interrupt line.
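
For reference, the "nobody cared" / "Disabling IRQ" step in the log above comes
from the kernel's spurious-interrupt accounting (note_interrupt() in
kernel/irq/spurious.c). What follows is only a simplified userspace sketch of
that accounting, not the kernel code: the struct and function names are
hypothetical, and the time-based heuristics, polling fallback, and locking are
left out. The 100000 / 99900 thresholds are the ones the kernel uses.

```c
#include <stdbool.h>

/* Hypothetical model of the per-IRQ counters; not the kernel's irq_desc. */
struct irq_model {
    unsigned int irq_count;      /* interrupts seen in the current window */
    unsigned int irqs_unhandled; /* of those, how many no handler claimed */
    bool disabled;               /* set once the line has been shut off */
};

#define IRQ_WINDOW        100000u /* interrupts per evaluation window */
#define IRQ_UNHANDLED_MAX  99900u /* more unhandled than this: give up */

/*
 * Record one interrupt. 'handled' is false when every registered handler
 * returned IRQ_NONE. Returns true if this call disabled the line.
 */
static bool model_note_interrupt(struct irq_model *m, bool handled)
{
    if (m->disabled)
        return false;

    if (!handled)
        m->irqs_unhandled++;

    /* Only evaluate once a full window of interrupts has been seen. */
    if (++m->irq_count < IRQ_WINDOW)
        return false;

    bool stuck = m->irqs_unhandled > IRQ_UNHANDLED_MAX;

    /* Start a fresh window either way. */
    m->irq_count = 0;
    m->irqs_unhandled = 0;

    if (stuck)
        m->disabled = true; /* kernel equivalent: "Disabling IRQ #N" */

    return stuck;
}
```

In this model, a handler that keeps returning "not mine" for a level interrupt
that stays asserted trips the threshold after a single window, which would
match a flood of events with no error status left to clear.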

> To be honest, I can't even begin to speculate what's going on here.  I
> checked dma_ccerr_handler but I don't see how it could fail to clear the
> error irq.  And I didn't include the "random: nonblocking pool is
> initialized" message right before the traceback by accident.  So far
> it's been there every single time.
> 
> The exact moment this happens varies, I just made it easier to trigger
> by increasing the volume of console output.  Probably.  Repeatedly
> dumping a large pile of output to /dev/console failed to trigger it
> though.  It does however on rare occasion also happen on shutdown.
> 
> Whenever it occurs during boot, often things eventually get stuck
> resulting in hung task tracebacks in out_of_line_wait_on_bit() in ext4
> code.  But not always.  I haven't seen it block shutdown.
> 
> I've confirmed I can also reproduce it using mainline v4.6.  My config
> file can be found here:
> https://github.com/dutchanddutch/bb-kernel/blob/am33x-v4.6/patches/defconfig
> The only change needed to build with mainline is clearing EXTRA_FIRMWARE

I'll try to reproduce this on my side. It is almost impossible to debug,
though, since the UART is using DMA, so prints from the DMA driver want to go
via UART -> DMA and we have a circular lock :o

-- 
Péter