RE: USB transaction errors causing RCU stalls and kernel panics

Jonas Karlsson <jonas.karlsson@xxxxxxxx> · Tue, 3 Mar 2020 20:08:38 +0000

> 
> On Tue, Mar 03, 2020 at 03:05:50PM +0000, Jonas Karlsson wrote:
> > Hi,
> >
> > We have a board with an NXP i.MX8 SoC. We are running Linux 4.19.35 from
> > NXP on the SoC.
> >
> > There is a modem connected to the SoC via USB through a USB hub.
> > The modem presents itself as a cdc-acm device with 4 ttys.
> >
> > Sometimes we end up in a situation where all transfers over USB generate
> > "USB transaction errors".
> > It is likely that the modem is misbehaving. When this happens we get a lot of
> > "xhci-cdns3: ERROR unknown event type 37"
> > in the terminal indicating that the xhci event ring is full. This often leads to RCU
> > stalls and sometimes kernel panics.
> >
> > If I enable dynamic debug on xhci_hcd and cdc-acm I can see that all
> > transfers have error code -71 (-EPROTO, which in xhci translates to
> > "USB transaction error"). When this happens it seems like xhci resets
> > the ep, sets TR Deq Ptr to unstall the ep and then a new transfer is
> > started which also fails. This behavior generates a lot of events on
> > the event ring which causes "ERROR unknown event type 37". This loop
> > of failing transfers seems to continue until we either unbind the USB driver or
> > get a kernel panic. The SoC becomes almost unresponsive since it spends most
> > of the time executing USB interrupts.
> >
> > If I pull the reset pin of the USB hub and keep it in reset state at
> > this point, the loop of failing transfers continues even though
> > there is no longer anything on the USB bus. The only way to get out of that
> > loop is to either unbind the USB driver or power cycle the board.
> >
> > Is this the expected behavior when USB transaction errors happen for all
> > transfers when using the cdc-acm class driver?
> > Or could there be something wrong in the low level USB driver (Cadence
> > in our case)? We need to figure out why we get all the transaction errors but
> > we also need to make sure the kernel does not die on us when we have a
> > misbehaving USB device.
> > Does anyone have a suggestion on what we could do to improve the stability
> > of the kernel in this situation?
> 
> I would blame the xhci-cdns driver as it is the one controlling all of this.
> 
> I don't see this driver in the 4.19 tree, so I think you are going to have to get
> support from the company that provided you with that driver as you are already
> paying for that support from them :)
> 
> good luck!
> 
> greg k-h

Thanks for the feedback! If the Cadence driver is the main suspect I totally agree with you.

The reason I posted on this mailing list was that I was afraid the cdc-acm driver could
be starting new transfers whenever the previous one fails with a USB transaction error,
and thereby triggering this event storm.
The acm_ctrl_irq() function seems to submit a new URB directly if the previous one fails, but I
cannot say that I understand that code very well yet. The acm_read_bulk_callback() function also
seems to submit a new read URB on USB transaction errors. But if you think this could not cause
this behavior I will ask our supplier to fix the cdns driver.
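
To make the concern concrete, here is a small userspace model of the pattern I
am worried about. It is only a sketch and not the cdc-acm code: struct fake_ep,
fake_transfer(), complete_and_maybe_resubmit(), simulate() and
MAX_CONSECUTIVE_ERRORS are all names I made up, and the "hardware" is assumed
to fail every transfer with -EPROTO. It compares a completion handler that
always resubmits with one that gives up after a fixed number of consecutive
errors.

```c
/* Userspace model only: none of these names exist in the kernel. */
#include <errno.h>

#define MAX_CONSECUTIVE_ERRORS 8	/* hypothetical threshold */

struct fake_ep {
	int consecutive_errors;	/* failures since the last success */
	int submissions;	/* total transfers submitted so far */
	int halted;		/* set once we stop resubmitting */
};

/* Assumed misbehaving device: every transfer fails with -EPROTO. */
static int fake_transfer(void)
{
	return -EPROTO;
}

/*
 * Model of a completion handler. Returns 1 if a new transfer was
 * submitted, 0 if the handler decided to stop.
 */
static int complete_and_maybe_resubmit(struct fake_ep *ep, int status,
				       int bounded)
{
	if (status == 0) {
		ep->consecutive_errors = 0;
	} else if (status == -EPROTO) {
		ep->consecutive_errors++;
		if (bounded &&
		    ep->consecutive_errors >= MAX_CONSECUTIVE_ERRORS) {
			ep->halted = 1;
			return 0;	/* give up instead of resubmitting */
		}
	}
	ep->submissions++;
	return 1;
}

/* Process at most 'limit' completions; return the number of submissions. */
int simulate(int bounded, int limit)
{
	struct fake_ep ep = { 0, 1, 0 };	/* first submission counts */
	int i;

	for (i = 0; i < limit; i++) {
		if (!complete_and_maybe_resubmit(&ep, fake_transfer(), bounded))
			break;
	}
	return ep.submissions;
}
```

In this model simulate(0, 1000) ends with 1001 submissions, one per completion
for as long as the loop is allowed to run, while simulate(1, 1000) stops at 8.
Whether the real drivers behave like the unbounded case is exactly what I am
unsure about.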

BR,
Jonas