Re: USB transaction errors causing RCU stalls and kernel panics

Mathias Nyman <mathias.nyman@xxxxxxxxxxxxxxx> · Wed, 4 Mar 2020 14:11:42 +0200

On 3.3.2020 22.08, Jonas Karlsson wrote:
>>
>> On Tue, Mar 03, 2020 at 03:05:50PM +0000, Jonas Karlsson wrote:
>>> Hi,
>>>
>>> We have a board with an NXP i.MX8 SoC. We are running Linux 4.19.35 from
>>> NXP on the SoC.
>>>
>>> There is a modem connected to the SoC via USB through a USB hub.
>>> The modem presents itself as a cdc-acm device with 4 ttys.
>>>
>>> Sometimes we end up in a situation where all transfers over USB generate
>>> "USB transaction errors".
>>> It is likely that the modem is misbehaving. When this happens we get a lot of
>>> "xhci-cdns3: ERROR unknown event type 37"
>>> in the terminal indicating that the xhci event ring is full. This often leads to RCU
>>> stalls and sometimes kernel panics.
>>>
>>> If I enable dynamic debug on xhci_hcd and cdc-acm I can see that all
>>> transfers have error code -71 (-EPROTO which in xhci translates to
>>> "USB transaction error"). When this happens it seems like xhci resets
>>> the ep, sets TR Deq Ptr to unstall the ep and then a new transfer is
>>> started which also fails.

Note that these are all xhci internal endpoint state operations; the device
(modem) does not see any of these changes on its side of the endpoint.

In the 4.19 kernel, xhci gives back a URB that hit a transaction error with a
-EPROTO status immediately in the interrupt handler.
A transaction error, just like a stall error, causes the xHC internal
endpoint state to go to halted. The xhci driver needs to reset the "xhci internal"
endpoint state to move it to a stopped state, and then tell the xHC to
move past that URB in the ring buffer with a Set TR Deq Ptr command
(which clears the xHC internal cache as well).

If the ring is not empty when Set TR Deq Ptr completes, the driver restarts it.
In this case it appears cdc_acm managed to queue the URB back before this,
so the ring was restarted. This was repeated over and over again.
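
As a rough illustration of that loop, here is a toy user-space model. It is
not xhci or cdc_acm code; every name in it (toy_ep, toy_fail_one_transfer,
toy_run) is made up for this sketch, and it assumes every transfer fails.

```c
/* Toy user-space model of the halt/reset/restart loop; not kernel code.
 * All names here are hypothetical and exist only for this sketch. */
#include <assert.h>
#include <stdbool.h>

enum ep_state { EP_RUNNING, EP_HALTED, EP_STOPPED };

struct toy_ep {
	enum ep_state state;
	int queued;        /* URBs currently on the transfer ring */
	int error_events;  /* transfer events carrying a transaction error */
};

/* One failing transfer, following the sequence described above. */
static void toy_fail_one_transfer(struct toy_ep *ep, bool class_driver_resubmits)
{
	assert(ep->state == EP_RUNNING && ep->queued > 0);

	ep->state = EP_HALTED;   /* transaction error halts the endpoint */
	ep->error_events++;
	ep->queued--;            /* URB is given back with -EPROTO */
	if (class_driver_resubmits)
		ep->queued++;    /* completion handler queues it right back */

	ep->state = EP_STOPPED;  /* endpoint reset: halted -> stopped */
	/* Set TR Deq Ptr completes here, moving past the failed URB */
	if (ep->queued > 0)
		ep->state = EP_RUNNING;  /* ring not empty: it is restarted */
}

/* Run until the endpoint stays stopped or max_iterations is reached.
 * Returns the number of error events generated. */
static int toy_run(bool class_driver_resubmits, int max_iterations)
{
	struct toy_ep ep = { EP_RUNNING, 1, 0 };
	int i;

	for (i = 0; i < max_iterations && ep.state == EP_RUNNING; i++)
		toy_fail_one_transfer(&ep, class_driver_resubmits);
	return ep.error_events;
}
```

Without a resubmit the model stops after a single error event; with a
resubmit in the completion handler it only ends when the iteration cap is
hit, which is the event storm seen here.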

>>> This behavior generates a lot of events on
>>> the event ring which causes 'ERROR unknown event type 37'. This loop
>>> of failing transfers seems to continue until we either unbind the USB driver or
>>> get a kernel panic. The SoC almost becomes unresponsive since it spends most
>>> of the time executing usb interrupts.
>>>
>>> If I pull the reset pin of the USB hub and keep it in reset state at
>>> this point, the event loop of failing transfers continues even though
>>> there is nothing on the USB bus any longer. The only way to get out of that
>>> loop is to either unbind the usb driver or power cycle the board.
>>>
>>> Is this the expected behavior when USB transaction errors happen for all
>>> transfers when using the cdc-acm class driver?
>>> Or could there be something wrong in the low level USB driver (Cadence
>>> in our case)? We need to figure out why we get all the transaction errors but
>>> we also need to make sure the kernel does not die on us when we have a
>>> misbehaving USB device.
>>> Does anyone have a suggestion on what we could do to improve the stability
>>> of the kernel in this situation?
>>
>> I would blame the xhci-cdns driver as it is the one controlling all of this.
>>
>> I don't see this driver in the 4.19 tree, so I think you are going to have to get
>> support from the company that provided you with that driver as you are already
>> paying for that support from them :)
>>
>> good luck!
>>
>> greg k-h
> 
> Thanks for the feedback! If the cadence driver is the main suspect I totally agree with you.
> 
> The reason I posted on this mailing list was that I was afraid that the cdc-acm driver could
> be causing new transfers to be started when the previous fails due to USB transaction errors and
> then trigger this event storm.
> The acm_ctrl_irq() function seems to submit a new urb directly if the previous fails, but I cannot
> say that I understand that code very well yet. The acm_read_bulk_callback() function also seems
> to submit a new read urb on USB transaction errors. But if you think this could not cause this
> behavior I will ask our supplier to fix the cdns driver.
> 
> BR,
> Jonas
> 

I recently got a report about a somewhat similar issue on a 4.4 stable kernel, so this
might not be xhci-cdns specific.

That case involved autosuspend of cdc-acm, and there was only a short burst of
transaction errors and resubmitted URBs even though the device was supposed to be suspended.
It looks like cdc_acm autosuspended even though it had URBs pending.

I'm guessing that in that case the transfer ring was restarted even though the link was
already "suspended", causing transaction errors. The ring could be restarted if URBs were
resubmitted by the class driver when usb core suspends all interfaces, since flushing all
pending URBs calls the URB completion handler.
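
One way a class driver could avoid both the endless loop and the resubmit
during suspend is to decide in the completion handler whether a resubmit
makes sense at all. The sketch below is a user-space model of such a policy,
not what cdc_acm does today; toy_reader, toy_should_resubmit and the limit of
3 are assumptions picked for illustration. The status values are the usual
negative errno codes a URB completes with.

```c
/* User-space sketch of a bounded resubmit policy; not cdc_acm code.
 * toy_reader, toy_should_resubmit and TOY_MAX_PROTO_ERRORS are hypothetical. */
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define TOY_MAX_PROTO_ERRORS 3  /* arbitrary limit chosen for this sketch */

struct toy_reader {
	bool suspended;                /* interface is suspended or suspending */
	int consecutive_proto_errors;  /* -EPROTO completions in a row */
};

/* Decide, from a URB completion status, whether to queue the URB again. */
static bool toy_should_resubmit(struct toy_reader *r, int status)
{
	assert(status <= 0);

	if (r->suspended)
		return false;  /* never restart the ring while suspended */

	switch (status) {
	case 0:
		r->consecutive_proto_errors = 0;
		return true;
	case -ENOENT:      /* URB was unlinked */
	case -ECONNRESET:  /* URB was unlinked asynchronously */
	case -ESHUTDOWN:   /* device or host controller is going away */
		return false;
	case -EPROTO:      /* transaction error: retry a few times, then stop */
		return ++r->consecutive_proto_errors < TOY_MAX_PROTO_ERRORS;
	default:
		return false;
	}
}
```

A real driver would also need some way to recover after giving up, for
example a delayed retry or a device reset; that part is left out here.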

-Mathias