Re: MUSB interrupt storm on device removal

Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> · Wed, 23 Jan 2019 11:05:40 -0500 (EST)

On Wed, 23 Jan 2019, Bin Liu wrote:

> On Wed, Jan 23, 2019 at 03:55:47PM +0100, Johan Hovold wrote:
> > On Wed, Jan 23, 2019 at 08:09:47AM -0600, Bin Liu wrote:
> > > On Wed, Jan 23, 2019 at 09:55:49AM +0100, Johan Hovold wrote:
> > > > On Wed, Jan 23, 2019 at 07:52:12AM +0100, Greg Kroah-Hartman wrote:
> > 
> > > > > That's not what any other host controller returns when a device is
> > > > > removed, so either you are going to have to fix all USB drivers for this
> > > > > issue, or you need to fix the musb driver to not send this error for
> > > > > when a device is removed (hint, do the latter...)
> > > > 
> > > > Right, this needs to be handled at the HCD level.
> > > 
> > > Any reason usb_serial_generic_read_bulk_callback() doesn't handle
> > > -EPROTO in the same way as -EPIPE?
> > 
> > Since it is supposed to be intermittent unlike, for example, -ENOENT or
> > -EPIPE (the latter which the device driver can recover from if it cares
> > to implement clearing of halt).

Wait a minute.  Nothing says any of those errors is supposed to be 
intermittent.  As long as an error has a systematic cause (as opposed 
to random noise, for example), it will recur as often as the cause 
does.

At least when -EPROTO errors are caused by device disconnect, we know 
that they will eventually go away when the upstream hub reports the 
port disconnect event.  But until then, an interrupt storm is certainly 
possible.

> Okay, makes sense.
> 
> > 
> > > > dwc2 fixed a similar lockup issue due to retried NAKed transaction by
> > > > not retrying immediately:
> > > > 
> > > > 	38d2b5fb75c1 ("usb: dwc2: host: Don't retry NAKed transactions right away")
> > > 
> > > Both cases are all about device removal, but this musb case is slightly
> > > different from this dwc2 case.
> > > 
> > > It is all about re-transmitting which causes interrupt storm, but in
> > > this dwc2 case, it is the dwc2 driver doing the re-transmitting, so it
> > > makes sense to delay it in the dwc2 driver as this referred patch does,
> > >
> > > but in this musb case, musb driver reports transaction error to the usb
> > > serial driver, the usb serial driver issues the re-transmitting not the
> > > musb driver, so I don't think the delay should be added in the musb
> > > driver.
> > 
> > I didn't say it was exactly the same.
> 
> Yeah, I know. My point was the fix is in the place where re-transmitting
> happens, but
> 
> > My point was that unless you fix this at the HCD level, you will need to
> > add complex recovery handling to every USB driver and completion handler
> > (~500 of those). But perhaps that is what is needed.
> 
> Okay, it probably makes sense to handle the case in the HCD because there
> are far fewer HCDs.

One possibility is to giveback URBs with certain errors (such as
-EPROTO) only at a frame boundary, or at 1-ms intervals.  This feels 
like a very artificial solution, though.
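
To make the idea concrete, here is a rough userspace sketch of the timing
rule only.  It is not code from any HCD; the helper name, the microsecond
clock and the choice of error codes are all assumptions for illustration.
An URB that failed with a transaction error is held until the next 1-ms
frame boundary, and everything else is given back immediately:

```c
#include <errno.h>
#include <stdint.h>

#define FRAME_US 1000u	/* one full-speed frame, 1 ms */

/*
 * Hypothetical helper: earliest time (in microseconds) at which an URB
 * that completed at now_us with the given status would be given back.
 * Transaction errors wait for the next frame boundary; a completion that
 * lands exactly on a boundary waits for the following one.
 */
static uint64_t giveback_time_us(uint64_t now_us, int status)
{
	if (status == -EPROTO || status == -EILSEQ || status == -ETIME)
		return (now_us / FRAME_US + 1) * FRAME_US;

	return now_us;	/* success and other errors: no added delay */
}
```

With a rule like this, a driver that resubmits straight from its completion
handler would see at most one failed completion per frame.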

> > I do see now that of all USB drivers we have two drivers that handle
> > -EPROTO by resubmitting after a delay, while a handful explicitly deals
> > with -EPROTO by simply stopping to resubmit (some probably bail out on
> > all errors, but the majority appear to resubmit on -EPROTO).

Any driver which immediately retries an URB after getting -EPROTO or
-EILSEQ or -ETIME, and has no mechanism for backing off or limiting the
retries, is buggy.  As far as I can see, that's all there is to it.
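
For illustration, the kind of mechanism I mean could look roughly like the
sketch below.  This is plain userspace C, not taken from any existing
driver: the struct, the function names and the limits (5 retries, delays
doubling from 1 ms up to a cap of 8 ms) are made-up assumptions.  A real
completion handler would use the returned delay to schedule the resubmit
(from a timer or delayed work, say) instead of resubmitting directly.

```c
#include <errno.h>
#include <stdbool.h>

#define RETRY_MAX	5	/* assumed limit on consecutive retries */
#define RETRY_BASE_MS	1	/* assumed first back-off delay */
#define RETRY_CAP_MS	8	/* assumed upper bound on the delay */

/* Hypothetical per-endpoint state kept by the driver. */
struct retry_state {
	unsigned int failures;	/* consecutive transaction errors so far */
};

/* The three error codes singled out above. */
static bool is_transaction_error(int status)
{
	return status == -EPROTO || status == -EILSEQ || status == -ETIME;
}

/*
 * Decide what to do with a completed URB.  Returns the delay in ms before
 * resubmitting (0 means resubmit right away), or -1 to stop resubmitting.
 */
static int resubmit_delay_ms(struct retry_state *rs, int status)
{
	unsigned int delay;

	if (status == 0) {
		rs->failures = 0;	/* success resets the back-off */
		return 0;
	}
	if (!is_transaction_error(status))
		return -1;		/* other errors: give up in this sketch */
	if (rs->failures >= RETRY_MAX)
		return -1;		/* retry budget exhausted */

	delay = RETRY_BASE_MS << rs->failures;	/* 1, 2, 4, ... */
	if (delay > RETRY_CAP_MS)
		delay = RETRY_CAP_MS;
	rs->failures++;
	return (int)delay;
}
```

Either half on its own (the delay or the retry limit) would already be
enough to prevent an unbounded storm; the sketch just shows both.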

> Thanks for the info.
> I will handle this case in musb driver.

Why doesn't the same problem occur with other types of host controller?

Alan Stern