Re: MUSB interrupt storm on device removal

Johan Hovold <johan@xxxxxxxxxx> · Thu, 24 Jan 2019 10:22:59 +0100

On Wed, Jan 23, 2019 at 11:05:40AM -0500, Alan Stern wrote:
> On Wed, 23 Jan 2019, Bin Liu wrote:
> 
> > On Wed, Jan 23, 2019 at 03:55:47PM +0100, Johan Hovold wrote:
> > > On Wed, Jan 23, 2019 at 08:09:47AM -0600, Bin Liu wrote:
> > > > On Wed, Jan 23, 2019 at 09:55:49AM +0100, Johan Hovold wrote:
> > > > > On Wed, Jan 23, 2019 at 07:52:12AM +0100, Greg Kroah-Hartman wrote:
> > > 
> > > > > > That's not what any other host controller returns when a device is
> > > > > > removed, so either you are going to have to fix all USB drives for this
> > > > > > issue, or you need to fix the musb driver to not send this error for
> > > > > > when a device is removed (hint, do the latter...)
> > > > > 
> > > > > Right, this needs to be handle at the HCD level.
> > > > 
> > > > Any reason usb_serial_generic_read_bulk_callback() doesn't handle
> > > > -EPROTO in the same way as -EPIPE?
> > > 
> > > Since it is supposed to be intermittent unlike, for example, -ENOENT or
> > > -EPIPE (the latter which the device driver can recover from if it cares
> > > to implement clearing of halt).
> 
> Wait a minute.  Nothing says any of those errors is supposed to be 
> intermittent.  As long as an error has a systematic cause (as opposed 
> to random noise, for example), it will recur as often as the cause 
> does.

Right, I should have said drivers tend to treat it as always being
intermittent.

> At least when -EPROTO errors are caused by device disconnect, we know 
> that they will eventually go away when the upstream hub reports the 
> port disconnect event.  But until then, an interrupt storm is certainly 
> possible.

Indeed, and this isn't the first time we've had this discussion either I
realised.

In fact, I've been hit by this myself on BBB and musb when disconnecting
a device connected through an external hub.

At the time, the immediate causes that were making the completion
handler take longer than usual (unaligned copy of a large buffer and a
printk respectively) could be fixed. The problem went away, but of
course not the underlying issue.

Note that the problem I was seeing also went away both when switching to
a different single-core SoC using ehci-omap, or when replacing the
external hub. IIRC it wasn't the hub workqueue that was starved as a I
first had thought, but the hub interrupt was never even received (or
processed at least).

Unfortunately, I never got around to investigating this further.

> > > My point was that unless you fix this at the HCD level, you will need to
> > > add complex recovery handling to every USB driver and completion handler
> > > (~500 of those). But perhaps that is what it needed.
> > 
> > okay, it probably make sense to handle the case in HCD because the
> > number of HCD is much less.
> 
> One possibility is to giveback URBs with certain errors (such as
> -EPROTO) only at a frame boundary, or at 1-ms intervals.  This feels 
> like a very artificial solution, though.
> 
> > > I do see now that of all USB drivers we have two drivers that handles
> > > -EPROTO by resubmitting after a delay, while a handful explicitly deals
> > > with -EPROTO by simply stopping to resubmit (some probably bail out on
> > > all errors, but the majority appear to resubmit on -EPROTO).
> 
> Any driver which immediately retries an URB after getting -EPROTO or
> -EILSEQ or -ETIME, and has no mechanism for backing off or limiting the
> retries, is buggy.  As far as I can see, that's all there is to it.

I tend to agree with you on that, but adding complex back-off handling
of intermittent errors to every USB driver and completion handler will
be quite a bit of work. Unless the HCD drivers can assist, we'd at least
want have some part of the implementation provided by shared code.

Also note that simply starting to bail out on any error now would risk
introducing regressions for setups where intermittent errors do occur.

> > Thanks for the info.
> > I will handle this case in musb driver.
> 
> Why doesn't the same problem occur with other types of host controller?

I think we should get to the bottom of that. BBB is single core which
may be part of the reason why it's affected, but it's definitely related
to the controller as well.

Johan