Re: MUSB interrupt storm on device removal

Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> · Thu, 24 Jan 2019 10:22:26 -0500 (EST)

On Thu, 24 Jan 2019, Johan Hovold wrote:

> > At least when -EPROTO errors are caused by device disconnect, we know 
> > that they will eventually go away when the upstream hub reports the 
> > port disconnect event.  But until then, an interrupt storm is certainly 
> > possible.
> 
> Indeed, and this isn't the first time we've had this discussion either,
> I realised.
> 
> In fact, I've been hit by this myself on BBB and musb when disconnecting
> a device connected through an external hub.
> 
> At the time, the immediate causes that were making the completion
> handler take longer than usual (unaligned copy of a large buffer and a
> printk respectively) could be fixed. The problem went away, but of
> course not the underlying issue.
> 
> Note that the problem I was seeing also went away both when switching to
> a different single-core SoC using ehci-omap, or when replacing the
> external hub. IIRC it wasn't the hub workqueue that was starved as I
> first had thought, but the hub interrupt was never even received (or
> processed at least).
> 
> Unfortunately, I never got around to investigating this further.

I guess now we have some motivation to look into this more closely.  :-)

> > > > My point was that unless you fix this at the HCD level, you will need to
> > > > add complex recovery handling to every USB driver and completion handler
> > > > (~500 of those). But perhaps that is what is needed.
> > > 
> > > okay, it probably makes sense to handle the case in the HCD because
> > > the number of HCDs is much smaller.
> > 
> > One possibility is to giveback URBs with certain errors (such as
> > -EPROTO) only at a frame boundary, or at 1-ms intervals.  This feels 
> > like a very artificial solution, though.
> > 
> > > > I do see now that of all USB drivers we have two drivers that handle
> > > > -EPROTO by resubmitting after a delay, while a handful explicitly deals
> > > > with -EPROTO by simply stopping to resubmit (some probably bail out on
> > > > all errors, but the majority appear to resubmit on -EPROTO).
> > 
> > Any driver which immediately retries an URB after getting -EPROTO or
> > -EILSEQ or -ETIME, and has no mechanism for backing off or limiting the
> > retries, is buggy.  As far as I can see, that's all there is to it.
> 
> I tend to agree with you on that, but adding complex back-off handling
> of intermittent errors to every USB driver and completion handler will
> be quite a bit of work. Unless the HCD drivers can assist, we'd at least
> want to have some part of the implementation provided by shared code.
> 
> Also note that simply starting to bail out on any error now would risk
> introducing regressions for setups where intermittent errors do occur.

Agreed.
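
To make the "shared code" idea a bit more concrete, here is a rough
sketch of the policy part only, written as plain userspace C so it can
be tried outside the kernel.  Everything in it is an assumption for
illustration: the names (urb_retry_state, urb_retry_delay_ms), the
1 ms base delay, the 64 ms cap and the limit of 10 consecutive errors
are made up, and nothing like this exists in usbcore today.  A real
version would also need a timer or delayed work to do the actual
resubmission.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical names and values throughout; not an existing kernel API. */
#define RETRY_BASE_MS	1u	/* first delay: roughly one full-speed frame */
#define RETRY_MAX_MS	64u	/* upper bound on the delay between retries */
#define RETRY_LIMIT	10u	/* stop after this many consecutive errors */

struct urb_retry_state {
	unsigned int consecutive_errors;
};

/* The three statuses discussed in this thread. */
static bool is_transient_usb_error(int status)
{
	return status == -EPROTO || status == -EILSEQ || status == -ETIME;
}

/*
 * Decide what a completion handler should do with an URB that finished
 * with @status.
 *
 * Returns 0 to resubmit right away, a positive delay in milliseconds to
 * resubmit later, or -1 to stop resubmitting altogether.
 */
static int urb_retry_delay_ms(struct urb_retry_state *st, int status)
{
	unsigned int delay, i;

	if (status == 0) {
		/* A successful transfer ends the error streak. */
		st->consecutive_errors = 0;
		return 0;
	}

	/* Anything else (e.g. -ENODEV, -ESHUTDOWN) is not retried here. */
	if (!is_transient_usb_error(status))
		return -1;

	if (++st->consecutive_errors > RETRY_LIMIT)
		return -1;

	/* Double the delay for each further error, up to the cap. */
	delay = RETRY_BASE_MS;
	for (i = 1; i < st->consecutive_errors && delay < RETRY_MAX_MS; i++)
		delay *= 2;

	return (int)delay;
}
```

With these made-up numbers a driver would keep retrying through an
occasional glitch, so setups with intermittent errors should not
regress, while a disconnected device costs at most ten completions
spread over a few hundred milliseconds instead of a tight loop.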

> > Why doesn't the same problem occur with other types of host controller?
> 
> I think we should get to the bottom of that. BBB is single core which
> may be part of the reason why it's affected, but it's definitely related
> to the controller as well.

Perhaps it has something to do with the timing of the completion 
interrupts.  I don't know anything about how musb works, though.  Some 
low-level timing information would be good to see.

Alan Stern