On Thu, Jan 24, 2019 at 10:22:26AM -0500, Alan Stern wrote:
> On Thu, 24 Jan 2019, Johan Hovold wrote:
>
> > > At least when -EPROTO errors are caused by device disconnect, we know
> > > that they will eventually go away when the upstream hub reports the
> > > port disconnect event. But until then, an interrupt storm is certainly
> > > possible.
> >
> > Indeed, and this isn't the first time we've had this discussion either,
> > I realised.
> >
> > In fact, I've been hit by this myself on BBB and musb when disconnecting
> > a device connected through an external hub.
> >
> > At the time, the immediate causes that were making the completion
> > handler take longer than usual (unaligned copy of a large buffer and a
> > printk respectively) could be fixed. The problem went away, but of
> > course not the underlying issue.
> >
> > Note that the problem I was seeing also went away both when switching to
> > a different single-core SoC using ehci-omap, and when replacing the
> > external hub. IIRC it wasn't the hub workqueue that was starved as I
> > first had thought, but the hub interrupt was never even received (or
> > processed at least).
> >
> > Unfortunately, I never got around to investigating this further.
>
> I guess now we have some motivation to look into this more closely. :-)
>
> > > > > My point was that unless you fix this at the HCD level, you will need to
> > > > > add complex recovery handling to every USB driver and completion handler
> > > > > (~500 of those). But perhaps that is what is needed.
> > > >
> > > > okay, it probably makes sense to handle the case in HCD because the
> > > > number of HCDs is much smaller.
> > >
> > > One possibility is to give back URBs with certain errors (such as
> > > -EPROTO) only at a frame boundary, or at 1-ms intervals. This feels
> > > like a very artificial solution, though.
> > >
> > > > > I do see now that of all USB drivers we have two drivers that handle
> > > > > -EPROTO by resubmitting after a delay, while a handful explicitly deal
> > > > > with -EPROTO by simply no longer resubmitting (some probably bail out on
> > > > > all errors, but the majority appear to resubmit on -EPROTO).
> > >
> > > Any driver which immediately retries an URB after getting -EPROTO or
> > > -EILSEQ or -ETIME, and has no mechanism for backing off or limiting the
> > > retries, is buggy. As far as I can see, that's all there is to it.
> >
> > I tend to agree with you on that, but adding complex back-off handling
> > of intermittent errors to every USB driver and completion handler will
> > be quite a bit of work. Unless the HCD drivers can assist, we'd at least
> > want to have some part of the implementation provided by shared code.
> >
> > Also note that simply starting to bail out on any error now would risk
> > introducing regressions for setups where intermittent errors do occur.
>
> Agreed.
>
> > > Why doesn't the same problem occur with other types of host controller?
> >
> > I think we should get to the bottom of that. BBB is single core which
> > may be part of the reason why it's affected, but it's definitely related
> > to the controller as well.
>
> Perhaps it has something to do with the timing of the completion
> interrupts. I don't know anything about how musb works, though. Some
> low-level timing information would be good to see.

The musb controller driver itself doesn't have an ISR bottom half (BH),
so when an endpoint interrupt happens, the ISR directly processes the URB
and calls giveback. I tried adding HCD_BH to the musb hcd .flags, but the
issue still happens.

Regards,
-Bin.
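
For reference, the kind of back-off and retry limit discussed above could be
sketched roughly as below. This is a standalone userspace illustration only,
not code from musb or any existing driver: the names retry_state,
is_transient_usb_error() and next_resubmit_delay_ms(), as well as the limit
and delay constants, are all made up for the example.

```c
#include <errno.h>
#include <stdbool.h>

/* All names and constants below are hypothetical, chosen for illustration. */
#define RETRY_LIMIT   5   /* give up after this many consecutive errors */
#define BASE_DELAY_MS 1   /* first retry waits this long */
#define MAX_DELAY_MS  8   /* cap for the doubling delay */

struct retry_state {
	unsigned int failures;	/* consecutive transient errors seen so far */
};

/* The three status codes named in the thread as needing back-off. */
static bool is_transient_usb_error(int status)
{
	return status == -EPROTO || status == -EILSEQ || status == -ETIME;
}

/*
 * Decide what a completion handler should do with a finished URB.
 * Returns the delay in ms before resubmitting (0 means resubmit at once),
 * or -1 if the handler should stop resubmitting.
 */
static int next_resubmit_delay_ms(struct retry_state *st, int status)
{
	unsigned int delay;

	if (status == 0) {
		st->failures = 0;	/* success resets the back-off */
		return 0;
	}

	if (!is_transient_usb_error(status))
		return -1;		/* other errors: do not retry */

	if (++st->failures > RETRY_LIMIT)
		return -1;		/* too many errors in a row */

	/* Double the delay on each consecutive failure, up to the cap. */
	delay = BASE_DELAY_MS << (st->failures - 1);
	return delay > MAX_DELAY_MS ? MAX_DELAY_MS : (int)delay;
}
```

In a real driver the returned delay would presumably be applied with a timer
or delayed work item before calling usb_submit_urb() again, rather than
resubmitting straight from the completion handler. Whether such logic belongs
in each driver, in shared code, or in the HCD is exactly the open question in
this thread.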