Re: Highly critical bug in XHCI Controller

Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> · Mon, 18 Nov 2024 11:03:11 -0500

On Mon, Nov 18, 2024 at 01:14:09PM +0800, Markus Rechberger wrote:
> 
> 
> On Sun, 2024-11-17 at 16:02 -0500, Alan Stern wrote:
> > On Sun, Nov 17, 2024 at 11:47:52PM +0800, Markus Rechberger wrote:
> > > On Sun, 2024-11-17 at 10:18 -0500, Alan Stern wrote:
> > > > On Sun, Nov 17, 2024 at 08:44:16PM +0800, Markus Rechberger
> > > > wrote:
> > > > > Basically the issue comes from hub_port_connect.
> > > > > 
> > > > > drivers/usb/core/hub.c
> > > > > 
> > > > > hub_port_init returns -71 -EPROTO and jumps to loop
> > > > > https://github.com/torvalds/linux/blob/master/drivers/usb/core/hub.c#L5450
> > > > > 
> > > > > I'd question if usb_ep0_reinit is really required in loop which
> > > > > is
> > > > > running following functions:
> > 
> > > This should only go down there in case an error happened earlier
> > > no?
> > 
> > Of course, since -EPROTO indicates there was an error.
> 
> Alan, first of all thank you for your feedback.
> 
> Do you know any practical way how to test this?

Not any easy way, no.  Shorting out the signals like you do, but for 
only a few milliseconds, might work.

> > > My experience is just - reconnect the device in case an error
> > > happened
> > 
> > You can't reconnect some devices; they are permanently attached. 
> > There 
> > are other reasons why reconnecting might not be practical.
> 
> 
> I understand what you mean here but isn't it primarily about downsizing
> USB 2.0 -> USB 1.1?

No.  This has nothing to do with changing the speed of the connection.

> The next point would be if eg. an endpoint of a webcam downsizes to 1.1
> you don't have to expect a workable device because video would exceed
> the bandwidth easily.
> Do you know any practical example/hardware where this is really
> relevant?
> 
> The failure handling will call the xhci reset which is not meant to be
> run without fully initialized data structures. In my case the driver
> returns -EPROTO (while the device is already disconnected), then first
> of all it tries to reset the endpoint which in the upstream kernel
> causes the NULL PTR exeption. For XHCI this code stream logic is simply
> not meant to be applied in such an error.
> My first submitted patch will solve the issue but just papers a higher
> logic issue. 

I think it's fair to say that the logic error lies in the xhci-hcd 
driver, not in the hub driver.  xhci-hcd should be prepared for attempts 
to reset endpoint 0 before the device is fully initialized.  After all, 
initializing the device requires us to be able to communicate with it 
(to read its descriptors), which requires ep0 to be working properly.

> > In any case, it is generally recognized that the type of errors
> > leading 
> > to -EPROTO (bad CRC or no response, for instance) can be temporary or
> > intermittent.  Retrying is an appropriate strategy.

> > -EPROTO does not necessarily indicate the device is gone; it
> > indicates 
> > there was a problem communicating with the device.  The problem might
> > be 
> > caused by a disconnection, or it might be caused by temporary 
> > electromagnetic interference (for example), or by something else.
> 
> 
> a bad CRC response on a short cable -  something has gone very wrong
> then.

Just a bad CRC response; the cable length is irrelevant.  Another 
possiblity is a bit-stuffing error.  Yes, something has gone very badly 
wrong, but it may go right again if we wait a few milliseconds and 
retry.

I have seen reports from people who experienced problems of this sort 
caused by interference from a bad fluorescent light fixture or by 
electrical noise from a nearby fan.

> If there are temporary EMI issues ... I'd consider that an as an a
> absolute failure as well, they
> need to be handled in HW rather than a software workaround.
> 
> In my experience with USB anything that is a 'temporary' failure can be
> considered as 'permanent' failure and I've really seen a lot over the
> last 1 1/2 decades.
> However issues are mostly related to immature controllers / missing
> quirks for some controllers.

It's true that errors of this type are quite rare.  Nevertheless, 
retrying is a valid strategy for dealing with them and it doesn't hurt 
anything.

> Our devices in the field since 2008 usually pump around 100-300mbit
> through the USB 2 link,
> streaming  devices which usually run for a long period of time (up to
> months / years).
> 'retrying' something on a link where something has gone wrong for sure
> never worked properly for me, it would have continued with another
> followup issue at some point.

I also changed the ehci-hcd driver to retry multiple times after one of 
these errors, rather than using the strict "3 strikes and you're out" 
approach.  There were cases where that helped.

> I'm definitely in favor of telling the user if something went wrong
> either reconnect or submit a bug report for that particular device to
> add some quirk for special handling.
> 
> Anyway can you give a particular example where this 'retrying'
> mechanism and reloading the endpoint size solves or solved a problem?

I cannot.  The events referred to above occurred many years ago.  They 
were rare then, and they probably are even more rare now.

> Over the years people will certainly forget why this was implemented
> and take the code 'as is'
> and the next ones will just say it was always like that don't change.
> I don't mind keeping it as it is but it should be more clear / obvious
> why it's really done that
> way.

It seems obvious that what the code does is to retry when -EPROTO errors 
occur.  (Also -EILSEQ and -ETIME.)  And the reason for retrying should 
also be pretty obvious: There's a chance that the error will go away 
when the transfer is retried -- why else would you retry something?

This hardly seems like something we need to explain in a comment.  But 
if you want to submit a patch adding such a comment, please feel free to 
do so.

Alan Stern