Re: Highly critical bug in XHCI Controller

Markus Rechberger <linuxusb.ml@xxxxxxxxxx> · Mon, 18 Nov 2024 13:14:09 +0800

On Sun, 2024-11-17 at 16:02 -0500, Alan Stern wrote:
> On Sun, Nov 17, 2024 at 11:47:52PM +0800, Markus Rechberger wrote:
> > On Sun, 2024-11-17 at 10:18 -0500, Alan Stern wrote:
> > > On Sun, Nov 17, 2024 at 08:44:16PM +0800, Markus Rechberger
> > > wrote:
> > > > Basically the issue comes from hub_port_connect.
> > > > 
> > > > drivers/usb/core/hub.c
> > > > 
> > > > hub_port_init returns -71 -EPROTO and jumps to loop
> > > > https://github.com/torvalds/linux/blob/master/drivers/usb/core/hub.c#L5450
> > > > 
> > > > I'd question if usb_ep0_reinit is really required in loop which
> > > > is
> > > > running following functions:
> 
> > This should only go down there in case an error happened earlier
> > no?
> 
> Of course, since -EPROTO indicates there was an error.

Alan, first of all thank you for your feedback.

Do you know any practical way how to test this?

> 
> > In case the hardware signed up correctly it should not even enter
> > that
> > code.
> > 
> > My experience is just - reconnect the device in case an error
> > happened
> 
> You can't reconnect some devices; they are permanently attached. 
> There 
> are other reasons why reconnecting might not be practical.

I understand what you mean here but isn't it primarily about downsizing
USB 2.0 -> USB 1.1?
The next point would be if eg. an endpoint of a webcam downsizes to 1.1
you don't have to expect a workable device because video would exceed
the bandwidth easily.
Do you know any practical example/hardware where this is really
relevant?

The failure handling will call the xhci reset which is not meant to be
run without fully initialized data structures. In my case the driver
returns -EPROTO (while the device is already disconnected), then first
of all it tries to reset the endpoint which in the upstream kernel
causes the NULL PTR exeption. For XHCI this code stream logic is simply
not meant to be applied in such an error.
My first submitted patch will solve the issue but just papers a higher
logic issue. 

> 
> > those
> > workarounds did not work properly for the device I deal with (but
> > yes
> > that's
> > why I'm asking - maybe someone else has different hardware with
> > different
> > experience).
> 
> I doubt we will hear from these people, because they will not realize
> that anything was wrong.
> 
> In any case, it is generally recognized that the type of errors
> leading 
> to -EPROTO (bad CRC or no response, for instance) can be temporary or
> intermittent.  Retrying is an appropriate strategy.
> 
> > > > As a second issue I found in usb_reset_and_verify device 
> > > > https://github.com/torvalds/linux/blob/master/drivers/usb/core/hub.c#L6131
> > > > 
> > > >         ret = hub_port_init(parent_hub, udev, port1, i,
> > > > &descriptor);
> > > >         if (ret >= 0 || ret == -ENOTCONN || ret == -ENODEV) {
> > > >             break;
> > > >         }
> > > > 
> > > > hub_port_init can also return -71 / -EPROTO, the cases should
> > > > be
> > > > very
> > > > rare when usb_reset_and_verify_device is triggered and that
> > > > happens.
> > > 
> > > If that happens, the loop which this code sits inside will simply
> > > perform another iteration.  That's what  it's supposed to do, not
> > > an 
> > > issue at all.
> > > 
> > 
> > It doesn't cause any issue yes but it's not correct either.
> 
> Why not?  What's wrong with it?
> 
> > -EPROTO will be returned if I disconnect or short the device here.
> > So
> > initially someone
> > thought he should check for -ENODEV/-ENOTCONN (which should also
> > indicate that the device is gone), so -EPROTO should also be
> > checked in
> > that case.
> > Otherwise just remove all those error checks.
> 
> -EPROTO does not necessarily indicate the device is gone; it
> indicates 
> there was a problem communicating with the device.  The problem might
> be 
> caused by a disconnection, or it might be caused by temporary 
> electromagnetic interference (for example), or by something else.

a bad CRC response on a short cable -  something has gone very wrong
then.
If there are temporary EMI issues ... I'd consider that an as an a
absolute failure as well, they
need to be handled in HW rather than a software workaround.

In my experience with USB anything that is a 'temporary' failure can be
considered as 'permanent' failure and I've really seen a lot over the
last 1 1/2 decades.
However issues are mostly related to immature controllers / missing
quirks for some controllers.
Our devices in the field since 2008 usually pump around 100-300mbit
through the USB 2 link,
streaming  devices which usually run for a long period of time (up to
months / years).
'retrying' something on a link where something has gone wrong for sure
never worked properly for me, it would have continued with another
followup issue at some point.

I'm definitely in favor of telling the user if something went wrong
either reconnect or submit a bug report for that particular device to
add some quirk for special handling.

Anyway can you give a particular example where this 'retrying'
mechanism and reloading the endpoint size solves or solved a problem?
Over the years people will certainly forget why this was implemented
and take the code 'as is'
and the next ones will just say it was always like that don't change.
I don't mind keeping it as it is but it should be more clear / obvious
why it's really done that
way.

Markus
> 
> Alan Stern
>