On Sun, 2024-11-17 at 16:02 -0500, Alan Stern wrote: > On Sun, Nov 17, 2024 at 11:47:52PM +0800, Markus Rechberger wrote: > > On Sun, 2024-11-17 at 10:18 -0500, Alan Stern wrote: > > > On Sun, Nov 17, 2024 at 08:44:16PM +0800, Markus Rechberger > > > wrote: > > > > Basically the issue comes from hub_port_connect. > > > > > > > > drivers/usb/core/hub.c > > > > > > > > hub_port_init returns -71 -EPROTO and jumps to loop > > > > https://github.com/torvalds/linux/blob/master/drivers/usb/core/hub.c#L5450 > > > > > > > > I'd question if usb_ep0_reinit is really required in loop which > > > > is > > > > running following functions: > > > This should only go down there in case an error happened earlier > > no? > > Of course, since -EPROTO indicates there was an error. Alan, first of all thank you for your feedback. Do you know any practical way how to test this? > > > In case the hardware signed up correctly it should not even enter > > that > > code. > > > > My experience is just - reconnect the device in case an error > > happened > > You can't reconnect some devices; they are permanently attached. > There > are other reasons why reconnecting might not be practical. I understand what you mean here but isn't it primarily about downsizing USB 2.0 -> USB 1.1? The next point would be if eg. an endpoint of a webcam downsizes to 1.1 you don't have to expect a workable device because video would exceed the bandwidth easily. Do you know any practical example/hardware where this is really relevant? The failure handling will call the xhci reset which is not meant to be run without fully initialized data structures. In my case the driver returns -EPROTO (while the device is already disconnected), then first of all it tries to reset the endpoint which in the upstream kernel causes the NULL PTR exeption. For XHCI this code stream logic is simply not meant to be applied in such an error. My first submitted patch will solve the issue but just papers a higher logic issue. > > > those > > workarounds did not work properly for the device I deal with (but > > yes > > that's > > why I'm asking - maybe someone else has different hardware with > > different > > experience). > > I doubt we will hear from these people, because they will not realize > that anything was wrong. > > In any case, it is generally recognized that the type of errors > leading > to -EPROTO (bad CRC or no response, for instance) can be temporary or > intermittent. Retrying is an appropriate strategy. > > > > > As a second issue I found in usb_reset_and_verify device > > > > https://github.com/torvalds/linux/blob/master/drivers/usb/core/hub.c#L6131 > > > > > > > > ret = hub_port_init(parent_hub, udev, port1, i, > > > > &descriptor); > > > > if (ret >= 0 || ret == -ENOTCONN || ret == -ENODEV) { > > > > break; > > > > } > > > > > > > > hub_port_init can also return -71 / -EPROTO, the cases should > > > > be > > > > very > > > > rare when usb_reset_and_verify_device is triggered and that > > > > happens. > > > > > > If that happens, the loop which this code sits inside will simply > > > perform another iteration. That's what it's supposed to do, not > > > an > > > issue at all. > > > > > > > It doesn't cause any issue yes but it's not correct either. > > Why not? What's wrong with it? > > > -EPROTO will be returned if I disconnect or short the device here. > > So > > initially someone > > thought he should check for -ENODEV/-ENOTCONN (which should also > > indicate that the device is gone), so -EPROTO should also be > > checked in > > that case. > > Otherwise just remove all those error checks. > > -EPROTO does not necessarily indicate the device is gone; it > indicates > there was a problem communicating with the device. The problem might > be > caused by a disconnection, or it might be caused by temporary > electromagnetic interference (for example), or by something else. a bad CRC response on a short cable - something has gone very wrong then. If there are temporary EMI issues ... I'd consider that an as an a absolute failure as well, they need to be handled in HW rather than a software workaround. In my experience with USB anything that is a 'temporary' failure can be considered as 'permanent' failure and I've really seen a lot over the last 1 1/2 decades. However issues are mostly related to immature controllers / missing quirks for some controllers. Our devices in the field since 2008 usually pump around 100-300mbit through the USB 2 link, streaming devices which usually run for a long period of time (up to months / years). 'retrying' something on a link where something has gone wrong for sure never worked properly for me, it would have continued with another followup issue at some point. I'm definitely in favor of telling the user if something went wrong either reconnect or submit a bug report for that particular device to add some quirk for special handling. Anyway can you give a particular example where this 'retrying' mechanism and reloading the endpoint size solves or solved a problem? Over the years people will certainly forget why this was implemented and take the code 'as is' and the next ones will just say it was always like that don't change. I don't mind keeping it as it is but it should be more clear / obvious why it's really done that way. Markus > > Alan Stern >