> -----Original Message----- > From: Erkka Talvitie <erkka.talvitie@xxxxxxxxx> > Sent: keskiviikko 4. joulukuuta 2019 10.55 > To: 'Alan Stern' <stern@xxxxxxxxxxxxxxxxxxx> > Cc: gregkh@xxxxxxxxxxxxxxxxxxx; linux-usb@xxxxxxxxxxxxxxx; > claus.baumgartner@xxxxxxxxxx > Subject: RE: [RFCv1 1/1] USB: EHCI: Do not return -EPIPE when hub is > disconnected > > Thank you for the comments. > > > -----Original Message----- > > From: Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> > > Sent: tiistai 3. joulukuuta 2019 21.01 > > To: Erkka Talvitie <erkka.talvitie@xxxxxxxxx> > > Cc: gregkh@xxxxxxxxxxxxxxxxxxx; linux-usb@xxxxxxxxxxxxxxx; > > claus.baumgartner@xxxxxxxxxx > > Subject: RE: [RFCv1 1/1] USB: EHCI: Do not return -EPIPE when hub is > > disconnected > > > > On Tue, 3 Dec 2019, Erkka Talvitie wrote: > > > > > Thank you for the comments. > > > > > > > -----Original Message----- > > > > From: Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> > > > > Sent: maanantai 2. joulukuuta 2019 21.44 > > > > To: Erkka Talvitie <erkka.talvitie@xxxxxxxxx> > > > > Cc: gregkh@xxxxxxxxxxxxxxxxxxx; linux-usb@xxxxxxxxxxxxxxx; > > > > claus.baumgartner@xxxxxxxxxx > > > > Subject: Re: [RFCv1 1/1] USB: EHCI: Do not return -EPIPE when hub > > > > is disconnected > > > > > > > > On Fri, 29 Nov 2019, Erkka Talvitie wrote: > > > > > > > > > When disconnecting a USB hub that has some child device(s) > > > > > connected to it (such as a USB mouse), then the stack tries to > > > > > clear halt and reset device(s) which are _already_ physically > > disconnected. > > > > > > > > That behavior is understandable. The kernel doesn't know that the > > > > device has been disconnected until it can process the notification > > > > from an > > > upstream > > > > hub, and it can't process that notification while it's trying to > > > > reset the > > > device. > > > > > > > > > > Ok. I was thinking that in this use case , it should not be trying > > > to clear the halt (and reset the device when the clear halt fails). > > > And this behavior was altered by this RFC. > > > > Actually, the situation is a little different from what I described > > above. When you unplug the high-speed hub, the kernel doesn't know > > the hub has been disconnected until it receives a notification from > > the upstream hub. The kernel checks for those notifications at > > roughly 250-ms intervals, so it can take up to that long before the kernel > realizes the high-speed hub is gone. > > Until that time, the kernel will keep trying to reset and communicate > > with the hub and the devices that were attached to it. > > > > You can see this in the logs that you posted in your original report. > > In each case, the "cannot reset" and -71 errors lasted for less than > > 250 ms. > > > > I just tried doing the same experiment on my PC (which does use all > > Intel hardware and an EHCI controller). Here's the output from when I > > unplugged the high-speed hub: > > > > [ 6321.245528] usb 1-1.4: clear tt 4 (00a0) error -71 [ 6321.250903] > > usb 1-1.4- > > port4: cannot reset (err = -71) [ 6321.255155] usb 1-1.4-port4: cannot > > reset (err = -71) [ 6321.259403] usb 1-1.4-port4: cannot reset (err = > > -71) [ 6321.263657] usb 1-1.4-port4: cannot reset (err = -71) [ > > 6321.267905] usb 1- > > 1.4-port4: cannot reset (err = -71) [ 6321.267910] usb 1-1.4-port4: > > Cannot enable. Maybe the USB cable is bad? > > [ 6321.272155] usb 1-1.4-port4: cannot disable (err = -71) [ > > 6321.276405] usb > > 1-1.4-port4: cannot reset (err = -71) [ 6321.280653] usb 1-1.4-port4: > > cannot reset (err = -71) [ 6321.284905] usb 1-1.4-port4: cannot reset > > (err = -71) [ 6321.289155] usb 1-1.4-port4: cannot reset (err = -71) [ > > 6321.293403] usb 1- > > 1.4-port4: cannot reset (err = -71) [ 6321.293407] usb 1-1.4-port4: > > Cannot enable. Maybe the USB cable is bad? > > [ 6321.297656] usb 1-1.4-port4: cannot disable (err = -71) [ > > 6321.301904] usb > > 1-1.4-port4: cannot reset (err = -71) [ 6321.306152] usb 1-1.4-port4: > > cannot reset (err = -71) [ 6321.310402] usb 1-1.4-port4: cannot reset > > (err = -71) [ 6321.314653] usb 1-1.4-port4: cannot reset (err = -71) [ > > 6321.318904] usb 1- > > 1.4-port4: cannot reset (err = -71) [ 6321.318908] usb 1-1.4-port4: > > Cannot enable. Maybe the USB cable is bad? > > [ 6321.323154] usb 1-1.4-port4: cannot disable (err = -71) [ > > 6321.327404] usb > > 1-1.4-port4: cannot reset (err = -71) [ 6321.331651] usb 1-1.4-port4: > > cannot reset (err = -71) [ 6321.335902] usb 1-1.4-port4: cannot reset > > (err = -71) [ 6321.340155] usb 1-1.4-port4: cannot reset (err = -71) [ > > 6321.344402] usb 1- > > 1.4-port4: cannot reset (err = -71) [ 6321.344406] usb 1-1.4-port4: > > Cannot enable. Maybe the USB cable is bad? > > [ 6321.348652] usb 1-1.4-port4: cannot disable (err = -71) [ > > 6321.352904] usb > > 1-1.4-port4: cannot disable (err = -71) [ 6321.357154] hub 1-1.4:1.0: > > hub_ext_port_status failed (err = -71) [ 6321.437000] usb 1-1.4: USB > > disconnect, device number 9 [ 6321.437010] usb 1-1.4.4: USB > > disconnect, device number 10 > > > > As you can see, the time interval runs from 6321.245 to 6321.437, > > roughly 192 ms < 250 ms. This is the expected behavior. > > > > I did not try to check whether the MMF bit got set or what value CERR had. > > > > Ok, makes sense. Thank you for testing this with your setup. > > > > > > But as in this use case the CERR has not been decreased yet there is > > > error bit active (MMF) do you see it is still correct to interpret > > > it as a stall (even when the halt bit is set)? > > > > See below. > > > > > I have tried to find out more details about our EHCI controller > > > version, but I have only found out those CPU versions. It might help > > > in a search whether this could be a HW issue. > > > > > > > > > The specification [2] is not clear about error priorities, but > > > > > since there is no explicit error bit for the stall, it is > > > > > assumed to be lower priority than other errors. > > > > > > > > On the contrary, the specification is very clear. Since > > > > transaction > > > errors cause > > > > CERR to be decremented until it reaches 0, a nonzero value for > > > > CERR > > means > > > > the endpoint was halted for some other reason. And the only other > > reason > > > > is a stall. (Or end of the microframe, but there's no way to tell > > > > if that > > > > happened.) > > > > > > I see your point. EHCI specification states that babble is a serious > > > error and it will also cause the halt. The babble error bit is > > > checked first. But the specification does not say about order of the > > > other errors or about > > what > > > to do if there is an error, no retries executed, yet a halt (stall). > > > For example should the XactErr be checked before the MMF. > > > > I think the order doesn't matter. In fact, it's possible that both > > errors occurred, since the transaction gets retried multiple times. > > True as the error bits are "sticky". > > For example XactErr in EHCI specification [2] table 3-16: > " If the host controller sets this bit to a one, then it remains a one for the > duration of the transfer." > > > > > > >(Or end of the microframe, but there's no way to tell if that > > > >happened.) > > > > > > I was not able to locate this from the specification. Could you > > > please point out where this statement is from? > > > > "Enhanced Host Controller Interface Specification for Universal Serial > > Bus", rev 1.0 (2002), p. 110: > > > > Transaction Error (XactErr). Timeout, data CRC failure, etc. The CErr > > field is decremented and the XactErr bit in the Status field is set to > > a one. The complete split transaction is immediately retried (if Cerr > > is non-zero). If there is not enough time in the micro-frame to > > complete the retry and the endpoint is an IN, or CErr is decremented > > to a zero from a one, the queue is halted. If there is not enough time > > in the micro-frame to complete the retry and the endpoint is an OUT > > and CErr is not zero, then this state is exited (i.e. return to Do > > Start Split). This results in a retry of the entire OUT split > > transaction, at the next poll period. Refer to Chapter 11 Hubs > > (specifically the section full- and low-speed Interrupts) in the USB > > Specification Revision 2.0 for detailed requirements on why these > > errors must be immediately retried. • > > > > There it is, thank you. > > > > Could the way to tell if "end of microframe" happened, be what is > > > done > > here > > > - check for MMF error bit and if CERR has not been decreased? > > > > No, because the "end of microframe" situation happens when the host > > controller is handling a transaction error, whereas MMF gets set when > > the host controller detects an error on the host. > > Ok, I see. > > > > > > > > diff --git a/drivers/usb/host/ehci-q.c > > > > > b/drivers/usb/host/ehci-q.c index 3276304..da7fd12 100644 > > > > > --- a/drivers/usb/host/ehci-q.c > > > > > +++ b/drivers/usb/host/ehci-q.c > > > > > @@ -206,8 +206,9 @@ static int qtd_copy_status ( > > > > > if (token & QTD_STS_BABBLE) { > > > > > /* FIXME "must" disable babbling > > > > device's port too */ > > > > > status = -EOVERFLOW; > > > > > - /* CERR nonzero + halt --> stall */ > > > > > - } else if (QTD_CERR(token)) { > > > > > + /* CERR nonzero and less than > > > > EHCI_TUNE_CERR + halt --> stall. > > > > > + This handles situation where stall comes after > > > > an error. */ > > > > > > > > This comment doesn't make sense. Who cares whether a stall comes > > after > > > > an error or not? It's still a stall and should be reported. > > > > > > This was basically a comment trying to answer to this commit: > > > ba516de332c0 USB: EHCI: check for STALL before other errors > > > > > > "The existing code doesn't do this properly, because it tests for MMF > > > (Missed MicroFrame) and DBE (Data Buffer Error) before testing the > > > retry counter. Thus, if a transaction gets either MMF or DBE the > > > corresponding flag is set and the transaction is retried. If the > > > second attempt receives a STALL then -EPIPE is the correct return > > > value. But the existing code would see the MMF or DBE flag instead > > > and return -EPROTO, -ENOSR, or -ECOMM." > > > > > > The comment tries to explain that it will not revert the fix made in > > > the commit ba516de332c0. > > > > Okay, I get it. You're trying to rely on the strange behavior of the > > MMF bit. > > > > I'm not sure this is a good idea. Suppose MMF gets set for some other > > reason (a genuine error on the host) and then the transaction gets a > > STALL on the next retry. Since host errors don't decrement CERR, your > > patch would cause the driver to return -EPROTO instead of -EPIPE. > > Hmm. I originally understood from the documentation that the MMF does > decrement the CERR. I followed the EHCI specification table 3-16, the row > about the CERR. > There are listed the errors that do decrement the CERR and which do not > decrement. Data buffer error, stall, babble and no error (of course) do not > decrement. > Transaction errors do decrement the CERR. As there was no mention about > MMF I have been in the impression that it is included to the transaction > errors of the table. > > Now I was able to find from the documentation you pointed out (in your > next comment below) that your statement is correct. EHCI specification, > table 4-13 contains Interrupt IN / OUT conditions which prove your point. > > > > > > > > + } else if (QTD_CERR(token) && > > > > QTD_CERR(token) < EHCI_TUNE_CERR) { > > > > > status = -EPIPE; > > > > > > > > If an error occurs and the transaction is retried and the retry > > > > gets a > > > stall, then > > > > the final status should be -EPIPE, not something else. > > > > > > This is how the RFC also works. If the transaction has been retried > > > and gets stall then -EPIPE is returned. > > > Or if there are no errors but there is a stall then -EPIPE is returned. > > > > > > The only difference in this patch in comparison to the existing > > > implementation is that if there is an error but the transaction has > > > not been retried it is not interpret as a stall even if there is a > > > halt. > > > > Sometimes that will be the right behavior and other times it won't. > > However, it looks like there may be a way to tell which situation we > > are in. Setting the MMF bit will cause the queue to halt immediately > > if the transaction is IN, but not if it is OUT (see Table 4-13 in the > > EHCI spec). > > Thank you, this Table 4-13 was a vital piece of documentation that I either > missed or failed to read carefully enough. > Now this could explain why the queue is halted with MMF set, even if it > would not be a stall. Well done! > > > > > So if CERR == EHCI_TUNE_CERR and the QTD_PID != 1 (not IN) then we > > should return -EPIPE, as the existing code does. But if QTD_PID == 1 > > then the code should continue, as your patch does -- with one > > difference: Put the additional check for EHCI_TUNE_CERR between the > > tests for DBE and XACT instead of after XACT (because XACT would > > decrement CERR whereas DBE wouldn't). > > Good point, I agree. > > > > > Can you make that change and test it? > > Sure, I have made the change and test it as soon as possible. I am not actually totally sure if I understood you correctly, but I tested a change where the first stall check is like this (PID_CODE_IN is defined as 1): - } else if (QTD_CERR(token)) { + } else if (QTD_CERR(token) && (QTD_PID (token) != PID_CODE_IN)) { status = -EPIPE; And the second stall check (now between DBE and XACT): + } else if (QTD_CERR(token)) { + status = -EPIPE; Is this what you meant? Please correct me if I am wrong. Anyways with these changes the issue does not reproduce. > > > > > Alan Stern > > Erkka Talvitie > Erkka Talvitie