RE: [RFCv1 1/1] USB: EHCI: Do not return -EPIPE when hub is disconnected

"Erkka Talvitie" <erkka.talvitie@xxxxxxxxx> · Tue, 3 Dec 2019 12:54:21 +0200

> -----Original Message-----
> From: Erkka Talvitie <erkka.talvitie@xxxxxxxxx>
> Sent: tiistai 3. joulukuuta 2019 11.39
> To: 'Alan Stern' <stern@xxxxxxxxxxxxxxxxxxx>
> Cc: gregkh@xxxxxxxxxxxxxxxxxxx; linux-usb@xxxxxxxxxxxxxxx;
> claus.baumgartner@xxxxxxxxxx
> Subject: RE: [RFCv1 1/1] USB: EHCI: Do not return -EPIPE when hub is
> disconnected
> 
> Thank you for the comments.
> 
> > -----Original Message-----
> > From: Alan Stern <stern@xxxxxxxxxxxxxxxxxxx>
> > Sent: maanantai 2. joulukuuta 2019 21.44
> > To: Erkka Talvitie <erkka.talvitie@xxxxxxxxx>
> > Cc: gregkh@xxxxxxxxxxxxxxxxxxx; linux-usb@xxxxxxxxxxxxxxx;
> > claus.baumgartner@xxxxxxxxxx
> > Subject: Re: [RFCv1 1/1] USB: EHCI: Do not return -EPIPE when hub is
> > disconnected
> >
> > On Fri, 29 Nov 2019, Erkka Talvitie wrote:
> >
> > > When disconnecting a USB hub that has some child device(s) connected
> > > to it (such as a USB mouse), then the stack tries to clear halt and
> > > reset device(s) which are _already_ physically disconnected.
> >
> > That behavior is understandable.  The kernel doesn't know that the
> > device has been disconnected until it can process the notification
> > from an
> upstream
> > hub, and it can't process that notification while it's trying to reset
> > the
> device.
> >
> 
> Ok. I was thinking that in this use case , it should not be trying to
clear the halt
> (and reset the device when the clear halt fails). And this behavior was
altered
> by this RFC.
> 
> > > The issue has been reproduced with:
> > >
> > > CPU: IMX6D5EYM10AD or MCIMX6D5EYM10AE.
> > > SW: U-Boot 2019.07 and kernel 4.19.40.
> > >
> > > In this situation there will be error bit for MMF active yet the
> > > CERR equals EHCI_TUNE_CERR + halt.
> >
> > Why?  In general, setting the MMF bit does not cause the halt bit to
> > be
> set
> > (set Table 4-13 in the EHCI spec).  In fact, MMF refers to errors that
> occur on
> > the host, not bus errors caused by a disconnected device.
> 
> I do not know for sure why that happens. I was suspecting that there has
> been MMF error and a stall at the same time. And in this RFC it was
assumed
> that MMF is with greater priority than stall.
> The disconnecting of a hub with attached devices cause the MMF error bit
> set even though it is a host side error.
> 
> >
> > > Existing implementation
> > > interprets this as a stall [1] (chapter 8.4.5).
> >
> > That is the correct thing to do.  When a transaction error occurs
> > during a Complete-Split transaction, the host controller is supposed
> > to decrement
> the
> > CERR value, set the XACT bit, and retry the transaction unless the
> > CERR
> value
> > is 0 or there isn't enough time left in the microframe.
> >
> > The fact that you saw CERR equal to EHCI_TUNE_CERR and XACT clear
> > probably means that your EHCI hardware is not behaving properly.
> 
> If you refer to the XactErr  bit (Table 4-13 [2] )with the "XACT clear"
then
> unfortunately I did not check it's state ,so I am not sure if it was
clear.
> In this patch, like also in the existing implementation, the MMF bit is
checked
> first and since it is active in this situation the XactErr is not checked.
> 
> I will check this.
> 
> But as in this use case the CERR has not been decreased yet there is error
bit
> active (MMF) do you see it is still correct to interpret it as a stall
(even when
> the halt bit is set)?
> 
> I have tried to find out more details about our EHCI controller version,
but I
> have only found out those CPU versions. It might help in a search whether
> this could be a HW issue.
> 
> >
> > > Fix for the issue is at first to check for a stall that comes after
> > > an error (the CERR has been decreased).
> > >
> > > Then after that, check for other errors.
> > >
> > > And at last check for stall without other errors (the CERR equals
> > > EHCI_TUNE_CERR as stall does not decrease the CERR [2] (table 3-16)).
> > >
> > > What happens after the fix is that when disconnecting a hub with
> > > attached device(s) the situation is not interpret as a stall.
> > >
> > > The specification [2] is not clear about error priorities, but since
> > > there is no explicit error bit for the stall, it is assumed to be
> > > lower priority than other errors.
> >
> > On the contrary, the specification is very clear.  Since transaction
> errors cause
> > CERR to be decremented until it reaches 0, a nonzero value for CERR
> > means the endpoint was halted for some other reason.  And the only
> > other reason is a stall.  (Or end of the microframe, but there's no
> > way to tell if that
> > happened.)
> 
> I see your point. EHCI specification states that babble is a serious error
and it
> will also cause the halt. The babble error bit is checked first. But the
> specification does not say about order of the other errors or about what
to
> do if there is an error, no retries executed, yet a halt (stall). For
example
> should the XactErr be checked before the MMF.
> 
> >(Or end of the microframe, but there's no way to tell if that
> >happened.)
> 
> I was not able to locate this from the specification. Could you please
point
> out where this statement is from?
> Could the way to tell if "end of microframe" happened, be what is done
here
> - check for MMF error bit and if CERR has not been decreased?
> 
> >
> > > [1] https://www.usb.org/document-library/usb-20-specification,
> > > usb_20.pdf [2]
> > >
> >
> https://www.intel.com/content/dam/www/public/us/en/documents/techn
> > ical
> > > -specifications/ehci-specification-for-usb.pdf
> > >
> > > Signed-off-by: Erkka Talvitie <erkka.talvitie@xxxxxxxxx>
> >
> > Can you duplicate this behavior on a standard PC, say with an Intel
> > EHCI controller?
> 
> We tested with native Linux PC and the error did not reproduce. However I
> am not sure about the used host controller in that PC.
> I will check that or try to get a setup with Intel EHCI.

The PC where the issue did not reproduce was ThinkPad T480 with:

3c:00.0 USB controller: Intel Corporation Device 15c1 (rev 01)

[    4.229310] xhci_hcd 0000:3c:00.0: xHCI Host Controller
[    4.238578] ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
[    4.239754] ohci_hcd: USB 1.1 'Open' Host Controller (OHCI) Driver
[    4.240857] ehci-pci: EHCI PCI platform driver
[    4.241437] ohci-pci: OHCI PCI platform driver
[    4.243080] uhci_hcd: USB Universal Host Controller Interface driver 

> 
> >
> > >  drivers/usb/host/ehci-q.c | 9 +++++++--
> > >  1 file changed, 7 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/usb/host/ehci-q.c b/drivers/usb/host/ehci-q.c
> > > index 3276304..da7fd12 100644
> > > --- a/drivers/usb/host/ehci-q.c
> > > +++ b/drivers/usb/host/ehci-q.c
> > > @@ -206,8 +206,9 @@ static int qtd_copy_status (
> > >  		if (token & QTD_STS_BABBLE) {
> > >  			/* FIXME "must" disable babbling
> > device's port too */
> > >  			status = -EOVERFLOW;
> > > -		/* CERR nonzero + halt --> stall */
> > > -		} else if (QTD_CERR(token)) {
> > > +		/* CERR nonzero and less than
> > EHCI_TUNE_CERR + halt --> stall.
> > > +		   This handles situation where stall comes after
> > an error. */
> >
> > This comment doesn't make sense.  Who cares whether a stall comes
> > after an error or not?  It's still a stall and should be reported.
> 
> This was basically a comment trying to answer to this commit:
> ba516de332c0  USB: EHCI: check for STALL before other errors
> 
>     "The existing code doesn't do this properly, because it tests for MMF
>     (Missed MicroFrame) and DBE (Data Buffer Error) before testing the
>     retry counter.  Thus, if a transaction gets either MMF or DBE the
>     corresponding flag is set and the transaction is retried.  If the
>     second attempt receives a STALL then -EPIPE is the correct return
>     value.  But the existing code would see the MMF or DBE flag instead
>     and return -EPROTO, -ENOSR, or -ECOMM."
> 
> The comment tries to explain that it will not revert the fix made in the
> commit ba516de332c0.
> 
> 
> >
> > > +		} else if (QTD_CERR(token) &&
> > QTD_CERR(token) < EHCI_TUNE_CERR) {
> > >  			status = -EPIPE;
> >
> > If an error occurs and the transaction is retried and the retry gets a
> stall, then
> > the final status should be -EPIPE, not something else.
> 
> This is how the RFC also works. If the transaction has been retried and
gets
> stall then -EPIPE is returned.
> Or if there are no errors but there is a stall then -EPIPE is returned.
> 
> The only difference in this patch in comparison to the existing
> implementation is that if there is an error but the transaction has not
been
> retried it is not interpret as a stall even if there is a halt.
> 
> >
> > >  		/* In theory, more than one of the following bits
> > can be set @@
> > > -228,6 +229,10 @@ static int qtd_copy_status (
> > >
> > 	usb_pipeendpoint(urb->pipe),
> > >  				usb_pipein(urb-
> > >pipe) ? "in" : "out");
> > >  			status = -EPROTO;
> > > +		/* CERR equals EHCI_TUNE_CERR, no other
> > errors + halt --> stall.
> > > +		   This handles situation where stall comes
> > without error bits set.
> > > +*/
> >
> > If CERR is equal to EHCI_TUNE_CERR then no other errors could have
> > occurred (since any error will decrement CERR).  So why shouldn't this
> case
> > be included with the earlier case?
> 
> That is what I also understood from the EHCI specification. If there is an
error
> the CERR should decrease. Only babble, data buffer error and stall (or no
> error) will not decrement the CERR.
> However in this use case there is an error (MMF) but the CERR still equals
to
> the EHCI_TUNE_CERR.
> 
> So that's why the RFC separates these. This is the logic in the RFC:
> 
> 1. The first if handles the situation where the stall comes after there
has
> been an error AND a retry. CERR has been decreased. This is so that
> ba516de332c0 is not reverted.
> 2. The second if handles the situation where the halt has been caused by
the
> stall AND there are no other errors.
> 3. If there are errors + halt, but no retries executed (CERR equals
> EHCI_TUNE_CERR) the response here is to return error value according to
> the error bit, not returning EPIPE according to the stall.

I am using CERR here confusingly. It is not a retry counter, instead it is
an error counter.

> 
> >
> > > +		} else if (QTD_CERR(token)) {
> > > +			status = -EPIPE;
> > >  		} else {	/* unknown */
> > >  			status = -EPROTO;
> > >  		}
> >
> > Alan Stern
> 
> Erkka Talvitie
> 

Erkka Talvitie