Re: [RFC/PATCH v2] usb: host: xhci: issue 'Reset Endpoint' for !CONTROL EPs from finish_td()

Felipe Balbi <balbi@xxxxxx> · Wed, 12 Nov 2014 16:28:53 -0600

Hi,

On Wed, Nov 12, 2014 at 04:54:06PM -0500, Alan Stern wrote:

[...]

> > and the doorbell will never rung. But even if I drop EP_HALTED from the
> > check below and let EP doorbell be rung, nothing will happen because,
> > according to XHCI spec 1.0, we *must* first issue a Reset Endpoint
> > command to get Endpoint to move to EP_STATE_STOPPED, in which case,
> > ringing that EP's doorbell will actually cause a transfer to happen.
> > 
> > Right now, what happens is that second usb_submit_urb() does nothing and
> > the 10 second timer expires, causing the URB to be canceled and test
> > failing with -ETIMEDOUT.
> 
> Okay, I see.  What usbcore and usbtest expect to happen is this:
> 
>     (1)	An URB fails because the endpoint sent a STALL.  The completion
> 	routine is called with status -EPIPE.
> 
>     (2)	When the completion routine returns, the next URB in the
> 	endpoint's queue should get handled by the hardware.  If the
> 	endpoint is still halted, this URB should fail with -EPIPE
> 	just like the first.
> 
>     (3)	Etc.  Eventually the endpoint queue empties out or the class
> 	driver calls usb_clear_halt().

perfect :-)

> So (1) works as desired, but (2) doesn't work because the doorbell
> never gets rung.

right.

> And the easiest way to make (2) work is to issue a Reset Endpoint
> command.

There's one extra bit of information here, see below.

> (There are other, more complicated ways to get the same result.  For 
> instance, you could loop through the remaining queued URBs, giving them 
> back one by one with -EPIPE.  And each time an URB is submitted, you 
> could give it back right away.  But Reset Endpoint is simpler.)

IMO issuing a Reset Endpoint is the only correct way of implementing
this. See, there are two sides to usbtest + g_zero love relationship:

(a) it will help you test your UDC driver.
(b) it will help you test your HCD driver.

Currently, we have a bug with (b), but if iterate over the list of
submitted URBs we can create two problems:

(i) There's no way to synchronize class driver's usb_submit_urb() with
the exact time when we iterate over the list of pending URBs in
xhci-hcd. Which means that we could very well iterate over an empty list
and think everything was fine. Heck, this would even happen with
usbtest, note that simple_io() always waits 10 seconds for a transfer
completion before returning control to the caller.

(ii) We would fall into the possibility of not catching bugs with UDCs
running g_zero because HCD would just return early -EPIPE for us without
asking UDC to handle another transfer :-)

Because of these two cases, I think the *only* way to solve this is by
issuing a Reset Endpoint cmd so another token can be shifted onto the
data lines.

> In the patch, you talked about clearing the endpoint halt.  But that's 
> not what you want to do; you want to issue a Reset Endpoint command, 
> which affects only the host controller.  The endpoint's status in the 

this is exactly what xhci_cleanup_halted_endpoint() does. Sure, it's a
misnamer and should probably be renamed to either
xhci_reset_endpoint_cmd() or xhci_clear_endpoint_state_halted() to make
it clearer as to what's happening. But that can happen later, we don't
need to clean that up in order to fix the bug :-)

> peripheral device will remain unchanged -- no halt will be cleared.  
> That contributed to my confusion on reading the patch.

yeah, that got me for a while too. I had to keep reminding myself what
xhci_cleanup_halted_endpoint() was doing ;-)

> By the way, does the same sort of thing happen after a transfer
> error (such as a CRC mismatch)?  Does the xHCI controller change the 
> state to EP_STATE_HALTED?  Or does it instead go directly to 

There are a few conditions in which XHCI will change EP state to
EP_STATE_HALTED, one of them is a STALL token from the peripheral and
the others would be really error conditions: Babble, CRC error, etc.

The spec even has a note about it, which I quote:

	"
	A Halt condition or USB Transaction error detected on a USB pipe
	shall cause a Running Endpoint to transition to the Halted
	state. A Reset Endpoint Command shall be used to clear the Halt
	condition on the endpoint and transition the endpoint to the
	Stopped state. A Stop Endpoint Command received while an
	endpoint is in the Halted state shall have no effect and shall
	generate a Command Completion Event with the Completion Code set
	to Context State Error.
	"

> EP_STATE_STOPPED?  You probably want to treat that case and the STALL 
> case as similarly as possible.

There's already code to deal with that, take a look at
xhci_requires_manual_halt_cleanup() and its callers:

1754 static int xhci_requires_manual_halt_cleanup(struct xhci_hcd *xhci,
1755                 struct xhci_ep_ctx *ep_ctx,
1756                 unsigned int trb_comp_code)
1757 {
1758         /* TRB completion codes that may require a manual halt cleanup */
1759         if (trb_comp_code == COMP_TX_ERR ||
1760                         trb_comp_code == COMP_BABBLE ||
1761                         trb_comp_code == COMP_SPLIT_ERR)
1762                 /* The 0.96 spec says a babbling control endpoint
1763                  * is not halted. The 0.96 spec says it is.  Some HW
1764                  * claims to be 0.95 compliant, but it halts the control
1765                  * endpoint anyway.  Check if a babble halted the
1766                  * endpoint.
1767                  */
1768                 if ((ep_ctx->ep_info & cpu_to_le32(EP_STATE_MASK)) ==
1769                     cpu_to_le32(EP_STATE_HALTED))
1770                         return 1;
1771 
1772         return 0;
1773 }

In fact, there is a typo on that comment. It needs this patch:

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 4e8c3cf..a8bbacb 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -1759,7 +1759,7 @@ static int xhci_requires_manual_halt_cleanup(struct xhci_hcd *xhci,
 	if (trb_comp_code == COMP_TX_ERR ||
 			trb_comp_code == COMP_BABBLE ||
 			trb_comp_code == COMP_SPLIT_ERR)
-		/* The 0.96 spec says a babbling control endpoint
+		/* The 0.95 spec says a babbling control endpoint
 		 * is not halted. The 0.96 spec says it is.  Some HW
 		 * claims to be 0.95 compliant, but it halts the control
 		 * endpoint anyway.  Check if a babble halted the

> > > For instance, if an endpoint is halted then there's no reason for the
> > > controller to shift any USB tokens for it onto the data lines.  Doing
> > > so would just be a waste of bandwidth, since the response is bound to
> > > be another STALL.  And it doesn't matter that the peripheral has no
> > > means to STALL any follow-up iens, since the host controller already
> > > knows the endpoint is halted.
> > 
> > Now you're claiming that this is a bug on usbtest which has been in tree
> > for many, many years and is known to work with EHCI, MUSB and UHCI (at
> > least, probably dummy too), which is a different statement from previous
> > thread [1].
> 
> No, I simply failed to understood what you wanted to do.

alright.

> > > The comment in the patch talks about moving the dequeue pointer past
> > > the STALLed TD and then clearing the halt condition.  Moving the
> > > dequeue pointer is fine -- there's no other way to take control of the
> > > TD back from the hardware -- but why would you want to clear the halt?  
> > > The HCD isn't supposed to do that; the class driver is.
> > 
> > See what usbtest does. It wants to make sure that, even if we issue
> > several URBs for that endpoint, the function will always STALL. Sure,
> > it's a waste of bandwidth, but what's the probability that any class
> > driver will actually do this outside of a test environment ? I think
> > it's not up to the HCD to device and it should, rather, let the function
> > respond with the expected STALL again which will, once more, move the
> > endpoint back into EP_STATE_HALT.
> > 
> > The only thing we should be discussing here, is proper placement for
> > xhci_cleanup_halted_endpoint().
> 
> Right.  In theory you could do it any time up until the completion
> routine returns.  Doing it when you process the failed TD seems like a
> good choice -- advance the dequeue pointer and issue the command at the
> same time.

My concern here is that this will happen when the first usb_submit_urb()
completes. I wonder if moving this to when the following
usb_submit_urb() is about to start would be better ?

Something like below (it won't build, just trying to illustrate the
situation):

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 4e8c3cf..a1dac2c 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -2814,6 +2814,12 @@ static int prepare_ring(struct xhci_hcd *xhci, struct xhci_ring *ep_ring,
 		return -EINVAL;
 	case EP_STATE_HALTED:
 		xhci_dbg(xhci, "WARN halted endpoint, queueing URB anyway.\n");
+		if (!usb_endpoint_xfer_control(&td->urb->ep->desc)) {
+			xhci_dbg(xhci, "Resetting endpoint.\n");
+			xhci_cleanup_halted_endpoint(xhci,
+					slot_id, ep_index, ep_ring->stream_id,
+					td, event_trb);
+		}
 	case EP_STATE_STOPPED:
 	case EP_STATE_RUNNING:
 		break;


I think this has a higher probability of being correct. Class driver
might not queue any URB to a particular after the first Stall, so why
would be move the endpoint away from EP_STATE_HALTED prematurely ?

What do you think ?

-- 
balbi
Attachment:
signature.asc

Description: Digital signature