Re: [RFC/PATCH v2] usb: host: xhci: issue 'Reset Endpoint' for !CONTROL EPs from finish_td()

Felipe Balbi <balbi@xxxxxx> · Thu, 13 Nov 2014 09:11:25 -0600

Hi,

On Thu, Nov 13, 2014 at 01:16:34PM +0200, Mathias Nyman wrote:
> On 13.11.2014 00:28, Felipe Balbi wrote:
> > Hi,
> > 
> > On Wed, Nov 12, 2014 at 04:54:06PM -0500, Alan Stern wrote:
> > 
> > [...]
> > 
> >>> and the doorbell will never rung. But even if I drop EP_HALTED from the
> >>> check below and let EP doorbell be rung, nothing will happen because,
> >>> according to XHCI spec 1.0, we *must* first issue a Reset Endpoint
> >>> command to get Endpoint to move to EP_STATE_STOPPED, in which case,
> >>> ringing that EP's doorbell will actually cause a transfer to happen.
> >>>
> >>> Right now, what happens is that second usb_submit_urb() does nothing and
> >>> the 10 second timer expires, causing the URB to be canceled and test
> >>> failing with -ETIMEDOUT.
> >>
> >> Okay, I see.  What usbcore and usbtest expect to happen is this:
> >>
> >>     (1)	An URB fails because the endpoint sent a STALL.  The completion
> >> 	routine is called with status -EPIPE.
> >>
> >>     (2)	When the completion routine returns, the next URB in the
> >> 	endpoint's queue should get handled by the hardware.  If the
> >> 	endpoint is still halted, this URB should fail with -EPIPE
> >> 	just like the first.
> >>
> >>     (3)	Etc.  Eventually the endpoint queue empties out or the class
> >> 	driver calls usb_clear_halt().
> > 
> > perfect :-)
> > 
> >> So (1) works as desired, but (2) doesn't work because the doorbell
> >> never gets rung.
> > 
> > right.
> > 
> >> And the easiest way to make (2) work is to issue a Reset Endpoint
> >> command.
> > 
> > There's one extra bit of information here, see below.
> > 
> >> (There are other, more complicated ways to get the same result.  For 
> >> instance, you could loop through the remaining queued URBs, giving them 
> >> back one by one with -EPIPE.  And each time an URB is submitted, you 
> >> could give it back right away.  But Reset Endpoint is simpler.)
> > 
> > IMO issuing a Reset Endpoint is the only correct way of implementing
> > this. See, there are two sides to usbtest + g_zero love relationship:
> > 
> > (a) it will help you test your UDC driver.
> > (b) it will help you test your HCD driver.
> > 
> > Currently, we have a bug with (b), but if iterate over the list of
> > submitted URBs we can create two problems:
> > 
> > (i) There's no way to synchronize class driver's usb_submit_urb() with
> > the exact time when we iterate over the list of pending URBs in
> > xhci-hcd. Which means that we could very well iterate over an empty list
> > and think everything was fine. Heck, this would even happen with
> > usbtest, note that simple_io() always waits 10 seconds for a transfer
> > completion before returning control to the caller.
> > 
> > (ii) We would fall into the possibility of not catching bugs with UDCs
> > running g_zero because HCD would just return early -EPIPE for us without
> > asking UDC to handle another transfer :-)
> > 
> > Because of these two cases, I think the *only* way to solve this is by
> > issuing a Reset Endpoint cmd so another token can be shifted onto the
> > data lines.
> > 
> >> In the patch, you talked about clearing the endpoint halt.  But that's 
> >> not what you want to do; you want to issue a Reset Endpoint command, 
> >> which affects only the host controller.  The endpoint's status in the 
> > 
> > this is exactly what xhci_cleanup_halted_endpoint() does. Sure, it's a
> > misnamer and should probably be renamed to either
> > xhci_reset_endpoint_cmd() or xhci_clear_endpoint_state_halted() to make
> > it clearer as to what's happening. But that can happen later, we don't
> > need to clean that up in order to fix the bug :-)
> > 
> >> peripheral device will remain unchanged -- no halt will be cleared.  
> >> That contributed to my confusion on reading the patch.
> > 
> > yeah, that got me for a while too. I had to keep reminding myself what
> > xhci_cleanup_halted_endpoint() was doing ;-)
> > 
> >> By the way, does the same sort of thing happen after a transfer
> >> error (such as a CRC mismatch)?  Does the xHCI controller change the 
> >> state to EP_STATE_HALTED?  Or does it instead go directly to 
> > 
> > There are a few conditions in which XHCI will change EP state to
> > EP_STATE_HALTED, one of them is a STALL token from the peripheral and
> > the others would be really error conditions: Babble, CRC error, etc.
> > 
> > The spec even has a note about it, which I quote:
> > 
> > 	"
> > 	A Halt condition or USB Transaction error detected on a USB pipe
> > 	shall cause a Running Endpoint to transition to the Halted
> > 	state. A Reset Endpoint Command shall be used to clear the Halt
> > 	condition on the endpoint and transition the endpoint to the
> > 	Stopped state. A Stop Endpoint Command received while an
> > 	endpoint is in the Halted state shall have no effect and shall
> > 	generate a Command Completion Event with the Completion Code set
> > 	to Context State Error.
> > 	"
> > 
> >> EP_STATE_STOPPED?  You probably want to treat that case and the STALL 
> >> case as similarly as possible.
> > 
> > There's already code to deal with that, take a look at
> > xhci_requires_manual_halt_cleanup() and its callers:
> > 
> > 1754 static int xhci_requires_manual_halt_cleanup(struct xhci_hcd *xhci,
> > 1755                 struct xhci_ep_ctx *ep_ctx,
> > 1756                 unsigned int trb_comp_code)
> > 1757 {
> > 1758         /* TRB completion codes that may require a manual halt cleanup */
> > 1759         if (trb_comp_code == COMP_TX_ERR ||
> > 1760                         trb_comp_code == COMP_BABBLE ||
> > 1761                         trb_comp_code == COMP_SPLIT_ERR)
> > 1762                 /* The 0.96 spec says a babbling control endpoint
> > 1763                  * is not halted. The 0.96 spec says it is.  Some HW
> > 1764                  * claims to be 0.95 compliant, but it halts the control
> > 1765                  * endpoint anyway.  Check if a babble halted the
> > 1766                  * endpoint.
> > 1767                  */
> > 1768                 if ((ep_ctx->ep_info & cpu_to_le32(EP_STATE_MASK)) ==
> > 1769                     cpu_to_le32(EP_STATE_HALTED))
> > 1770                         return 1;
> > 1771 
> > 1772         return 0;
> > 1773 }
> > 
> > In fact, there is a typo on that comment. It needs this patch:
> > 
> > diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
> > index 4e8c3cf..a8bbacb 100644
> > --- a/drivers/usb/host/xhci-ring.c
> > +++ b/drivers/usb/host/xhci-ring.c
> > @@ -1759,7 +1759,7 @@ static int xhci_requires_manual_halt_cleanup(struct xhci_hcd *xhci,
> >  	if (trb_comp_code == COMP_TX_ERR ||
> >  			trb_comp_code == COMP_BABBLE ||
> >  			trb_comp_code == COMP_SPLIT_ERR)
> > -		/* The 0.96 spec says a babbling control endpoint
> > +		/* The 0.95 spec says a babbling control endpoint
> >  		 * is not halted. The 0.96 spec says it is.  Some HW
> >  		 * claims to be 0.95 compliant, but it halts the control
> >  		 * endpoint anyway.  Check if a babble halted the
> > 
> >>>> For instance, if an endpoint is halted then there's no reason for the
> >>>> controller to shift any USB tokens for it onto the data lines.  Doing
> >>>> so would just be a waste of bandwidth, since the response is bound to
> >>>> be another STALL.  And it doesn't matter that the peripheral has no
> >>>> means to STALL any follow-up iens, since the host controller already
> >>>> knows the endpoint is halted.
> >>>
> >>> Now you're claiming that this is a bug on usbtest which has been in tree
> >>> for many, many years and is known to work with EHCI, MUSB and UHCI (at
> >>> least, probably dummy too), which is a different statement from previous
> >>> thread [1].
> >>
> >> No, I simply failed to understood what you wanted to do.
> > 
> > alright.
> > 
> >>>> The comment in the patch talks about moving the dequeue pointer past
> >>>> the STALLed TD and then clearing the halt condition.  Moving the
> >>>> dequeue pointer is fine -- there's no other way to take control of the
> >>>> TD back from the hardware -- but why would you want to clear the halt?  
> >>>> The HCD isn't supposed to do that; the class driver is.
> >>>
> >>> See what usbtest does. It wants to make sure that, even if we issue
> >>> several URBs for that endpoint, the function will always STALL. Sure,
> >>> it's a waste of bandwidth, but what's the probability that any class
> >>> driver will actually do this outside of a test environment ? I think
> >>> it's not up to the HCD to device and it should, rather, let the function
> >>> respond with the expected STALL again which will, once more, move the
> >>> endpoint back into EP_STATE_HALT.
> >>>
> >>> The only thing we should be discussing here, is proper placement for
> >>> xhci_cleanup_halted_endpoint().
> >>
> >> Right.  In theory you could do it any time up until the completion
> >> routine returns.  Doing it when you process the failed TD seems like a
> >> good choice -- advance the dequeue pointer and issue the command at the
> >> same time.
> > 
> > My concern here is that this will happen when the first usb_submit_urb()
> > completes. I wonder if moving this to when the following
> > usb_submit_urb() is about to start would be better ?
> > 
> > Something like below (it won't build, just trying to illustrate the
> > situation):
> > 
> > diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
> > index 4e8c3cf..a1dac2c 100644
> > --- a/drivers/usb/host/xhci-ring.c
> > +++ b/drivers/usb/host/xhci-ring.c
> > @@ -2814,6 +2814,12 @@ static int prepare_ring(struct xhci_hcd *xhci, struct xhci_ring *ep_ring,
> >  		return -EINVAL;
> >  	case EP_STATE_HALTED:
> >  		xhci_dbg(xhci, "WARN halted endpoint, queueing URB anyway.\n");
> > +		if (!usb_endpoint_xfer_control(&td->urb->ep->desc)) {
> > +			xhci_dbg(xhci, "Resetting endpoint.\n");
> > +			xhci_cleanup_halted_endpoint(xhci,
> > +					slot_id, ep_index, ep_ring->stream_id,
> > +					td, event_trb);
> > +		}
> >  	case EP_STATE_STOPPED:
> >  	case EP_STATE_RUNNING:
> >  		break;
> > 
> > 
> > I think this has a higher probability of being correct. Class driver
> > might not queue any URB to a particular after the first Stall, so why
> > would be move the endpoint away from EP_STATE_HALTED prematurely ?
> > 
> > What do you think ?
> > 
> 
> Currently we queue a reset endpoint command from the .endpoint_reset
> callback in host, this is far too late and should be moved to when we
> get a STALL event.  
> 
> xhci needs to reset control endpoints on stall as well [1]
> 
> I got a testpatch for this, but the more I look into how we handle
> reset endpoint for clearing halts, stop endpoint for urb dequeue, and
> reset device, the more I notice that there are several other cases
> that needs fixing. testpatch for the halted ep is here:
> 
> https://git.kernel.org/cgit/linux/kernel/git/mnyman/xhci.git/commit/?h=ep_reset_halt_test&id=fe43d559e0816f65e5373e863a7da8062d311cd7

this works too. Can you just make sure to Cc: stable for v3.14+ or
older, if you know which is the earliest kernel that needs this fix.

You also want to update the comments on that function so they match
what's done now.

> It's hard to see from patch diff itself what it does, but basically we
> call xhci_cleanup_halted_endpoint() in finish_td() if the transfer
> event status is STALL, or a TX error that requires resetting the
> endpoint.
> 
> There are still issues with setting the dequeue pointer correctly
> after stop or reset endpoint, I think this is because we try to find
> the next TD based on a saved "stopped TD" value that might not valid
> anymore (i.e. a reset device in between reset endpoint and set dq
> pointer) this issue is seen with DVB tuners when changing channels:
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=75521

alright. Thanks

-- 
balbi
Attachment:
signature.asc

Description: Digital signature