Re: [RFC/PATCH v2] usb: host: xhci: issue 'Reset Endpoint' for !CONTROL EPs from finish_td()

Mathias Nyman <mathias.nyman@xxxxxxxxxxxxxxx> · Thu, 13 Nov 2014 17:30:44 +0200

Hi

On 13.11.2014 17:11, Felipe Balbi wrote:
> Hi,
> 
> On Thu, Nov 13, 2014 at 01:16:34PM +0200, Mathias Nyman wrote:
>> On 13.11.2014 00:28, Felipe Balbi wrote:
>>> Hi,
>>>
>>> On Wed, Nov 12, 2014 at 04:54:06PM -0500, Alan Stern wrote:
>>>
>>> [...]
>>>
>>>>> and the doorbell will never rung. But even if I drop EP_HALTED from the
>>>>> check below and let EP doorbell be rung, nothing will happen because,
>>>>> according to XHCI spec 1.0, we *must* first issue a Reset Endpoint
>>>>> command to get Endpoint to move to EP_STATE_STOPPED, in which case,
>>>>> ringing that EP's doorbell will actually cause a transfer to happen.
>>>>>
>>>>> Right now, what happens is that second usb_submit_urb() does nothing and
>>>>> the 10 second timer expires, causing the URB to be canceled and test
>>>>> failing with -ETIMEDOUT.
>>>>
>>>> Okay, I see.  What usbcore and usbtest expect to happen is this:
>>>>
>>>>     (1)	An URB fails because the endpoint sent a STALL.  The completion
>>>> 	routine is called with status -EPIPE.
>>>>
>>>>     (2)	When the completion routine returns, the next URB in the
>>>> 	endpoint's queue should get handled by the hardware.  If the
>>>> 	endpoint is still halted, this URB should fail with -EPIPE
>>>> 	just like the first.
>>>>
>>>>     (3)	Etc.  Eventually the endpoint queue empties out or the class
>>>> 	driver calls usb_clear_halt().
>>>
>>> perfect :-)
>>>
>>>> So (1) works as desired, but (2) doesn't work because the doorbell
>>>> never gets rung.
>>>
>>> right.
>>>
>>>> And the easiest way to make (2) work is to issue a Reset Endpoint
>>>> command.
>>>
>>> There's one extra bit of information here, see below.
>>>
>>>> (There are other, more complicated ways to get the same result.  For 
>>>> instance, you could loop through the remaining queued URBs, giving them 
>>>> back one by one with -EPIPE.  And each time an URB is submitted, you 
>>>> could give it back right away.  But Reset Endpoint is simpler.)
>>>
>>> IMO issuing a Reset Endpoint is the only correct way of implementing
>>> this. See, there are two sides to usbtest + g_zero love relationship:
>>>
>>> (a) it will help you test your UDC driver.
>>> (b) it will help you test your HCD driver.
>>>
>>> Currently, we have a bug with (b), but if iterate over the list of
>>> submitted URBs we can create two problems:
>>>
>>> (i) There's no way to synchronize class driver's usb_submit_urb() with
>>> the exact time when we iterate over the list of pending URBs in
>>> xhci-hcd. Which means that we could very well iterate over an empty list
>>> and think everything was fine. Heck, this would even happen with
>>> usbtest, note that simple_io() always waits 10 seconds for a transfer
>>> completion before returning control to the caller.
>>>
>>> (ii) We would fall into the possibility of not catching bugs with UDCs
>>> running g_zero because HCD would just return early -EPIPE for us without
>>> asking UDC to handle another transfer :-)
>>>
>>> Because of these two cases, I think the *only* way to solve this is by
>>> issuing a Reset Endpoint cmd so another token can be shifted onto the
>>> data lines.
>>>
>>>> In the patch, you talked about clearing the endpoint halt.  But that's 
>>>> not what you want to do; you want to issue a Reset Endpoint command, 
>>>> which affects only the host controller.  The endpoint's status in the 
>>>
>>> this is exactly what xhci_cleanup_halted_endpoint() does. Sure, it's a
>>> misnamer and should probably be renamed to either
>>> xhci_reset_endpoint_cmd() or xhci_clear_endpoint_state_halted() to make
>>> it clearer as to what's happening. But that can happen later, we don't
>>> need to clean that up in order to fix the bug :-)
>>>
>>>> peripheral device will remain unchanged -- no halt will be cleared.  
>>>> That contributed to my confusion on reading the patch.
>>>
>>> yeah, that got me for a while too. I had to keep reminding myself what
>>> xhci_cleanup_halted_endpoint() was doing ;-)
>>>
>>>> By the way, does the same sort of thing happen after a transfer
>>>> error (such as a CRC mismatch)?  Does the xHCI controller change the 
>>>> state to EP_STATE_HALTED?  Or does it instead go directly to 
>>>
>>> There are a few conditions in which XHCI will change EP state to
>>> EP_STATE_HALTED, one of them is a STALL token from the peripheral and
>>> the others would be really error conditions: Babble, CRC error, etc.
>>>
>>> The spec even has a note about it, which I quote:
>>>
>>> 	"
>>> 	A Halt condition or USB Transaction error detected on a USB pipe
>>> 	shall cause a Running Endpoint to transition to the Halted
>>> 	state. A Reset Endpoint Command shall be used to clear the Halt
>>> 	condition on the endpoint and transition the endpoint to the
>>> 	Stopped state. A Stop Endpoint Command received while an
>>> 	endpoint is in the Halted state shall have no effect and shall
>>> 	generate a Command Completion Event with the Completion Code set
>>> 	to Context State Error.
>>> 	"
>>>
>>>> EP_STATE_STOPPED?  You probably want to treat that case and the STALL 
>>>> case as similarly as possible.
>>>
>>> There's already code to deal with that, take a look at
>>> xhci_requires_manual_halt_cleanup() and its callers:
>>>
>>> 1754 static int xhci_requires_manual_halt_cleanup(struct xhci_hcd *xhci,
>>> 1755                 struct xhci_ep_ctx *ep_ctx,
>>> 1756                 unsigned int trb_comp_code)
>>> 1757 {
>>> 1758         /* TRB completion codes that may require a manual halt cleanup */
>>> 1759         if (trb_comp_code == COMP_TX_ERR ||
>>> 1760                         trb_comp_code == COMP_BABBLE ||
>>> 1761                         trb_comp_code == COMP_SPLIT_ERR)
>>> 1762                 /* The 0.96 spec says a babbling control endpoint
>>> 1763                  * is not halted. The 0.96 spec says it is.  Some HW
>>> 1764                  * claims to be 0.95 compliant, but it halts the control
>>> 1765                  * endpoint anyway.  Check if a babble halted the
>>> 1766                  * endpoint.
>>> 1767                  */
>>> 1768                 if ((ep_ctx->ep_info & cpu_to_le32(EP_STATE_MASK)) ==
>>> 1769                     cpu_to_le32(EP_STATE_HALTED))
>>> 1770                         return 1;
>>> 1771 
>>> 1772         return 0;
>>> 1773 }
>>>
>>> In fact, there is a typo on that comment. It needs this patch:
>>>
>>> diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
>>> index 4e8c3cf..a8bbacb 100644
>>> --- a/drivers/usb/host/xhci-ring.c
>>> +++ b/drivers/usb/host/xhci-ring.c
>>> @@ -1759,7 +1759,7 @@ static int xhci_requires_manual_halt_cleanup(struct xhci_hcd *xhci,
>>>  	if (trb_comp_code == COMP_TX_ERR ||
>>>  			trb_comp_code == COMP_BABBLE ||
>>>  			trb_comp_code == COMP_SPLIT_ERR)
>>> -		/* The 0.96 spec says a babbling control endpoint
>>> +		/* The 0.95 spec says a babbling control endpoint
>>>  		 * is not halted. The 0.96 spec says it is.  Some HW
>>>  		 * claims to be 0.95 compliant, but it halts the control
>>>  		 * endpoint anyway.  Check if a babble halted the
>>>
>>>>>> For instance, if an endpoint is halted then there's no reason for the
>>>>>> controller to shift any USB tokens for it onto the data lines.  Doing
>>>>>> so would just be a waste of bandwidth, since the response is bound to
>>>>>> be another STALL.  And it doesn't matter that the peripheral has no
>>>>>> means to STALL any follow-up iens, since the host controller already
>>>>>> knows the endpoint is halted.
>>>>>
>>>>> Now you're claiming that this is a bug on usbtest which has been in tree
>>>>> for many, many years and is known to work with EHCI, MUSB and UHCI (at
>>>>> least, probably dummy too), which is a different statement from previous
>>>>> thread [1].
>>>>
>>>> No, I simply failed to understood what you wanted to do.
>>>
>>> alright.
>>>
>>>>>> The comment in the patch talks about moving the dequeue pointer past
>>>>>> the STALLed TD and then clearing the halt condition.  Moving the
>>>>>> dequeue pointer is fine -- there's no other way to take control of the
>>>>>> TD back from the hardware -- but why would you want to clear the halt?  
>>>>>> The HCD isn't supposed to do that; the class driver is.
>>>>>
>>>>> See what usbtest does. It wants to make sure that, even if we issue
>>>>> several URBs for that endpoint, the function will always STALL. Sure,
>>>>> it's a waste of bandwidth, but what's the probability that any class
>>>>> driver will actually do this outside of a test environment ? I think
>>>>> it's not up to the HCD to device and it should, rather, let the function
>>>>> respond with the expected STALL again which will, once more, move the
>>>>> endpoint back into EP_STATE_HALT.
>>>>>
>>>>> The only thing we should be discussing here, is proper placement for
>>>>> xhci_cleanup_halted_endpoint().
>>>>
>>>> Right.  In theory you could do it any time up until the completion
>>>> routine returns.  Doing it when you process the failed TD seems like a
>>>> good choice -- advance the dequeue pointer and issue the command at the
>>>> same time.
>>>
>>> My concern here is that this will happen when the first usb_submit_urb()
>>> completes. I wonder if moving this to when the following
>>> usb_submit_urb() is about to start would be better ?
>>>
>>> Something like below (it won't build, just trying to illustrate the
>>> situation):
>>>
>>> diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
>>> index 4e8c3cf..a1dac2c 100644
>>> --- a/drivers/usb/host/xhci-ring.c
>>> +++ b/drivers/usb/host/xhci-ring.c
>>> @@ -2814,6 +2814,12 @@ static int prepare_ring(struct xhci_hcd *xhci, struct xhci_ring *ep_ring,
>>>  		return -EINVAL;
>>>  	case EP_STATE_HALTED:
>>>  		xhci_dbg(xhci, "WARN halted endpoint, queueing URB anyway.\n");
>>> +		if (!usb_endpoint_xfer_control(&td->urb->ep->desc)) {
>>> +			xhci_dbg(xhci, "Resetting endpoint.\n");
>>> +			xhci_cleanup_halted_endpoint(xhci,
>>> +					slot_id, ep_index, ep_ring->stream_id,
>>> +					td, event_trb);
>>> +		}
>>>  	case EP_STATE_STOPPED:
>>>  	case EP_STATE_RUNNING:
>>>  		break;
>>>
>>>
>>> I think this has a higher probability of being correct. Class driver
>>> might not queue any URB to a particular after the first Stall, so why
>>> would be move the endpoint away from EP_STATE_HALTED prematurely ?
>>>
>>> What do you think ?
>>>
>>
>> Currently we queue a reset endpoint command from the .endpoint_reset
>> callback in host, this is far too late and should be moved to when we
>> get a STALL event.  
>>
>> xhci needs to reset control endpoints on stall as well [1]
>>
>> I got a testpatch for this, but the more I look into how we handle
>> reset endpoint for clearing halts, stop endpoint for urb dequeue, and
>> reset device, the more I notice that there are several other cases
>> that needs fixing. testpatch for the halted ep is here:
>>
>> https://git.kernel.org/cgit/linux/kernel/git/mnyman/xhci.git/commit/?h=ep_reset_halt_test&id=fe43d559e0816f65e5373e863a7da8062d311cd7
> 
> this works too. Can you just make sure to Cc: stable for v3.14+ or
> older, if you know which is the earliest kernel that needs this fix.
> 
> You also want to update the comments on that function so they match
> what's done now.

Sure, will do. 
I'll remove the queueing of a reset endpoint command in the .endpoint_reset
callback as well.

The comment for xhci_endpoint_reset says its called in_interrupt context?
Does the usb core really call the .endpoint_reset in interrupt
context? I tried to follow the codepaths in the usb core but couln't figure it out  

-Mathias
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html