Re: [PATCH 1/1 v2] xHCI 1.0: Endpoint Not Enabled Error

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Alex, see my answers below.

Alan,

All of this is a lot of work for the xHCI driver, and it seems like
other host drivers also could suffer from active URBs being outstanding
at inopportune times due to buggy drivers.  Would it be possible for the
USB core to cancel any outstanding URBs before resetting a device,
changing a configuration or alternate interface setting, or deallocating
a device?  It seems like that work really needs to be done in a central
location.

On Mon, Jun 20, 2011 at 06:33:22PM +0800, Alex He wrote:
> Hi Sarah,
> 
> I have some questions about it. You know I don't know xhci driver well.
> Please correct me if I was wrong.
> 
> 
> On Sat, 2011-06-18 at 01:12 +0800, Sarah Sharp wrote:
> > Hi Alex,
> > 
> > I thought at first that this patch was correct, but looking at it
> > further, I think this patch will leak URBs.
> > 
> > On Wed, Jun 08, 2011 at 06:25:50PM +0800, Alex He wrote:
> > >       /* Don't ring the doorbell for this endpoint if there are
> > pending
> > >        * cancellations because we don't want to interrupt
> > processing.
> > > @@ -1995,6 +2005,10 @@ static int handle_tx_event(struct xhci_hcd
> > *xhci,
> > >       case COMP_BUFF_OVER:
> > >               xhci_warn(xhci, "WARN: buffer overrun event on
> > endpoint\n");
> > >               break;
> > > +     case COMP_EBADEP:
> > > +             xhci_warn(xhci, "WARN: Doorbell rung for disabled slot
> > "
> > > +                             "or endpoint\n");
> > > +             goto cleanup;
> > 
> > What happens if we've placed a TD on a disabled endpoint ring and rung
> > the doorbell?  If you go directly down to cleanup, ret will still be
> > zero, and the TD will remain on the TD list:
> 
> I saw the xhci_urb_enqueue() (call prepare_ring()) has checked the
> ep_state to prevent to add TD to ep_ring for DISABLED EP.

Yes, but there's a race condition between the hardware and software
state.  The configure endpoint command can race with URB submission.

Say a driver submits an URB for a bulk endpoint, and then immediately
requests a device reset.  The xHCI driver will see the endpoint state is
enabled, ring the doorbell, and then go on to process the device reset
request by issuing a Reset Device command.  Since bulk endpoints are
best effort, it's possible, although not very plausible that the host
controller could process the command before servicing the bulk endpoint
ring.  In that case, the host finds the endpoint is disabled and gives
us the Disabled Endpoint completion code.

I would have to think about whether such a race condition could actually
*happen* with the current code.  We had a discussion a while back that
drivers are not required to cancel URBs before requesting a device
reset.  Alan Stern amended the kernel documentation to say that they
should, but it's possible some driver could trigger this race condition.

> > The ring code will get stuck if you leave the TD on the td_list, and
> > the
> > driver will eventually ask you to cancel the URB, and then we get into
> > further issues with the cancellation code and disabled endpoints.  I
> > think you need to allow the code to try to find the TD and its URB, so
> > that it can be given back to the driver with an error return code.
> > Possibly -EPROTO?  We don't want to return -ENODEV, because the device
> > might still be there, just not that particular endpoint.
> > 
> > I think to fix this, you need to add handling for the error code to
> > the
> > various process_*_td functions that sets the status code (or the frame
> > status code for isochronous endpoints) to -EPROTO.
> > 
> The section 4.7 of the xHCI spec says that the Endpoint Not Enable Error
> Event TRB with TRB Pointer, TRB Transfer Length, Event Data(ED) fields
> set to '0'. It means that this Event is not interrelated with any TDs
> directly. So we don't know the factual situation of the ep_ring, i.e.
> which URB of the TD can be used to return -EPROTO? Can it be the first
> TD in the ep_ring->td_list or all TDs in the ep_ring->td_list even they
> hasn't be transfered?

I would say if the endpoint is still present at all, we need to give
back -ENOENT for all the TDs on the td_list.  But after thinking about
this further, I'm not sure it's appropriate to giveback those URBs in
the ring handling code.  I would have to think about it really hard to
figure out if there are any race conditions there, since we couldn't
actually submit a stop endpoint command to take the TDs off the ring.

The only way an endpoint can transition to the disabled state is if the
driver has submitted a Configure Endpoint, a Reset Device, or a Disable
Slot command.  I think any URBs left on the endpoint rings should be
handled in the completions for those commands (and a big fat warning
should be printed if URBs are active for those rings).

When endpoints are dropped or changed by the Configure Endpoint command,
the xHCI driver installs a new endpoint ring and calls
xhci_free_or_cache_endpoint_ring().  You should can give back any URBs
in the td_list with -ENOENT in there. You'll need to both giveback the
URB, and ensure proper ref counting for any isochronous endpoints
(decrementing bandwidth_isoc_reqs with xhci->lock held, since it isn't
an atomic variable) like xhci_giveback_urb_in_irq() does.

You can't actually use that function, since it assumes you're in an
interrupt context and uses spin_lock() instead of spin_lock_irqsave(),
and the bandwidth function runs in process context.  But you could
refactor the function a bit so that it calls a different function that
has the option of using either interrupt or process type locking.  That
function would have the bulk of the code in xhci_giveback_urb_in_irq(),
with an if statement when it comes to giveback the URB.

Once a Reset Device command completes, the driver also calls
xhci_free_or_cache_endpoint_ring().  We also handle the case where the
endpoint had streams enabled, so xhci_free_stream_info() probably needs
to be modified to giveback any URBs that are on each of the stream
ring's td_lists.

The Disable Slot command is a bit harder to handle, since it completes
only in interrupt context, and calls xhci_free_virt_device(), which is
called from both interrupt and process context.  Probably you should
giveback any URBs before submitting the command, since that command
can't fail unless we have already disabled the slot.

Sarah Sharp
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [Linux Media]     [Linux Input]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Old Linux USB Devel Archive]

  Powered by Linux