Re: XHCI: Ring expansion failure

Sarah Sharp <sarah.a.sharp@xxxxxxxxxxxxxxx> · Wed, 18 Dec 2013 11:07:39 -0800

On Wed, Nov 06, 2013 at 11:09:50PM -0000, hemantk@xxxxxxxxxxxxxx wrote:
> >> On Wed, Nov 6, 2013 at 12:33 PM,  <hemantk@xxxxxxxxxxxxxx> wrote:
> >>>> On Wed, Nov 6, 2013 at 12:53 AM,  <hemantk@xxxxxxxxxxxxxx> wrote:
> >>>>> Hi

Hi Hemant,

> >>>>> By performing iterative port suspend and resume (which results
> >>>>> in function suspend and resume), ring expansion failure is
> >>>>> observed. Attached device has multiple interfaces for which
> >>>>> interface host drivers are unlinking the urbs during function
> >>>>> suspend and submitting urbs during resume.
> >>>>>
> >>>>> For the cases when dequeue ptr is close to the end of deq
> >>>>> segment, value of num_trbs_in_deq_seg (= ring->dequeue -
> >>>>> ring->deq_seg->trbs) becomes high when function suspend is over.
> >>>>> At the time of resume this high value is causing to trigger ring
> >>>>> expansion in the middle of submitting urbs(ring->num_trbs_free <
> >>>>> num_trbs + num_trbs_in_deq_seg).  All the interface host drivers
> >>>>> only request one TRB per urb (num_trb =1).  Ring expansion logic
> >>>>> is doubling the num of previously allocated segments every time.
> >>>>> This is causing the num of segments to grow at fast pace (at the
> >>>>> time of failure, one of the ep rings num_seg was 128), even
> >>>>> though it is also increasing num_trbs_free.

Sorry for taking so long to respond to this.  It's definitely something
we need to look into.

Which type of USB device caused this out-of-control ring expansion?  Can
you provide us a link for where to buy it, if it's a specific device?
It would make it easier for us to test this.

> >>>> Sounds like a bug. Why num_trbs_in_deq_seg becomes high? How many urbs
> >>>> are submitted during resume? Normally only isoc transfer triggers ring
> >>>> expansion with multiple trbs per urb.
> >>>
> >>> Let me explain it little further what is being observed in terms
> >>> of enq , deq ptr:-
> >>>
> >>> For example
> >>> 1)      Let's say First seg of the ring starts with 0xed0d3000
> >>> which means enq ptr and deq ptr are pointing to this address.
> >>> 2)      Interface driver queuing 10 urbs (one urb represents one
> >>> trb) enq ptr goes to ed0d30a0.
> >>> 3)      Interface suspend happened. All 10 urbs are going to get
> >>> unlinked. For the first urb to be unlinked, it is added to
> >>> ep->cancelled_td_list  and Stop ep cmd was called (rest of the
> >>> urbs added to ep->cancelled_td_list before stop ep cmd completed).
> >>> 4)      When stop ep cmd is completed, for 9 urbs td_to_noop() is
> >>> called.  For the first urb (for which stop ep cmd was called),
> >>> xhci_find_new_dequeue_state() gets called which updates deq ptr to
> >>> next trb (0x ed0d3010) so set tr deq ptr cmd is issued to xhci to
> >>> set its deq ptr to next trb(0xed0d3010).
> >>> 5)      In set tr deq cmd completion handler
> >>> update_ring_for_set_deq_completion() gets called which will just
> >>> increments num_trbs_free to only one, but we unlinked all the
> >>> urbs. So when interface suspend finished we have enq ptr pointing
> >>> to 0x ed0d30a0 and dep ptr pointing to 0x ed0d3010. With
> >>> num_trbs_free only incremented by one.
> >>> 6)      At the time of interface resume we re-queue 10 urbs back
> >>> with num_trbs_free (just incremented 1 during suspend). This means
> >>> num_trbs_free is incrementing very slowly every time interface
> >>> suspend happens. And it is very easy to get into a situation where
> >>> room_on_ring() returning 0 as deq ptr will be reaching to the end
> >>> of the seg and causing num_trbs_in_deq_seg to go high enough that
> >>> ring->num_trbs_free < num_trbs + num_trbs_in_deq_seg condition
> >>> becomes true.
> >>>
> >>
> >> So it's because the suspend/resume loops and urbs are enqueued and
> >> unlinked, deq_ptr does not follow enq_ptr. This is expected since
> >> deq_ptr and num_trbs_free increase in inc_deq(), which is called by
> >> irq handler. As there is no transfer on this ring, deq_ptr will not
> >> get updated.

Yep, I think this is the root cause.  If we're just queuing and then
canceling URBs, the no-op TRBs get executed, but we don't get an event
back for them, so the dequeue pointer remains on the first canceled TRB.

> > Considering the suspend resume scenario, with this behavior don't you
> > think ring expansion will be triggered very often and for no reason (as we
> > are just doing suspend and resume).
> >
> > In handle_stopped_endpoint() when traversing to ep->cancelled_td_list is
> > it better to call xhci_find_new_dequeue_state() for last td and call
> > td_to_noop() for all previous tds. Doing that we will update deq ptr to
> > set to enq ptr with correct value of num_trbs_free when interface suspends
> > finishes.

There's no guarantee that the last URB queued for cancellation is the
last TD physically in the ring.  The URBs could be queued for
cancellation in any order, and thus go on to cancelled_td_list in any
order.

We would also have issues using this solution for endpoints with streams
enabled, where there are multiple stream rings associated with one
endpoint.  The current code assumes that the host controller has stopped
on one particular endpoint ring, which may need a Set TR dequeue
command, and other rings will only have TDs turned into no-op TRBs.
Otherwise the current code that stores the dequeue in a local pointer
and issues one command to move the dequeue pointer won't work.

> i think instead of what suggested above, we need to somehow update
> num_trbs_free to the correct value any time before interface driver is
> going to submit the first urb as a result of interface resume. As after
> resume xhci is going to skip no op trbs and starts fetching the transfer
> trbs on the ring.

That doesn't help the case where the driver is canceling URBs as part of
normal operation, not suspend and resume.  I would rather fix this for
the generic case than focus on suspend in particular.

> >> I'm still curious about the failure of ring expansion. 128 segments
> >> occupy 128KB, which should not be a big deal. Ring expansion is
> >> expected to run until there is no memory to allocate. Does it fail
> >> because of allocation failure?
> > yes, xhci_ring_expansion() returns -ENOMEM. There are 19 rings allocated.
> >>>
> >>>>> Even for the cases where we observe data transfer as well as
> >>>>> suspend & resume (which are normal scenarios) ring expansion
> >>>>> failure is easily observed due to large number of segments being
> >>>>> allocated for ever.  This means triggering of ring expansion
> >>>>> will result in failure of further expansion at some point of
> >>>>> time.
> >>>>>
> >>>>> I was trying to think of some options to reduce ring expansion
> >>>>> with respect to suspend resume scenario: -
> >>>>> 1)      Can we reset enq and deq ptr to point to ep_ring first
> >>>>> seg once all the urbs are unlinked (function suspend)?

You would have to re-initialize the rings as well.  Otherwise past TDs
would be executed.  But again, I would like to address this issue
generically, and not make some special case for suspend.

> >>>>> 2)      Can we free the segments at the time of suspend and
> >>>>> allocate segments back when resume happens?
> >>>>>
> >>>>> In general, as per the spec can we also implement shrinking of
> >>>>> transfer
> >>>>> ring (4.9.2.4).
> >>>>>
> >>>>> Please help to provide your comments on the options above (which can
> >>>>> easily fit in to the current design) also feel free to add better
> >>>>> options
> >>>>> to consider to avoid ring expansion failures which can be added
> >>>>> easily
> >>>>> to
> >>>>> the current implementation.

I think there's a couple of ways we could fix this.

One would be to set the IOC flag on the last no-op TRBs in a TD we're
trying to cancel.  I think that will make the event handling code
increment the dequeue pointer and update the number of free TRBs on the
ring (or the code could be changed to do so).  That would take care of
the num_trbs_free counting issue for URB cancellation that isn't part of
device suspend.

A second option would be to make the cancellation code smarter, and move
the dequeue command past the last canceled TRB that was turned into a
no-op TRB.  The code would have to be smart to handle cases where
there's no-op TRBs interspersed with transfer TRBs.  For example, when
the driver queues URBs 1, 2, and 3 to the ring, but only cancels URBs 1
and 3.  Possibly once all TDs are turned into no-ops, we could just walk
the ring from the dequeue pointer, stop on the first normal TRB, and
then set the dequeue pointer to that.  The code would have to make sure
to do this only for endpoints that aren't enabled for streams.

A third option would be to implement some new code as a host controller
callback for device suspend/resume.  Right now, we really should be
issuing a Stop Endpoint command with the suspend bit set for each device
in the tree, but we only do so for roothub ports when the bus is
suspended.  We know all URBs need to be canceled before the device is
suspended, so we could do something like issue a Stop Endpoint command
for each endpoint ring, and then shrink the ring back down to two
segments and re-initialize them.  The device resume hook would need to
issue a Set TR dequeue pointer for the rings.  The code would have to
handle stream endpoint rings as well, which makes it much more complex.

Option 1 is the simplest to implement, I think.  However, it does mean
that the host controller has to generate a lot of events, which some
hosts may have issues with.

Option 2 is medium level complexity, but it has a nice side benefit.
Right now, no-op TRBs on a periodic endpoint mean the host should skip
that service interval.  Since the endpoint ring is still running when
the device is suspended, I think that means the host will skip those
intervals while the device is suspended.  However, if the endpoint has a
long service interval with a short suspend time, we could potentially
see the host skip some service intervals on resume before getting to the
next normal TRB.

With Option 2 in place, the dequeue pointer will be moved past all 10
canceled TRBs, and no no-op TRBs will be unnecessarily generated.  So I
think I like option 2 better.

While we eventually need to implement the device suspend hook to issue
the Stop Endpoint command with the suspend bit set, I think option 3 is
probably too complex to consider.

Are there other ways we could solve this?

Sarah Sharp
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html