RE: [PATCH][RFC] xhci fixes

"Xu, Andiry" <Andiry.Xu@xxxxxxx> · Wed, 18 Aug 2010 10:19:02 +0800

Hi Sarah,

When a missed service interval event is encountered, we just set the
ep->skip flag, increase the event ring dequeue pointer, and then return.
We will enter the do-while loop and process missed tds the next time we
encounter a transfer event from this endpoint.

If the next event after a missed service interval isn't for that
endpoint, I think is OK. This endpoint does not have its skip flag set,
and the do-while loop will run only once and then returns. If this
endpoint's skip flag is also set, then we'll process its missed tds,
since it must have a missed service interval event some time ago.

If we get two missed service intervals for the same endpoint in a row,
it will be two cases: the two missed service interval events are
consecutive, or not. If they are consecutive, we just return and process
the missed tds next time when it's not a missed service interval event;
if they are not consecutive and there's a transfer event between them,
we should process the missed tds for twice.

If a missed service interval is followed by a Ring Underrun/overrun
event for that isoc ring, there may be problems. In this case the
handle_tx_event() should return. I think ep->skip flag should get
cleared when encounter a Ring Underrun/overrun event. Maybe this causes
Paul's issue?

If Paul can print his event ring when the infinite loop case happens,
that may help to find the actual case.

Thanks & Best regards,
Andiry

> -----Original Message-----
> From: Sarah Sharp [mailto:sarah.a.sharp@xxxxxxxxxxxxxxx]
> Sent: Wednesday, August 18, 2010 8:07 AM
> To: Paul Zimmerman
> Cc: linux-usb@xxxxxxxxxxxxxxx; Xu, Andiry
> Subject: Re: [PATCH][RFC] xhci fixes
> 
> Hi Andiry,
> 
> I have a couple questions about your isoc patches.  I think the
do-while
> loop in handle_tx_event() when a missed service interval is
encountered
> is trying to walk the event ring to find the next isoc TD that was
> actually completed.  It doesn't touch the event ring dequeue pointer
> (except to move it past the missed service interval event).
> 
> Now, what if the next event after a missed service interval isn't for
> that endpoint?  What if you get two missed service intervals for the
> same endpoint in a row?  Or a missed service interval followed by a
Ring
> Underrun/overrun event for that isoc ring?  I think the code may have
> bugs in those cases, although I'm not sure if those are the exact bugs
> that Paul is running into.  (Paul, I'll send you a debugging patch
> shortly.)
> 
> I think the while condition at the end of the do-while loop is
> incorrect:
>         } while (ep->skip && trb_comp_code != COMP_MISSED_INT);
> 
> trb_comp_code is only assigned once before the do-while loop, so it
will
> never change to break out of the loop once we hit a missed service
> interval event.  Only the clearing of ep->skip can make the loop stop.
> But if the next event after the missed service interval isn't a
transfer
> event for that endpoint ring, then the loop will run indefinitely.
> 
> Now, Paul's patch attempted to break out of the loop and return when
> handle_tx_event() found a null event_seg when ep->skip was set for an
> isochronous endpoint.  It's possible the event ring and the endpoint
> ring eventually synched up, and the isoc TDs that were skipped when
the
> handle_tx_event() function originally returned were never given back
to
> the driver.  That might explain the null dequeue segment pointer after
> the driver attempted to cancel the isochronous URBs.  I'm still
looking
> at the code to figure out if it's possible though.
> 
> Sarah Sharp
> 
> On Mon, Aug 16, 2010 at 06:59:49PM -0700, Paul Zimmerman wrote:
> > > From: Sarah Sharp [mailto:sarah.a.sharp@xxxxxxxxxxxxxxx]
> >
> > > First, Andiry already sent a patch to update the dequeue pointer
in
> > > process_isoc_td().  That just got added to the master branch a
couple
> > > days ago.  So you should take that bit of code out of this patch.
> >
> > Will do.
> >
> > > Those BUG() calls were there for a reason.  Does the system work
> without
> > > them?
> >
> > Yes, it works without them, at least with the Synopsys controller. I
> also tested
> > a bit with the NEC controller and didn't see any problems.
> >
> > According to Linus, drivers should do WARN_ON() instead of BUG() in
> interrupt
> > context, unless it's absolutely impossible to continue. Sorry, I
don't
> have a
> > reference to that email handy.
> >
> > > Can you break out the fixes for uninitialized variables into a
> separate
> > > patch?
> >
> > Will do.
> >
> > > Please break this into a separate patch.  I see that bits 10:0 are
> > > supposed to be the maximum packet size, which should make a
bitmask of
> > > 0x7ff, not 0x3ff.  Thanks for catching that.
> >
> > Will do.
> >
> > > > @@ -131,7 +131,7 @@ static void next_trb(struct xhci_hcd *xhci,
> > > >  		*seg = (*seg)->next;
> > > >  		*trb = ((*seg)->trbs);
> > > >  	} else {
> > > > -		*trb = (*trb)++;
> > > > +		(*trb)++;
> > >
> > > This part of the patch is from John's compiler fix patch, correct?
> It's
> > > already been sent to Greg, but hasn't made it into the master
branch
> yet
> > > because Greg hasn't queued it.
> >
> > Yeah, I'll drop that too. By the way, that patch should probably go
to
> > -stable too, since it's undefined behavior in C, if I'm not
mistaken.
> >
> > > > @@ -474,8 +474,11 @@ void xhci_find_new_dequeue_state(struct
> xhci_hcd *xhci,
> > > >  	state->new_deq_seg = find_trb_seg(cur_td->start_seg,
> > > >  			dev->eps[ep_index].stopped_trb,
> > > >  			&state->new_cycle_state);
> > > > -	if (!state->new_deq_seg)
> > > > -		BUG();
> > > > +	if (!state->new_deq_seg) {
> > > > +		WARN_ON(1);
> > > > +		return;
> > > > +	}
> > > > +
> > >
> > > This is not a very good solution.  This function is called when
the
> > > driver wants to cancel a transfer, and the host is in the middle
of
> the
> > > transfer.  If you just return there, then the transfer will never
get
> > > taken off the hardware queue, and the hardware could potentially
> access
> > > freed (or reused) memory.  Or the transfer will never get
cancelled.
> > > Really bad, either way.
> > >
> > > I'd like to figure out why exactly find_trb_seg() is returning
NULL.
> Do
> > > you have any ideas?  If not, I'll send you a debugging patch to
get
> more
> > > info.
> >
> > Sorry, I don't have any ideas. That part of the xhci driver is black
> magic
> > to me. A debugging patch would be a good idea, I think.
> >
> > > > @@ -1551,6 +1556,10 @@ static int process_isoc_td(struct
xhci_hcd
> *xhci, struct xhci_td *td,
> > > >  	/* calc actual length */
> > > >  	if (ep->skip) {
> > > >  		td->urb->iso_frame_desc[idx].actual_length = 0;
> > > > +		/* Update ring dequeue pointer */
> > > > +		while (ep_ring->dequeue != td->last_trb)
> > > > +			inc_deq(xhci, ep_ring, false);
> > > > +		inc_deq(xhci, ep_ring, false);
> > > >  		return finish_td(xhci, td, event_trb, event, ep,
status,
> true);
> > > >  	}
> > >
> > > As I said, this bit has been sent off to Greg.
> >
> > Will drop it.
> >
> > > > @@ -1839,7 +1850,6 @@ static int handle_tx_event(struct xhci_hcd
> *xhci,
> > > >  				xhci_dbg(xhci, "td_list is empty
while skip
> "
> > > >  						"flag set. Clear
skip flag.\n");
> > > >  			}
> > > > -			ret = 0;
> > > >  			goto cleanup;
> > > >  		}
> > >
> > > This bit should be in a separate patch.
> >
> > Will do.
> >
> > >
> > > > @@ -1874,20 +1884,23 @@ static int handle_tx_event(struct
xhci_hcd
> *xhci,
> > > >  						"Skip it\n");
> > > >  				goto cleanup;
> > > >  			}
> > > > -		}
> > > >
> > > > -		/* Now update the urb's actual_length and give
back to
> > > > -		 * the core
> > > > -		 */
> > > > -		if
(usb_endpoint_xfer_control(&td->urb->ep->desc))
> > > > -			ret = process_ctrl_td(xhci, td,
event_trb, event,
> ep,
> > > > -						 &status);
> > > > -		else if
(usb_endpoint_xfer_isoc(&td->urb->ep->desc))
> > > > -			ret = process_isoc_td(xhci, td,
event_trb, event,
> ep,
> > > > -						 &status);
> > > > -		else
> > > > -			ret = process_bulk_intr_td(xhci, td,
event_trb,
> event,
> > > > -						 ep, &status);
> > > > +			/* Now update the urb's actual_length
and give
> back to
> > > > +			 * the core
> > > > +			 */
> > > > +			if
(usb_endpoint_xfer_control(&td->urb->ep->desc))
> > > > +				ret = process_ctrl_td(xhci, td,
event_trb,
> event, ep,
> > > > +
&status);
> > > > +			else if
(usb_endpoint_xfer_isoc(&td->urb->ep-
> >desc))
> > > > +				ret = process_isoc_td(xhci, td,
event_trb,
> event, ep,
> > > > +
&status);
> > > > +			else
> > > > +				ret = process_bulk_intr_td(xhci,
td,
> event_trb, event,
> > > > +							 ep,
&status);
> > > > +		} else {
> > > > +			xhci_dbg(xhci, "event_seg is NULL\n");
> > > > +			return -ENODEV;
> > > > +		}
> > >
> > > I assume you did this part of the patch because handle_tx_event()
was
> > > infinite looping.  I don't think it's a very good fix, and I'd
rather
> > > know the underlying cause for this.
> > >
> > > The fix looks a bit...odd.  The only way the else statement would
run
> is
> > > if this statement is false:
> > > 	if (!event_seg &&
> > >                    (!ep->skip ||
> > > 		   !usb_endpoint_xfer_isoc(&td->urb->ep->desc))) {
> > >
> > > or
> > > 	if (!(!event_seg && (!ep->skip ||
!usb_endpoint_xfer_isoc(&td->urb-
> >ep->desc))))
> > > 	if ((event_seg || !(!ep->skip ||
!usb_endpoint_xfer_isoc(&td->urb-
> >ep->desc))))
> > > 	if ((event_seg || (ep->skip &&
usb_endpoint_xfer_isoc(&td->urb->ep-
> >desc))))
> > >
> > > We know the event_seg is NULL, so the skip flag was set and this
was
> an
> > > isochronous endpoint.  I wonder if the event_seg was NULL because
of
> the
> > > lack of setting the dequeue pointer while processing the skipped
TDs.
> > >
> > > Does the infinite looping go away when you apply *just* the patch
to
> fix
> > > updating the dequeue pointer?  What about the issues with
triggering
> the
> > > BUG() calls while setting the TR dequeue pointer?
> >
> > No, Andiry's fix for the dequeue pointer didn't fix either of those.
> >
> > --
> > Paul
> >

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html