Re: [PATCH RESEND v5 0/3] ISOC supporting for xHCI

Sarah Sharp <sarah.a.sharp@xxxxxxxxxxxxxxx> · Mon, 24 May 2010 15:21:02 -0700

I'm not sure if this is an issue, but here's one thing I notice when
looking over the code.  I think your code could get messed up because of
how you use urb_done in handle_tx_event().

It gets initialized to false at the beginning of the function:

static int handle_tx_event(struct xhci_hcd *xhci,
                struct xhci_transfer_event *event)
{
        struct xhci_virt_device *xdev;
...
        bool urb_done = false;

Later, you added this goto statement here:

handle_td:
        /* This TRB should be in the TD at the head of this ring's TD list */
        if (list_empty(&ep_ring->td_list)) {

When you find that all TDs of an URB are processed, you set urb_done to true:

                urb_priv->td_cnt++;
                /* Giveback urb when all the tds are completed */
                if (urb_priv->td_cnt == urb_priv->length)
                        urb_done = true;

                /* Leave the TD around for the reset endpoint function to use
                 * (but only if it's not a control endpoint, since we already
                 * queued the Set TR dequeue pointer command for stalled
                 * control endpoints).
                 */
                if (usb_endpoint_xfer_control(&urb->ep->desc) ||
                        (trb_comp_code != COMP_STALL &&
                                trb_comp_code != COMP_BABBLE)) {
                        if (urb_done)
                                urb_free_priv(xhci, urb_priv);
                }

Later, you look at urb_done to decide whether to give the URB back:

        if (urb && urb_done) {
                usb_hcd_unlink_urb_from_ep(xhci_to_hcd(xhci), urb);
                xhci_dbg(xhci, "Giveback URB %p, len = %d, status = %d\n",
                                urb, urb->actual_length, status);
                spin_unlock(&xhci->lock);
                usb_hcd_giveback_urb(xhci_to_hcd(xhci), urb, status);
                spin_lock(&xhci->lock);
        }

If the ep->skip flag is set at the end of the function, you jump back to
handle_td.

Now, what happens if the host has skipped multiple URB's worth of TDs?
On the last TD in the first skipped URB, urb_done will be set to true.
Then you'll give the URB back.  Now you jump to handle_td, and urb_done
is *still* set to true (since you didn't execute the first part of the
function where you set it to false).

You process the first TD in the next URB, and since the urb_done flag is
set, so you giveback the URB.  Then you attempt to process the next TD
in the URB, and you've freed the urb_priv pointer, so you do a NULL
pointer dereference.

So I'm not sure if that's your exact bug, but it seems likely.  I think
if you drop the urb_done flag and replace it with (urb_priv->td_cnt ==
urb_priv->length), you could work around this bug.

Also, is there a way you can change the code so that you don't have the
handle_td jump?  It's likely there will be other bugs like this.  I know
that handle_tx_event() is a big mess, and I apologize for that.  I had
some patches to break it up into smaller functions, but they no longer
apply against the current tree.

Sarah Sharp

On Fri, May 21, 2010 at 12:56:54PM -0700, Sarah Sharp wrote:
> On Wed, May 19, 2010 at 06:55:20PM +0800, Xu, Andiry wrote:
> > > -----Original Message-----
> > > From: Sarah Sharp [mailto:sarah.a.sharp@xxxxxxxxxxxxxxx]
> > > Hi Andiry,
> > > 
> > > I've been able to grab the oops message out after turning off a lot of
> > > debugging so that I could use netconsole.  The modifications I made to
> > > turn off
> > > debugging are on my amd-isoc branch.  This time the oops was triggered
> > > with just a HS webcam and a FS webcam plugged into a HS hub.
> > > 
> > 
> > Hi Sarah,
> > 
> > When did the oops triggered? Did it occur when you start the
> > application, when the application is running, or when you quit the
> > application? Can it be reproduced every time? I'm using a HS tv-tuner, a
> > FS webcam plugged into a HS hub and trying to reproduce it.
> 
> The oops was triggered when I opened cheese with two webcams plugged
> into a hub.  I was also running netconsole at the time, with a couple of
> patches on top of yours to limit the amount of debugging.  It's possible
> it's related to netconsole.  I'll have to test and see.
> 
> > > The dmesg is attached.  The first thing I notice is that you're giving
> > > back URB
> > > ffff88012fe9f000 twice, any idea how that could happen?
> > > 
> > 
> > It seems the URB is given back twice successfully, which is strange.
> > when the first time usb_hcd_giveback_urb() called, urb->hcpriv is set to
> > NULL, and it will fail the second time as NULL pointer dereference when
> > using urb_priv. 
> > As I can see, does URB ffff88012fe9f000 more like a resubmitted URB or
> > duplicate print? The issue occurs with URB ffff88012fe9e800.
> 
> Ah, I see.  Yes, the line about giving back URB ffff88012fe9f000 seems
> like a duplicate print in the dmesg.  You're right that URB
> ffff88012fe9e800 looks like the real problem.  I can't tell whether it
> was resubmitted after it was given back on line 5968 and before it was
> canceled on 5970.
> 
> > > When I ran the trimmed oops message through scripts/markup_oops.pl,
> > got
> > > the following output:
> > > 
> > > No vmlinux specified, assuming /lib/modules/2.6.34-rc7/build/vmlinux
> > >  		break;
> > >  	case TRB_TYPE(TRB_TRANSFER):
> > >  		ret = handle_tx_event(xhci, &event->trans_event);
> > >  		if (ret < 0)
> > >  			xhci->error_bitmask |= 1 << 9;
> > >  ffffffffa030179e:	80 cc 02             	or     $0x2,%ah
> > >  ffffffffa03017a1:	89 83 c0 09 00 00    	mov    %eax,0x9c0(%rbx)
> > |
> > > %ebx = ffff88012e146220
> > >  ffffffffa03017a7:	b8 01 00 00 00       	mov    $0x1,%eax   |
> > %eax
> > > => ffff88012fe9e800
> > >  ffffffffa03017ac:	e9 af f3 ff ff       	jmpq   ffffffffa0300b60
> > > <xhci_handle_event+0xa0>
> > >  ffffffffa03017b1:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)   |
> > %eax
> > > = ffff88012fe9e800
> > >  		int skip_td = 0;
> > >  		union xhci_trb *cur_trb;
> > >  		struct xhci_segment *cur_seg;
> > > 
> > >  		urb_priv = td->urb->hcpriv;
> > >  		idx = urb_priv->td_cnt;
> > >  ffffffffa03017b8:	48 8b 50 08          	mov    0x8(%rax),%rdx |
> > > %eax = ffff88012fe9e800  %edx => 0
> > > *ffffffffa03017bc:	8b 4a 04             	mov    0x4(%rdx),%ecx |
> > > %edx = 0  %ecx = ffff8800b7dc7680 <--- faulting instruction
> > >  		status = 0;
> > > 
> > >  		if (ep->skip) {
> > >  ffffffffa03017bf:	4b 8d 54 ad 00       	lea
> > > 0x0(%r13,%r13,4),%rdx
> > >  ffffffffa03017c4:	48 8d 14 92          	lea
> > (%rdx,%rdx,4),%rdx
> > >  ffffffffa03017c8:	41 80 bc d7 e0 00 00 	cmpb
> > > $0x0,0xe0(%r15,%rdx,8)
> > >  ffffffffa03017cf:	00 00
> > >  ffffffffa03017d1:	0f 85 bf 00 00 00    	jne    ffffffffa0301896
> > > <xhci_handle_event+0xdd6>
> > >  						 -EREMOTEIO;
> > >  			else
> > >  				td->urb->iso_frame_desc[idx].status = 0;
> > >  		} else {
> > >  			/* handle completion code */
> > >  			switch (trb_comp_code) {
> > >  ffffffffa03017d7:	83 7d cc 1f          	cmpl   $0x1f,-0x34(%rbp)
> > >  ffffffffa03017db:	0f 86 e4 01 00 00    	jbe    ffffffffa03019c5
> > > <xhci_handle_event+0xf05>
> > >  			case COMP_STOP:
> > > 
> > > So it looks like the urb_priv might be NULL.  If you're giving back
> > URBs
> > > that have already been given back, then that could be the cause.
> > >
> > 
> > I wonder why this would happen. It seems the urb of the td has already
> > been given back, so the urb_priv is NULL. But a URB can be given back
> > only when urb_priv->td_cnt is equal to urb_priv->length.
> > Urb_priv->td_cnt is increased only in handle_tx_event() when the td is
> > normally processed, or in xhci_giveback_urb_in_irq(), when the td is
> > cancelled or killed. As I can see, the two cases will never occur with
> > the same td. Do I miss something?
> 
> Maybe the td_list is not in sync with you giving back the URB?  Maybe
> there's a race condition between the cancellation code and the missed
> service event code?  What happens if an URB is canceled, a stop endpoint
> command is queued, and then we get a missed service event?  If your code
> is in the middle of processing the missed service event when the stop
> endpoint command completes and an interrupt is fired, what happens?  I'm
> especially concerned about dropping the xhci->lock to giveback URBs in
> this corner case.
> 
> > The other thing I notice is the skip flag of the ep is set but not
> > clear. That's not right. The skip flag should be cleared when the
> > corresponding td is found or when the ep td_list is empty. And there is
> > a Event Ring Full Error and a Missed Service Error in event ring. The
> > xHC missed so many tds, from td at 6a2c4ab0 to td at 6e1b4810, as I
> > count more than 80 tds missed. Is that normal?
> 
> I would say that's not normal.  That sounds like you're not updating the
> dequeue pointer for the ring when you're processing the skipped TDs.  I
> don't know why you would get an event ring full error otherwise.  I
> suppose the hardware could be messing up, but if so, we'll have to deal
> with it because I'm testing with a host you can buy off the shelf. :(
> 
> > It's easy to add a urb_priv check that's not the root cause. I will try
> > to figure it out. Any suggestions are greatly appreciated.
> 
> I'll take a look at the code and see if I can spot anything.
> 
> Sarah Sharp
> --
> To unsubscribe from this list: send the line "unsubscribe linux-usb" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html