On Wed, May 19, 2010 at 06:55:20PM +0800, Xu, Andiry wrote: > > -----Original Message----- > > From: Sarah Sharp [mailto:sarah.a.sharp@xxxxxxxxxxxxxxx] > > Hi Andiry, > > > > I've been able to grab the oops message out after turning off a lot of > > debugging so that I could use netconsole. The modifications I made to > > turn off > > debugging are on my amd-isoc branch. This time the oops was triggered > > with just a HS webcam and a FS webcam plugged into a HS hub. > > > > Hi Sarah, > > When did the oops triggered? Did it occur when you start the > application, when the application is running, or when you quit the > application? Can it be reproduced every time? I'm using a HS tv-tuner, a > FS webcam plugged into a HS hub and trying to reproduce it. The oops was triggered when I opened cheese with two webcams plugged into a hub. I was also running netconsole at the time, with a couple of patches on top of yours to limit the amount of debugging. It's possible it's related to netconsole. I'll have to test and see. > > The dmesg is attached. The first thing I notice is that you're giving > > back URB > > ffff88012fe9f000 twice, any idea how that could happen? > > > > It seems the URB is given back twice successfully, which is strange. > when the first time usb_hcd_giveback_urb() called, urb->hcpriv is set to > NULL, and it will fail the second time as NULL pointer dereference when > using urb_priv. > As I can see, does URB ffff88012fe9f000 more like a resubmitted URB or > duplicate print? The issue occurs with URB ffff88012fe9e800. Ah, I see. Yes, the line about giving back URB ffff88012fe9f000 seems like a duplicate print in the dmesg. You're right that URB ffff88012fe9e800 looks like the real problem. I can't tell whether it was resubmitted after it was given back on line 5968 and before it was canceled on 5970. > > When I ran the trimmed oops message through scripts/markup_oops.pl, > got > > the following output: > > > > No vmlinux specified, assuming /lib/modules/2.6.34-rc7/build/vmlinux > > break; > > case TRB_TYPE(TRB_TRANSFER): > > ret = handle_tx_event(xhci, &event->trans_event); > > if (ret < 0) > > xhci->error_bitmask |= 1 << 9; > > ffffffffa030179e: 80 cc 02 or $0x2,%ah > > ffffffffa03017a1: 89 83 c0 09 00 00 mov %eax,0x9c0(%rbx) > | > > %ebx = ffff88012e146220 > > ffffffffa03017a7: b8 01 00 00 00 mov $0x1,%eax | > %eax > > => ffff88012fe9e800 > > ffffffffa03017ac: e9 af f3 ff ff jmpq ffffffffa0300b60 > > <xhci_handle_event+0xa0> > > ffffffffa03017b1: 0f 1f 80 00 00 00 00 nopl 0x0(%rax) | > %eax > > = ffff88012fe9e800 > > int skip_td = 0; > > union xhci_trb *cur_trb; > > struct xhci_segment *cur_seg; > > > > urb_priv = td->urb->hcpriv; > > idx = urb_priv->td_cnt; > > ffffffffa03017b8: 48 8b 50 08 mov 0x8(%rax),%rdx | > > %eax = ffff88012fe9e800 %edx => 0 > > *ffffffffa03017bc: 8b 4a 04 mov 0x4(%rdx),%ecx | > > %edx = 0 %ecx = ffff8800b7dc7680 <--- faulting instruction > > status = 0; > > > > if (ep->skip) { > > ffffffffa03017bf: 4b 8d 54 ad 00 lea > > 0x0(%r13,%r13,4),%rdx > > ffffffffa03017c4: 48 8d 14 92 lea > (%rdx,%rdx,4),%rdx > > ffffffffa03017c8: 41 80 bc d7 e0 00 00 cmpb > > $0x0,0xe0(%r15,%rdx,8) > > ffffffffa03017cf: 00 00 > > ffffffffa03017d1: 0f 85 bf 00 00 00 jne ffffffffa0301896 > > <xhci_handle_event+0xdd6> > > -EREMOTEIO; > > else > > td->urb->iso_frame_desc[idx].status = 0; > > } else { > > /* handle completion code */ > > switch (trb_comp_code) { > > ffffffffa03017d7: 83 7d cc 1f cmpl $0x1f,-0x34(%rbp) > > ffffffffa03017db: 0f 86 e4 01 00 00 jbe ffffffffa03019c5 > > <xhci_handle_event+0xf05> > > case COMP_STOP: > > > > So it looks like the urb_priv might be NULL. If you're giving back > URBs > > that have already been given back, then that could be the cause. > > > > I wonder why this would happen. It seems the urb of the td has already > been given back, so the urb_priv is NULL. But a URB can be given back > only when urb_priv->td_cnt is equal to urb_priv->length. > Urb_priv->td_cnt is increased only in handle_tx_event() when the td is > normally processed, or in xhci_giveback_urb_in_irq(), when the td is > cancelled or killed. As I can see, the two cases will never occur with > the same td. Do I miss something? Maybe the td_list is not in sync with you giving back the URB? Maybe there's a race condition between the cancellation code and the missed service event code? What happens if an URB is canceled, a stop endpoint command is queued, and then we get a missed service event? If your code is in the middle of processing the missed service event when the stop endpoint command completes and an interrupt is fired, what happens? I'm especially concerned about dropping the xhci->lock to giveback URBs in this corner case. > The other thing I notice is the skip flag of the ep is set but not > clear. That's not right. The skip flag should be cleared when the > corresponding td is found or when the ep td_list is empty. And there is > a Event Ring Full Error and a Missed Service Error in event ring. The > xHC missed so many tds, from td at 6a2c4ab0 to td at 6e1b4810, as I > count more than 80 tds missed. Is that normal? I would say that's not normal. That sounds like you're not updating the dequeue pointer for the ring when you're processing the skipped TDs. I don't know why you would get an event ring full error otherwise. I suppose the hardware could be messing up, but if so, we'll have to deal with it because I'm testing with a host you can buy off the shelf. :( > It's easy to add a urb_priv check that's not the root cause. I will try > to figure it out. Any suggestions are greatly appreciated. I'll take a look at the code and see if I can spot anything. Sarah Sharp -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html