On Sat, Feb 19, 2011 at 10:48:27AM -0500, Alan Stern wrote:
> On Fri, 18 Feb 2011, Don Zickus wrote:
>
> > Hi,
> >
> > I am trying to debug a panic for our 2.6.32 based kernel (there aren't any
> > changes to usb other than usb3 stuff in our version of 2.6.32)
>
> You mean it includes all the changes below drivers/usb that are in the
> vanilla 2.6.37 kernel (except for the USB-3 stuff)?

Sorry, I meant to say the usb stuff is mostly stock 2.6.32 with just
backports of the usb3 changes on top of it.

(I was trying to be upfront about the fact that the kernel isn't identical
to 2.6.32, and to convince people I wasn't trying to waste their time, by
explaining that the changes were mostly to the usb3 subsystem. But I
understand if the kernel is too old for anyone to care.)

>
> > The panic is attached below and I am having trouble reproducing it, so I
> > am trying to 'think' this one out. I believe there is a race condition
> > but I don't know enough about the paths in usb to understand it and was
> > hoping for some help from the mailing list.
>
> You should consider enabling CONFIG_PRINTK_TIME in all your test
> kernels. The high-precision timestamps can often help with debugging.

Good point.

>
> > The problem is when running a stress test (scrashme) on a powerpc blade,
> > we believe someone accidentally pushed a button on the blade that 'magically
> > routes' the side mounted usb cdrom to that blade. A few moments later, we
> > believe that someone realized their error and pressed the button on the
> > correct blade, thus disconnecting the cdrom and having it routed to the
> > other blade.
> >
> > As a result the below panic happened. Looking at where the panic happened
> > and the assembly code, I am reasonably confident the panic happened at:
> >
> > drivers/usb/core/hcd.c::usb_hcd_unlink_urb::1459
> >
> > (right before the unlink1 command)
> > hcd = bus_to_hcd(urb->dev->bus);
> >
> > what happens is that urb->dev is NULL and thus the dereference to dev->bus
> > panics the box.
>
> Are you sure that urb->dev is NULL? As opposed to pointing to a memory
> location that used to be occupied by a device structure and now
> contains some other data?

Well, looking at the register in the panic output that is used to
dereference the memory, it shows 0x0 (r9). Also, the panic message itself
shows that it cannot access memory at 0x00000040 (which, according to the
disassembly of the code, is a 64-byte offset from r9, which was zero). So I
am pretty sure the urb->dev pointer was NULL.

However, I can't say for sure whether the urb itself was corrupted or
contains new data. r0 was supposed to contain the use_count but it doesn't
look right. I'll double check with some good data tomorrow.

>
> > The only way I can see that happening is usb_put_dev went to zero and
> > released the device (which would mean the usb_put_dev a couple lines later
> > would cause another friendly message).
>
> This would not affect urb->dev, which suggests that you're not looking
> at it the right way.

Ok. I thought that if the refcount dropped to zero via usb_put_dev, the
urb->dev would be freed. Like you said, I am probably misunderstanding the
code.

>
> > My first impression is that there
> > is a race condition somewhere, but I don't know the different paths well
> > enough to know where.
> >
> > Does anyone have any thoughts about this or can help me through this
> > (especially since I am having trouble reproducing it :-/).
>
> I could help, given more information. At this stage, I don't think you
> know enough about the problem to be able to track it down. Unless you
> can reproduce the bug, the situation may be hopeless. Sorry...

Thanks anyway. Sorry for wasting your time. I thought I would try and
give it a shot with the limited info I had.

Cheers,
Don
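
P.S. For the archive, the arithmetic behind the "r9 was zero" reading can be
sketched in a few lines of userspace C. This is only an illustration under
stated assumptions: struct fake_usb_device and struct fake_usb_bus are
made-up stand-ins, not the real kernel structures, and the 0x40 offset of
the bus member is taken from the faulting address in the panic rather than
from the actual struct usb_device layout. base_from_fault() and
fault_implies_null_dev() are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; these are NOT the real kernel layouts. */
struct fake_usb_bus {
    int busnum;
};

struct fake_usb_device {
    char pad[0x40];            /* placeholder for whatever precedes 'bus' */
    struct fake_usb_bus *bus;  /* assumed to sit at offset 0x40 (64 bytes) */
};

/*
 * A load of base->member faults at (base + offsetof(member)).
 * Given the faulting address and the member offset, recover the base.
 */
static uintptr_t base_from_fault(uintptr_t fault_addr, size_t member_off)
{
    assert(fault_addr >= member_off);
    return fault_addr - member_off;
}

/* Returns 1 if a fault at fault_addr is consistent with a NULL device pointer. */
static int fault_implies_null_dev(uintptr_t fault_addr)
{
    size_t off = offsetof(struct fake_usb_device, bus);

    return base_from_fault(fault_addr, off) == 0;
}
```

With the faulting address 0x40 from the panic, fault_implies_null_dev(0x40)
returns 1, which is consistent with a NULL urb->dev. It says nothing about
whether the urb itself was still valid at that point.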