Re: Kernel lockup when unplugging device from hub

Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> · Mon, 27 Jul 2009 16:33:22 -0400 (EDT)

On Fri, 24 Jul 2009, Matthijs Kooijman wrote:

> Hi Alan,
> 
> firstly, I tried disabling the USB_TT_NEWSCHED. That didn't solve the lockup
> problem. The rest of my tests have been done without NEWSCHED enabled.

USB_TT_NEWSCHED shouldn't make any difference.

> > 	http://marc.info/?t=124807676700001&r=1&w=2
> > You can try repeating some of the tests described there.
> I've applied the patch suggested there. This obviously doesn't show the
> warning anymore, but it also prevents my system from locking up. Instead of
> the warning and lockup, I now get messages telling me ehci can't reset the
> (Keyboard receiver) device, with error codes -71 or -108. Most of the time I
> get both codes, but sometimes only -108 and once neither.

That's normal.  -71 means there was a communications error, which you
would expect since the receiver wasn't plugged in and hence didn't
respond to the messages sent by the kernel.  -108 means the device has 
been unplugged, which again is to be expected.
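
For reference, here is a minimal sketch for decoding those two codes.  It
assumes the generic Linux errno numbering (71 = EPROTO, 108 = ESHUTDOWN); the
table, the function name and the one-line descriptions are only illustrative
and are not taken from any kernel source.

```python
# Illustrative table, assuming the generic Linux errno numbering.
# The descriptions paraphrase the explanation given above.
USB_ERRNO = {
    71: ("EPROTO", "communications error; the device did not respond"),
    108: ("ESHUTDOWN", "the device has been unplugged"),
}


def describe_usb_error(code):
    """Return a readable string for a negative kernel error code such as -71."""
    name, meaning = USB_ERRNO.get(abs(code), ("unknown", "not in this table"))
    return f"{code}: {name} ({meaning})"


if __name__ == "__main__":
    for code in (-71, -108):
        print(describe_usb_error(code))
```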

> Apart from the error messages, my system seems usable, though I've had some
> problems with my tablet (see below) and found that my keyboard stopped
> responding at times, which might or might not be related.
> 
> All of the following tests are with the patch applied.
> 
> Just like in the thread you refer to, I get some dma_pool_destroy errors when
> removing the ehci_hcd module:
>   ehci_hcd 0000:00:13.2: dma_pool_destroy ehci_qtd, ffff88004880f000 busy
>   ehci_hcd 0000:00:13.2: dma_pool_destroy ehci_qh, ffff88004884c000 busy

Those are a real problem.  I still need to figure out what's causing 
them.

> Additionally, I get a lot of these (I think they started only after the first
> disconnect or rmmod, not 100% sure):
>   ehci_hcd 0000:00:13.2: detected XactErr len 0/8 retry 25
> with incrementing retry numbers (I also saw a retry -189 once), or

Those XactErr messages also indicate communications errors.  But you
have far too many of them in your log; they suggest there is something
wrong with your hardware.  Probably a bad cable or a bad hub.
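
To get a rough idea of how many there are, something like the sketch below
could be run over the kernel log.  The regular expression is modelled only on
the message format quoted above, and the function name is made up for this
example; treat it as an assumption, not as a description of how the driver
formats every message.

```python
import re
from collections import Counter

# Modelled on lines such as:
#   ehci_hcd 0000:00:13.2: detected XactErr len 0/8 retry 25
# The retry field may be negative, as in the "retry -189" case.
XACTERR = re.compile(
    r"ehci_hcd (?P<dev>\S+): detected XactErr "
    r"len (?P<done>\d+)/(?P<total>\d+) retry (?P<retry>-?\d+)"
)


def count_xacterr(lines):
    """Count XactErr messages per host controller address."""
    counts = Counter()
    for line in lines:
        match = XACTERR.search(line)
        if match:
            counts[match.group("dev")] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "ehci_hcd 0000:00:13.2: detected XactErr len 0/8 retry 25",
        "usb 1-2.1.3: unlink qh4-0601/ffff88004884c6c0 start 1 [1/2 us]",
        "ehci_hcd 0000:00:13.2: detected XactErr len 0/9 retry 1",
    ]
    print(count_xacterr(sample))
```

Passing an open log file instead of the sample list works the same way, since
the function only iterates over lines.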

> combinations of these two lines:
>   ehci_hcd 0000:00:13.2: detected XactErr len 0/9 retry 1
>   usb 1-2.1.3: unlink qh4-0601/ffff88004884c6c0 start 1 [1/2 us]
> with retry 1 every time. The latter seem to be output only when I have my
> wacom tablet plugged in, and seem to cause it to be jumpy. This seems an
> unrelated issue, though, since these messages appear at boot already, before
> unplugging any device.

This funny behavior is caused by that hub-or-cable problem.  You might 
want to try plugging the Wacom tablet directly into the other hub (the 
one the audio device is attached to).

> Not sure how relevant this is, but as soon as I rmmod'ed the ehci_hcd module,
> the USB devices registered with the ohci_hcd module (I think) and continued
> working, with no more errors when unplugging anything.

Again, that's expected.  This bug I'm trying to track down is in 
ehci-hcd, not in ohci-hcd.

> I originally thought that the dma_pool_destroy errors disappeared when
> ohci_hcd is not loaded, though on a second try, they were still there.
> 
> I did a more extended test (where I started out with too many USB devices
> plugged in, sorry for that noise). I did find that the can't reset errors
> didn't occur when unplugging my tablet, only with the keyboard receiver.
> 
> 
> I've also enabled usbmon and got a few traces. Please find a full kernel log
> from boot at http://www.stdout.nl/ehci-debug/kernel.log.txt, together with
> http://www.stdout.nl/ehci-debug/during-disconnect.usbmon.txt and
> http://www.stdout.nl/ehci-debug/during-rmmod.usbmon.txt
> 
> I hope this helps.

There are still those dma_pool_destroy problems to track down...

Alan Stern
