Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

walt <w41ter@xxxxxxxxx> · Tue, 07 Jan 2014 05:29:48 -0800

On 01/06/2014 04:31 PM, Sarah Sharp wrote:
> On Fri, Jan 03, 2014 at 03:29:29PM -0800, Sarah Sharp wrote:

>>  With the dmesg, I can finally see what happened:
>>
>> [  188.703059] xhci_hcd 0000:03:00.0: Cancel URB ffff8800b7d2e0c0, dev 1, ep 0x2, starting at offset 0xbb7b9000
>> [  188.703072] xhci_hcd 0000:03:00.0: // Ding dong!
>> [  193.711022] xhci_hcd 0000:03:00.0: xHCI host not responding to stop endpoint command.
>> [  193.711029] xhci_hcd 0000:03:00.0: Assuming host is dying, halting host.
>> [  193.711046] xhci_hcd 0000:03:00.0: // Halt the HC
>> [  193.711060] xhci_hcd 0000:03:00.0: Killing URBs for slot ID 1, ep index 0
>> [  193.711066] xhci_hcd 0000:03:00.0: Killing URBs for slot ID 1, ep index 2
>> [  193.711078] xhci_hcd 0000:03:00.0: Killing URBs for slot ID 1, ep index 3
>> [  193.711096] xhci_hcd 0000:03:00.0: Calling usb_hc_died()
>> [  193.711103] xhci_hcd 0000:03:00.0: HC died; cleaning up
>> [  193.711116] xhci_hcd 0000:03:00.0: xHCI host controller is dead.
>>
>> It seems that the xHCI driver tried to stop the endpoint ring in order
>> to cancel a SCSI transfer, and the driver never got a response for that.
>>
>> The offset is rather suspicious (0xbb7b9000), and it probably means the
>> driver attempted to cancel a transfer that had been moved to the
>> beginning of the ring segment, with no-op TRBs before the link TRB.
>>
>> I suspect David's patch triggers a bug in the command cancellation code.
>> There's also the unlikely possibility that the no-op TRBs did indeed
>> cause the host to hang.  Either way, I'll have to look into it.
>>
>> I'll let you know when I have some diagnostic patches ready.
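
The remark about the offset can be sanity-checked with a little arithmetic.
This is only a sketch, not driver code: it assumes 16-byte TRBs, assumes ring
segments are naturally aligned to their own size, and treats the segment
length (64 or 256 TRBs) as a parameter, since that value depends on the
kernel version.  The helper name trb_index_in_segment is made up for
illustration.

```python
TRB_SIZE = 16  # bytes per TRB (assumption stated above)


def trb_index_in_segment(dma_addr, trbs_per_segment=64):
    """Return the TRB slot that dma_addr falls on within its ring segment.

    Assumes the segment is aligned to its own size, so the low bits of
    the address give the byte offset into the segment.
    """
    segment_bytes = TRB_SIZE * trbs_per_segment
    return (dma_addr % segment_bytes) // TRB_SIZE


if __name__ == "__main__":
    offset = 0xBB7B9000  # "starting at offset" value from the dmesg above
    for trbs in (64, 256):
        print(trbs, trb_index_in_segment(offset, trbs))
```

Under those assumptions the address lands on slot 0 for either segment
length, which fits the idea that the cancelled transfer starts at the
beginning of a ring segment.
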
> 
> Hi Walt,
> 
> I have a couple of patches for you to test.

> Please only apply the first patch (which is diagnostic only), trigger
> your issue, and send me the resulting dmesg.  Then try applying the
> other two patches, and see if the issue goes away.  (I suspect it won't
> but I can't be sure.)

Thanks Sarah.  dmesg0 is from the diagnostic patch only; dmesg1 has all
three patches applied.  Some of the messages in dmesg1 were lost because
the kernel log buffer overflowed, so I may need to enlarge the buffer next
time, but I'll need a reminder of how to do that.

As you suspected, the patches didn't fix the problem, sorry.

I find that I can tell in advance whether the copy is going to succeed
just by watching the light flicker on the usb3 drive.  When the flicker
is absolutely regular, with no variation at all, I can tell within 10 or
15 seconds that the copy will fail.

At the same time the light on the main drive goes dark after 10 seconds,
which implies that the usb3 drive is no longer receiving any data from
the main drive, yet the light on the usb3 drive continues to flicker as
if it were writing data, even after the cp officially fails.  The light
on the usb3 drive never stops flickering until I reboot the machine or
unplug the usb cable.

Attachment: dmesg0.gz
Description: application/gzip

Attachment: dmesg1.gz
Description: application/gzip