Re: Etron EJ168 USB3.0 rev1 and SanDisk Extreme 64GB USB3.0 Stick: random resets

Ivan P <chrnosphered@xxxxxxxxx> · Thu, 14 Jan 2016 22:17:29 +0100

I've tested linux-lts (4.1.15), and the same thing happens (see the
usbmon_sandisk_on_lts trace). The verbose debugging output is also
included (dmesg_xhci_ex).

I don't have another USB3.0 device, but I tested with a USB2.0 stick, and
I'm getting intermittent one-second freezes of the whole PC when writing
to it. Using a USB2.0 port, nothing of the sort happens. However, even
with the freezes, the copy process finishes correctly (see the
usbmon_patriot trace).

I've also tried the stick on another PC that has the NEC Corporation
uPD720200 USB 3.0 Host Controller (rev 03), but that one's even older, and
unsurprisingly shows a similar picture (see usbmon_sandisk_on_nec).

The new traces are in the same dropbox folder as before.
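
For anyone going through the traces, here is a minimal sketch of how the
stalled transfer can be picked out of usbmon text output automatically. It
pairs each submission (S) with its completion (C) by URB tag and reports
completions that took long or ended with a non-zero status. The function
names are hypothetical, only bulk lines are handled, timestamp wraparound
is ignored, and the sample lines are the four from the excerpt quoted
below, rejoined where the mail wrapped them.

```python
# Linux errno values that appear in the quoted trace (hardcoded so the
# sketch gives the same labels on any OS).
LINUX_ERRNO = {0: "OK", 104: "ECONNRESET", 115: "EINPROGRESS"}

# The four usbmon lines from the quoted excerpt, with mail wrapping undone.
TRACE = [
    "ffff8802fd7bf180 1135243584 S Bo:2:004:2 -115 31 = 55534243 1cc50200 "
    "00100000 80000a28 0000000d e0000008 00000000 000000",
    "ffff8802fd7bf180 1135243601 C Bo:2:004:2 0 31 >",
    "ffff88028df546c0 1135243607 S Bi:2:004:1 -115 4096 <",
    "ffff88028df546c0 1165404890 C Bi:2:004:1 -104 4096 = e83ad4f0 096d0965 "
    "b6e2cf54 9165f0e8 2a39d865 8e097d4d 2bef792c e0e7adaf",
]


def parse_usbmon_line(line):
    """Parse one bulk usbmon text line into a dict; return None otherwise."""
    fields = line.split()
    if len(fields) < 6:
        return None
    tag, timestamp, event, address, status, length = fields[:6]
    if not address.startswith(("Bi:", "Bo:")):
        return None  # this sketch only understands bulk transfers
    try:
        return {
            "tag": tag,
            "time_us": int(timestamp),  # usbmon timestamps are microseconds
            "event": event,
            "address": address,
            "status": int(status),
            "length": int(length),
        }
    except ValueError:
        return None


def find_stalled(lines, threshold_s=1.0):
    """Return (address, seconds, status) for slow or failed bulk completions."""
    pending = {}  # URB tag -> submission timestamp in microseconds
    hits = []
    for line in lines:
        rec = parse_usbmon_line(line)
        if rec is None:
            continue
        if rec["event"] == "S":
            pending[rec["tag"]] = rec["time_us"]
        elif rec["event"] == "C" and rec["tag"] in pending:
            elapsed = (rec["time_us"] - pending.pop(rec["tag"])) / 1e6
            if elapsed >= threshold_s or rec["status"] != 0:
                hits.append((rec["address"], elapsed, rec["status"]))
    return hits


for address, elapsed, status in find_stalled(TRACE):
    label = LINUX_ERRNO.get(-status, "unknown")
    print(f"{address}: completed after {elapsed:.3f} s, status {status} ({label})")
```

On the sample lines this flags only the Bi:2:004:1 read, which completed
about 30.16 seconds after submission with status -104.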

On Thu, Jan 14, 2016 at 6:08 PM, Mathias Nyman
<mathias.nyman@xxxxxxxxxxxxxxx> wrote:
> On 13.01.2016 00:01, Alan Stern wrote:
>>
>> On Tue, 12 Jan 2016, Ivan P wrote:
>>
>>> I've uploaded the usbmon traces here:
>>> https://www.dropbox.com/sh/0gldb4r4g6p4p5w/AAAdmHP_Slya3f440v9oe1qka?dl=0
>>>
>>> One run tracing every bus (0u), one run tracing only the sandisk stick
>>> (2u).
>>> Each trace is from starting to copy the files to the point it hangs
>>> up, at which I attempt to cd into the mount point.
>>
>>
>> I looked at the second trace.  It seems to indicate a bug in the xHCI
>> host controller hardware or driver.
>>
>> Everything is okay almost up to the end.  Here's where the trouble
>> starts:
>>
>>> ffff8802fd7bf180 1135243584 S Bo:2:004:2 -115 31 = 55534243 1cc50200
>>> 00100000 80000a28 0000000d e0000008 00000000 000000
>>> ffff8802fd7bf180 1135243601 C Bo:2:004:2 0 31 >
>>> ffff88028df546c0 1135243607 S Bi:2:004:1 -115 4096 <
>>> ffff88028df546c0 1165404890 C Bi:2:004:1 -104 4096 = e83ad4f0 096d0965
>>> b6e2cf54 9165f0e8 2a39d865 8e097d4d 2bef792c e0e7adaf
>>
>>
>> This shows the computer trying to read 4 KB of data from the device.
>> All of the data was received okay, but for some reason the transfer
>> didn't end properly.  Instead, it timed out after 30 seconds and was
>> cancelled.  That's the fundamental bug.
>>
>> Attempts to recover by resetting the device failed (apparently due to a
>> bug in the device) and from that point on, nothing worked.  The device
>> kept reporting failures for each command, but with no error code.
>>
>> Since the original problem looks like an xHCI-related issue, maybe
>> Mathias can suggest some things to try.
>>
>
> Does this occur on xhci hosts from other vendors? How about older kernels
> (before 4.3)?
>
> There was a change in the 4.3 kernel (and older stable kernels) in how the
> xhci driver returns bulk IN URBs. If a transfer is short, the driver won't
> give back the URB immediately; instead it waits until it gets a completion
> event for the last transfer block in that transfer descriptor.
> (We should get the event even if the transfer was short and never filled.)
>
> Turns out not all hosts send this second completion event, so that change
> will be reverted.
>
> commit e210c422b6fdd2dc123bedc588f399aefd8bf9de
>     xhci: don't finish a TD if we get a short transfer event mid TD
> Even if it is supposed to only affect short transfers with data over 64k,
> the symptoms you see fit this area. If the xhci driver doesn't return the
> URB, it will be canceled with ECONNRESET (-104) status.
>
> Does verbose debugging for xhci show anything?  Enable it with:
> echo -n 'module xhci_hcd =p' > /sys/kernel/debug/dynamic_debug/control
>
> -Mathias
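
The 31-byte bulk-out submission in the quoted trace is a USB mass-storage
Command Block Wrapper (CBW), and decoding it shows where the "read 4 KB"
interpretation comes from. The following is a minimal sketch, not anything
taken from the kernel: decode_cbw is a hypothetical helper, it only
understands a READ(10) command block, and the 512-byte block size used in
the final check is an assumption (it is consistent with the 4096-byte data
length in the wrapper).

```python
import struct

# Data words of the bulk-out submission from the quoted usbmon trace.
CBW_HEX = ("55534243 1cc50200 00100000 80000a28 "
           "0000000d e0000008 00000000 000000")


def decode_cbw(hex_words):
    """Decode a 31-byte Command Block Wrapper given as usbmon hex words."""
    raw = bytes.fromhex("".join(hex_words.split()))
    if len(raw) != 31:
        raise ValueError(f"a CBW is 31 bytes, got {len(raw)}")
    # Wrapper header: little-endian signature, tag, data length, flags,
    # LUN, and command block length.
    signature, tag, data_length, flags, lun, cb_length = struct.unpack_from(
        "<4sIIBBB", raw, 0)
    if signature != b"USBC":
        raise ValueError("missing 'USBC' signature")
    cdb = raw[15:15 + (cb_length & 0x1F)]
    result = {
        "signature": signature,
        "tag": tag,
        "data_length": data_length,
        "direction": "in" if flags & 0x80 else "out",
        "lun": lun & 0x0F,
        "opcode": cdb[0],
    }
    if cdb[0] == 0x28:  # SCSI READ(10): big-endian LBA and block count
        lba, blocks = struct.unpack_from(">IxH", cdb, 2)
        result["lba"] = lba
        result["blocks"] = blocks
    return result


cbw = decode_cbw(CBW_HEX)
print(cbw)

# Assumption: 512-byte logical blocks.
ASSUMED_BLOCK_SIZE = 512
print(cbw["blocks"] * ASSUMED_BLOCK_SIZE == cbw["data_length"])
```

The wrapper asks for a device-to-host transfer of 4096 bytes, which matches
the 4096-byte bulk-in URB that is submitted right after it and later
completes with -104.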