Re: Etron EJ168 USB3.0 rev1 and SanDisk Extreme 64GB USB3.0 Stick: random resets

Ivan P <chrnosphered@xxxxxxxxx> · Thu, 21 Jan 2016 19:12:53 +0100

Turns out I DO have another USB3.0 device, an external HDD enclosure.
The PC it is connected to is so weak that it is unable to reach USB3.0
speeds with it, so I forgot about it. I've tested the controller on my
PC with that USB HDD and there don't seem to be any issues, unlike
with the SanDisk stick - so it seems to be the fault of the USB stick
and not the controller per se.

How likely is this misbehavior to be fixed at a later date? I'm asking
because I have about a week left to be able to return the USB stick,
so if it's unlikely a workaround for the stick will be found/made, I'd
rather not keep it. Having to boot Windows just to copy files to/from
it kind of negates the speed advantage it offers.

On Thu, Jan 14, 2016 at 10:17 PM, Ivan P <chrnosphered@xxxxxxxxx> wrote:
> I've tested linux-lts (4.1.15), and the same thing happens (see
> usbmon_sandisk_on_lts trace).
> The verbose debugging is also included (dmesg_xhci_ex). I don't have
> another USB3.0 device, but I tested with a USB2.0 one and I'm getting
> intermittent one-second freezes of the whole PC when writing to that
> USB2.0 stick. Using a USB2.0 port, nothing of the sort happens. However,
> even with the freezes, the copy process finishes correctly (see
> usbmon_patriot trace).
>
> I've also tried the stick on another PC that has the NEC Corporation
> uPD720200 USB 3.0 Host Controller (rev 03), but that one's even older,
> and unsurprisingly shows a similar picture (see usbmon_sandisk_on_nec).
>
> The new traces are in the same dropbox folder as before.
>
> On Thu, Jan 14, 2016 at 6:08 PM, Mathias Nyman
> <mathias.nyman@xxxxxxxxxxxxxxx> wrote:
>> On 13.01.2016 00:01, Alan Stern wrote:
>>>
>>> On Tue, 12 Jan 2016, Ivan P wrote:
>>>
>>>> I've uploaded the usbmon traces here:
>>>> https://www.dropbox.com/sh/0gldb4r4g6p4p5w/AAAdmHP_Slya3f440v9oe1qka?dl=0
>>>>
>>>> One run tracing every bus (0u), one run tracing only the SanDisk stick
>>>> (2u).
>>>> Each trace runs from starting to copy the files to the point where it
>>>> hangs, at which point I attempt to cd into the mount point.
>>>
>>>
>>> I looked at the second trace.  It seems to indicate a bug in the xHCI
>>> host controller hardware or driver.
>>>
>>> Everything is okay almost up to the end.  Here's where the trouble
>>> starts:
>>>
>>>> ffff8802fd7bf180 1135243584 S Bo:2:004:2 -115 31 = 55534243 1cc50200 00100000 80000a28 0000000d e0000008 00000000 000000
>>>> ffff8802fd7bf180 1135243601 C Bo:2:004:2 0 31 >
>>>> ffff88028df546c0 1135243607 S Bi:2:004:1 -115 4096 <
>>>> ffff88028df546c0 1165404890 C Bi:2:004:1 -104 4096 = e83ad4f0 096d0965 b6e2cf54 9165f0e8 2a39d865 8e097d4d 2bef792c e0e7adaf
>>>
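
As an illustration only (not part of the original trace analysis): the
31-byte bulk-out payload in the first line of the excerpt above looks like
a USB mass-storage Bulk-Only Transport Command Block Wrapper (CBW). The
Python sketch below decodes it under that assumption, and further assumes
that the wrapped command is a SCSI READ(10) and that the stick uses
512-byte logical blocks. The function and field names are made up for
this example.

```python
import struct

# Data words of the "S Bo:2:004:2" line, with the spaces removed.
CBW_HEX = ("55534243" "1cc50200" "00100000" "80000a28"
           "0000000d" "e0000008" "00000000" "000000")

BLOCK_SIZE = 512  # assumption: 512-byte logical blocks


def decode_cbw(hex_payload):
    """Decode a 31-byte CBW that is assumed to carry a READ(10) command."""
    raw = bytes.fromhex(hex_payload)
    if len(raw) != 31:
        raise ValueError("a CBW is exactly 31 bytes long")

    # The CBW header is little-endian:
    # signature, tag, data transfer length, flags, LUN, command length.
    signature, tag, data_len, flags, lun, cdb_len = struct.unpack_from(
        "<4sIIBBB", raw, 0)
    if signature != b"USBC":
        raise ValueError("missing 'USBC' signature")

    # The SCSI command block starts at offset 15 and is big-endian:
    # opcode, flags, LBA, group number, block count, control.
    cdb = raw[15:15 + cdb_len]
    opcode, _, lba, _, blocks, _ = struct.unpack(">BBIBHB", cdb[:10])

    return {
        "tag": tag,
        "data_len": data_len,
        "direction": "in" if flags & 0x80 else "out",
        "lun": lun,
        "opcode": opcode,
        "lba": lba,
        "blocks": blocks,
    }


cbw = decode_cbw(CBW_HEX)
for name, value in cbw.items():
    print(name, value)
print("bytes requested by CDB:", cbw["blocks"] * BLOCK_SIZE)
```

Under those assumptions the command asks for 8 blocks, which agrees with
the 4096-byte data length in the wrapper and with the 4096-byte Bi URB
submitted right after it.
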
>>>
>>> This shows the computer trying to read 4 KB of data from the device.
>>> All of the data was received okay, but for some reason the transfer
>>> didn't end properly.  Instead, it timed out after 30 seconds and was
>>> cancelled.  That's the fundamental bug.
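
To make the 30-second figure concrete, here is a small Python sketch that
pairs submission (S) and completion (C) lines by URB tag and reports how
long each URB was outstanding. It assumes the usbmon text format with the
timestamp in microseconds in the second column, and it ignores the data
words. parse_line and completion_latencies are hypothetical helper names,
not part of any existing tool.

```python
# The four lines from the excerpt, one event per line.
TRACE = """\
ffff8802fd7bf180 1135243584 S Bo:2:004:2 -115 31 = 55534243 1cc50200 00100000 80000a28 0000000d e0000008 00000000 000000
ffff8802fd7bf180 1135243601 C Bo:2:004:2 0 31 >
ffff88028df546c0 1135243607 S Bi:2:004:1 -115 4096 <
ffff88028df546c0 1165404890 C Bi:2:004:1 -104 4096 = e83ad4f0 096d0965 b6e2cf54 9165f0e8 2a39d865 8e097d4d 2bef792c e0e7adaf
"""


def parse_line(line):
    """Split one usbmon text line into the fields used here."""
    fields = line.split()
    return {
        "urb": fields[0],                # URB tag (kernel address)
        "timestamp_us": int(fields[1]),  # microseconds
        "event": fields[2],              # S = submit, C = complete
        "address": fields[3],            # type+direction:bus:device:endpoint
        "status": int(fields[4]),
        "length": int(fields[5]),
    }


def completion_latencies(lines):
    """Return (address, status, seconds) for each completed URB."""
    pending = {}
    results = []
    for line in lines:
        if not line.strip():
            continue
        event = parse_line(line)
        if event["event"] == "S":
            pending[event["urb"]] = event
        elif event["event"] == "C" and event["urb"] in pending:
            submitted = pending.pop(event["urb"])
            elapsed_us = event["timestamp_us"] - submitted["timestamp_us"]
            results.append((event["address"], event["status"],
                            elapsed_us / 1e6))
    return results


for address, status, seconds in completion_latencies(TRACE.splitlines()):
    print(f"{address} status={status} outstanding for {seconds:.6f} s")
```

For the Bi URB the two timestamps differ by 30161283 microseconds, i.e.
just over 30 seconds, while the preceding Bo URB completed within 17
microseconds.
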
>>>
>>> Attempts to recover by resetting the device failed (apparently due to a
>>> bug in the device) and from that point on, nothing worked.  The device
>>> kept reporting failures for each command, but with no error code.
>>>
>>> Since the original problem looks like an xHCI-related issue, maybe
>>> Mathias can suggest some things to try.
>>>
>>
>> Does this occur on xhci hosts from other vendors? How about older kernels
>> (before 4.3)?
>>
>> There was a change in the 4.3 kernel (and older stable) in how the xhci
>> driver returns bulk IN URBs. If a transfer is short, the driver won't give
>> back the URB immediately; instead it waits until it gets a completion event
>> for the last transfer block in that transfer descriptor.
>> (We should get the event even if the transfer was short and never filled.)
>>
>> Turns out not all hosts send this second completion event, so that change
>> will be reverted.
>>
>> commit e210c422b6fdd2dc123bedc588f399aefd8bf9de
>>     xhci: don't finish a TD if we get a short transfer event mid TD
>> Even if it is supposed to only affect short transfers with data over 64k,
>> the symptoms you see fit this area. If the xhci driver doesn't return the
>> URB, it will be canceled with ECONNRESET (-104) status.
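
For readers not used to the raw numbers in the usbmon status column, the
following Python sketch maps a few of them to their Linux errno names. The
table is deliberately tiny and hand-written (values as on x86 Linux), the
one-line descriptions are a paraphrase rather than official wording, and
describe_urb_status is a hypothetical helper name.

```python
# Hand-written subset of Linux errno values that show up as URB status.
# Assumption: numbering as on x86 Linux; descriptions are paraphrased.
URB_STATUS = {
    0: ("OK", "completed without error"),
    -2: ("ENOENT", "URB was unlinked synchronously"),
    -32: ("EPIPE", "endpoint stalled"),
    -104: ("ECONNRESET", "URB was unlinked (cancelled) asynchronously"),
    -115: ("EINPROGRESS", "URB submitted and still pending"),
}


def describe_urb_status(status):
    """Return a readable description of a usbmon status value."""
    name, meaning = URB_STATUS.get(status, ("unknown", "not in this table"))
    return f"{status}: {name} ({meaning})"


# The status values that appear in the quoted trace excerpt.
for status in (-115, 0, -104):
    print(describe_urb_status(status))
```

In the trace, -115 only marks the URB as in flight at submission time; the
-104 on the completion line is the cancellation described above.
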
>>
>> Does verbose debugging for xhci show anything?  Enable it with:
>> echo -n 'module xhci_hcd =p' > /sys/kernel/debug/dynamic_debug/control
>>
>> -Mathias
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html