Re: Spurious Mass Storage Device Resets

Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> · Fri, 4 Mar 2016 16:52:08 -0500 (EST)

On Fri, 4 Mar 2016, Rian Hunter wrote:

> Thanks for the great tip. I used tcpdump on the bus of my device and
> waited for a couple days for the effect to happen again.
> 
> I found that whenever I would get the "usb 2-3: reset SuperSpeed USB
> device number 2 using xhci_hcd" what was happening at the protocol
> level was:
> 
> HOST: DO READ
> DEVICE: CONFIRMED
> HOST: SEND ME DATA
> DEVICE: <DATA>
> HOST: SEND ME STATUS
> DEVICE: SCSI CHECK CONDITION
> 
> After the "Check Condition" the stack would initiate a USB reset.

More details would help, such as the actual usbmon output for one of 
those failed commands (plus the following data).
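
In case it helps when picking those lines out of the trace: assuming the
device uses the Bulk-Only Transport, the "SEND ME STATUS" step is answered
by a 13-byte Command Status Wrapper (CSW), and a failed command such as
one ending in CHECK CONDITION shows up there as status 01.  Here is a rough
Python sketch for decoding one from the hex data words that usbmon prints.
The names parse_csw and CSW_STATUS are made up for illustration; nothing
here is taken from the kernel.

```python
import struct

# "USBS" (bytes 55 53 42 53) read as a little-endian 32-bit value.
CSW_SIGNATURE = 0x53425355
# Status byte meanings defined by the Bulk-Only Transport spec.
CSW_STATUS = {0: "passed", 1: "failed", 2: "phase error"}


def parse_csw(hex_words):
    """Decode a 13-byte CSW from usbmon-style hex words.

    hex_words is the data portion of a usbmon line, for example
    "55534253 2a000000 00000000 01" (hypothetical sample data).
    """
    raw = bytes.fromhex(hex_words.replace(" ", ""))
    if len(raw) != 13:
        raise ValueError("a CSW is exactly 13 bytes")
    # Layout: signature, tag, data residue (32-bit LE each), status (1 byte).
    signature, tag, residue, status = struct.unpack("<IIIB", raw)
    if signature != CSW_SIGNATURE:
        raise ValueError("bad signature, not a CSW")
    return {
        "tag": tag,
        "residue": residue,
        "status": CSW_STATUS.get(status, "reserved"),
    }
```

The tag matches the one in the command wrapper that started the transfer,
so it can be used to pair each status with its READ command.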

> Now that was all well and fine, no interruptions in service. What
> eventually caused the entire device to disconnect was this sequence:
> 
> HOST: DO READ
> DEVICE: CONFIRMED
> HOST: SEND ME DATA
> DEVICE: <DATA>
> HOST: SEND ME STATUS
> (300 seconds pass...)
> DEVICE: USB URB ECONNRESET
> 
> I believe what is happening here is that the firmware of the bridge
> device timed out waiting for the SCSI status coming from the actual
> HDD (after enough "check condition" return codes the device decided to
> die). The bridge firmware sends a final ECONNRESET then it decides to
> disconnect completely.

No, that doesn't sound right.  SCSI commands typically have a 30-second 
timeout, so there should have been a reset after 30 seconds, not a 
disconnect after 300.

> None of this surprises me. The theory that the drives are bad seems
> to have more data behind it.
> 
> From "https://en.wikipedia.org/wiki/SCSI_check_condition" it says that
> when a "check condition" status is returned, the device goes into
> a special "contingent allegiance condition" state and the host *should*
> retrieve more information using a "Request Sense" command. The Linux
> stack does not seem to be doing this.

Not true.  It does do this, very faithfully.

> While it probably couldn't save
> the device, it would be extremely valuable information to see in the
> logs, especially when the HDDs are operating normally otherwise. Probably
> would be worth it to add a special case in usb_stor_invoke_transport()
> to handle the "check condition" error more gracefully.

That code is already in there.

> At this point I'd really like to see what information is contained in the
> "Request sense" response. Going to hunt for a user-space solution to pull
> that for now before I order new drives :)

It should all be in your usbmon output.  If you would post that here 
(just the relevant portions, not the whole several-day-long trace), I 
could tell you more.
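
If you want to take a first look yourself, the data returned by the
REQUEST SENSE that follows each failed command carries a sense key plus
ASC/ASCQ codes.  Below is a minimal Python sketch for pulling those out of
the hex words in the trace.  It assumes the standard SCSI sense layouts
(fixed format, response code 70h/71h, and descriptor format, 72h/73h);
decode_sense and SENSE_KEYS are illustrative names of my own, and the table
lists only the common sense keys.

```python
# Common SCSI sense keys (not an exhaustive table).
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
    0x7: "DATA PROTECT",
    0xB: "ABORTED COMMAND",
}


def decode_sense(hex_words):
    """Extract sense key, ASC and ASCQ from usbmon-style hex words."""
    raw = bytes.fromhex(hex_words.replace(" ", ""))
    if not raw:
        raise ValueError("no sense data")
    code = raw[0] & 0x7F  # the top bit of byte 0 is not part of the code
    if code in (0x70, 0x71):
        # Fixed format: key in byte 2, ASC in byte 12, ASCQ in byte 13.
        if len(raw) < 14:
            raise ValueError("fixed-format sense data too short")
        key, asc, ascq = raw[2] & 0x0F, raw[12], raw[13]
    elif code in (0x72, 0x73):
        # Descriptor format: key in byte 1, ASC in byte 2, ASCQ in byte 3.
        if len(raw) < 4:
            raise ValueError("descriptor-format sense data too short")
        key, asc, ascq = raw[1] & 0x0F, raw[2], raw[3]
    else:
        raise ValueError("unknown sense response code")
    return {
        "key": SENSE_KEYS.get(key, "key 0x%x" % key),
        "asc": asc,
        "ascq": ascq,
    }
```

The ASC/ASCQ pair can then be looked up in the T10 code list; for a failing
disk the sense key alone is often telling.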

Alan Stern
