Re: Seagate External SMR drive USB resets (XHCI transfer error, not timeout)

Jérôme Carretero <cJ-ko@xxxxxxxxxxx> · Wed, 15 Nov 2017 23:21:29 -0500

Hi,

On Wed, 15 Nov 2017 18:17:08 -0500
Jérôme Carretero <cJ-ko@xxxxxxxxxxx> wrote:

> Hi,
> 
> 
> On Thu, 16 Nov 2017 07:40:08 +0900
> James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> wrote:
> 
> > On Wed, 2017-11-15 at 17:02 -0500, Alan Stern wrote:  
> > > On Wed, 15 Nov 2017, Jérôme Carretero wrote:    
> > > > >   Because with several of these drives / lots of
> > > > >activity /    
> > > occasional    
> > > > >   issues, it looks like it will be hard to catch (yes I can
> > > > >use
> > > > > usbmon).
> > > > > 
> > > > > - It looks like there is no configurable timeout for USB
> > > > > MSC    
> > > requests.    
> > > > >   Perhaps the device is not responding in time and this is
> > > > >why    
> > > it's    
> > > > >   reset?    
> > > 
> > > Timeouts are set by the SCSI layer.  I believe they are rather
> > > long (30 seconds, by default).  Presumably they are configurable,
> > > although I would have to do some digging to figure out how.    
> > 
> > They're in /sys/class/scsi_device/<id>/device/timeout  
> 
> 
> I'll use wireshark to check the cause: for sure, the drives are not
> "timing out" after 30 seconds (indeed the reported value
> in /sys/class/scsi_device/.../timeout or /sys/block/*/device/timeout),
> because I see (in dstat) that a disk is busy until the right about the
> moment where its reset message appears.
> 
> Is it possible that the SCSI timeout doesn't get set into an USB URB
> timeout (I'll check by myself, but asking doesn't hurt) ?

I performed an usbmon capture extract, centered around the event
(there was a few hundred MBs written for this to happen):

 Nov 15 22:16:33 Bidule kernel: usb 6-4.3.2.1: reset SuperSpeed USB
  device number 8 using xhci_hcd

I can see that the computer is sending a write request, and sees a
-EPROTO in answer (capture in attachment), so scratch the timeout issue
(and actually when thinking about it, this matches what UAS was saying,
except that UAS was taking ages to recover).

Looked for EPROTO in the usb code, and found a dynamic debug printf in
XHCI; after enabling it:

 Nov 15 22:45:03 Bidule kernel: xhci_hcd 0000:07:00.0: Transfer error for slot 13 ep 3 on endpoint
 Nov 15 22:45:03 Bidule kernel: xhci_hcd 0000:07:00.0: Transfer error for slot 12 ep 3 on endpoint
 Nov 15 22:45:03 Bidule kernel: usb 6-4.3.3.1: reset SuperSpeed USB device number 9 using xhci_hcd
 Nov 15 22:45:03 Bidule kernel: usb 6-4.3.2.1: reset SuperSpeed USB device number 8 using xhci_hcd

First, I understand that a bad USB device could poison the kernel log,
but shouldn't that xhci_dbg() (and others eg. babble) be at least an
xhci_info() (I saw 2a9227a5)?

Then... I don't know enough to attribute the issue the upstream USB hub(s)
or the drive endpoint not behaving properly, or the kernel... what
should I do with these messages?

I'm still filling the drives, will perform a scrub after, to see if
the issue causes data loss...

-- 
Jérôme

PS: BTW, thanks a lot for the help so far.
PPS: It would be so nice if someone from Seagate was reading this.
Attachment:
smr-reset-excerpt.pcapng.gz

Description: application/gzip