Hi, On Wed, 15 Nov 2017 18:17:08 -0500 Jérôme Carretero <cJ-ko@xxxxxxxxxxx> wrote: > Hi, > > > On Thu, 16 Nov 2017 07:40:08 +0900 > James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> wrote: > > > On Wed, 2017-11-15 at 17:02 -0500, Alan Stern wrote: > > > On Wed, 15 Nov 2017, Jérôme Carretero wrote: > > > > > Because with several of these drives / lots of > > > > >activity / > > > occasional > > > > > issues, it looks like it will be hard to catch (yes I can > > > > >use > > > > > usbmon). > > > > > > > > > > - It looks like there is no configurable timeout for USB > > > > > MSC > > > requests. > > > > > Perhaps the device is not responding in time and this is > > > > >why > > > it's > > > > > reset? > > > > > > Timeouts are set by the SCSI layer. I believe they are rather > > > long (30 seconds, by default). Presumably they are configurable, > > > although I would have to do some digging to figure out how. > > > > They're in /sys/class/scsi_device/<id>/device/timeout > > > I'll use wireshark to check the cause: for sure, the drives are not > "timing out" after 30 seconds (indeed the reported value > in /sys/class/scsi_device/.../timeout or /sys/block/*/device/timeout), > because I see (in dstat) that a disk is busy until the right about the > moment where its reset message appears. > > Is it possible that the SCSI timeout doesn't get set into an USB URB > timeout (I'll check by myself, but asking doesn't hurt) ? I performed an usbmon capture extract, centered around the event (there was a few hundred MBs written for this to happen): Nov 15 22:16:33 Bidule kernel: usb 6-4.3.2.1: reset SuperSpeed USB device number 8 using xhci_hcd I can see that the computer is sending a write request, and sees a -EPROTO in answer (capture in attachment), so scratch the timeout issue (and actually when thinking about it, this matches what UAS was saying, except that UAS was taking ages to recover). Looked for EPROTO in the usb code, and found a dynamic debug printf in XHCI; after enabling it: Nov 15 22:45:03 Bidule kernel: xhci_hcd 0000:07:00.0: Transfer error for slot 13 ep 3 on endpoint Nov 15 22:45:03 Bidule kernel: xhci_hcd 0000:07:00.0: Transfer error for slot 12 ep 3 on endpoint Nov 15 22:45:03 Bidule kernel: usb 6-4.3.3.1: reset SuperSpeed USB device number 9 using xhci_hcd Nov 15 22:45:03 Bidule kernel: usb 6-4.3.2.1: reset SuperSpeed USB device number 8 using xhci_hcd First, I understand that a bad USB device could poison the kernel log, but shouldn't that xhci_dbg() (and others eg. babble) be at least an xhci_info() (I saw 2a9227a5)? Then... I don't know enough to attribute the issue the upstream USB hub(s) or the drive endpoint not behaving properly, or the kernel... what should I do with these messages? I'm still filling the drives, will perform a scrub after, to see if the issue causes data loss... -- Jérôme PS: BTW, thanks a lot for the help so far. PPS: It would be so nice if someone from Seagate was reading this.
Attachment:
smr-reset-excerpt.pcapng.gz
Description: application/gzip