On Mon, Feb 27, 2012 at 08:18:13PM +0100, guillaume.jaouen@xxxxxxx wrote: > Hello Jonathan, > > I followed your advices and try to track the usb 3.0 bug occuring during large transfer on usb 3.0 device with kernel 3.2.0.1. > > You'll find in attachements as asked : > > - full "dmesg" output from booting > - "lsusb -vvv" output Can you also run `sudo lspci -vvv` for me, so I know which host controller you have? > - syslog output (demonstrating what's happening exactly when file transfer crash at Feb 27 10:28:04) > - steps to reproduce : Just start 10~20 ftp sessions in order to transfer 60GB of data to an usb3.0 hdd I assume this is your USB 3.0 hard drive with the Lucid Port controller? > Bus 007 Device 002: ID 1759:5002 LucidPort Technology, Inc. > Device Descriptor: > bLength 18 > bDescriptorType 1 > bcdUSB 3.00 > bDeviceClass 0 (Defined at Interface level) > bDeviceSubClass 0 > bDeviceProtocol 0 > bMaxPacketSize0 9 > idVendor 0x1759 LucidPort Technology, Inc. > idProduct 0x5002 And it looks like you have an identical device attached? > Bus 007 Device 003: ID 1759:5002 LucidPort Technology, Inc. > Device Descriptor: > bLength 18 > bDescriptorType 1 > bcdUSB 3.00 > bDeviceClass 0 (Defined at Interface level) > bDeviceSubClass 0 > bDeviceProtocol 0 > bMaxPacketSize0 9 > idVendor 0x1759 LucidPort Technology, Inc. > idProduct 0x5002 > bcdDevice 21.54 > iManufacturer 1 USB 3.0 > iProduct 2 MassStorage Device > iSerial 3 JP2921HQ0T1MHA Are you trying to run raid across the two USB 3.0 drives? > - how reproducible it is : 100% of my tests on my computer produces a > crash like the one describe in previous syslog details By "crash" do you mean error messages from SMART like these? > Feb 27 07:36:53 debian kernel: [47463.948457] xhci_hcd 0000:0d:00.0: WARN: Stalled endpoint > Feb 27 07:36:53 debian kernel: [47463.949672] xhci_hcd 0000:0d:00.0: WARN: Stalled endpoint > Feb 27 07:36:53 debian anacron[12322]: Job `cron.daily' terminated (exit status: 1) (mailing output) > Feb 27 07:39:01 debian /USR/SBIN/CRON[13561]: (root) CMD ( [ -x /usr/lib/php5/maxlifetime ] && [ -d /var/lib/php5 ] && find /var/lib/php5/ -depth -mindepth 1 -maxdepth 1 -type f -cmin +$(/usr/lib/php5/maxlifetime) ! -execdir fuser -s {} 2>/dev/null \; -delete) > Feb 27 07:41:06 debian anacron[12322]: Normal exit (1 job run) > Feb 27 07:57:59 debian smartd[3607]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 117 to 118 > Feb 27 07:57:59 debian smartd[3607]: Device: /dev/sda [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 53 to 54 > Feb 27 07:57:59 debian kernel: [48730.038307] xhci_hcd 0000:0d:00.0: WARN: Stalled endpoint > Feb 27 07:57:59 debian smartd[3607]: Device: /dev/sdb [SAT], failed to read SMART Attribute Data > Feb 27 07:57:59 debian kernel: [48730.040680] xhci_hcd 0000:0d:00.0: WARN: Stalled endpoint > Feb 27 07:57:59 debian kernel: [48730.043291] xhci_hcd 0000:0d:00.0: WARN: Stalled endpoint > Feb 27 07:57:59 debian smartd[3607]: Device: /dev/sdb [SAT], Read SMART Self Test Log Failed > Feb 27 07:57:59 debian kernel: [48730.047108] xhci_hcd 0000:0d:00.0: WARN: Stalled endpoint > Feb 27 07:57:59 debian smartd[3607]: Device: /dev/sdb [SAT], Read Summary SMART Error Log failed > Feb 27 07:57:59 debian kernel: [48730.048271] xhci_hcd 0000:0d:00.0: WARN: Stalled endpoint > Feb 27 07:57:59 debian kernel: [48730.049410] xhci_hcd 0000:0d:00.0: WARN: Stalled endpoint Or do you mean an actual kernel crash (oops)? Your dmesg didn't contain a crash (or any other error messages), and I can't get enough info out of your syslog files. Can you please only capture dmesg after running `sudo dmesg -n 8`? > - symptoms : During transfer the kernel xhci_hcd driver complain about > "Stalled endpoint, then the transfer became to time out "Transfer > error on endpoint" That means your USB device stalled several SCSI commands. As I said, that's really not an "error". The USB device just didn't like a particular SCSI command, so it stalled the command. The SCSI core kind of strugs its shoulders and continues talking to the device. So the warnings about stalled endpoints are harmless. You could enable usb-storage debugging and see which SCSI commands the device is stalling, but I don't think that's your underlying issue with the failure with the drives. BTW, what are these errors in your dmesg about? If some userspace program is sending bad SCSI commands to the device, then it's no wonder the device was stalling the commands: > [ 2.265783] md: raid10 personality registered for level 10 > [ 2.284315] mdadm: sending ioctl 800c0910 to a partition! > [ 2.284377] mdadm: sending ioctl 800c0910 to a partition! > [ 2.284436] mdadm: sending ioctl 1261 to a partition! > [ 2.284487] mdadm: sending ioctl 1261 to a partition! > [ 2.284965] mdadm: sending ioctl 1261 to a partition! > [ 2.285016] mdadm: sending ioctl 1261 to a partition! > [ 2.285379] mdadm: sending ioctl 1261 to a partition! > [ 2.285431] mdadm: sending ioctl 1261 to a partition! > [ 2.285809] mdadm: sending ioctl 1261 to a partition! > [ 2.285860] mdadm: sending ioctl 1261 to a partition! Later something like electrical noise caused a transfer error on the bus (or the device might have been simply sending garbage). It's possible the xHCI host controller actually caused the transfer error as well. You could try the USB 3.0 mass storage devices under EHCI, but the results wouldn't be too conclusive, since you're probably not stressing the drive enough because the USB 2.0 speed is 10x slower than USB 3.0. > - how they differ from what you expect : i would expect that transfer > of large data sets even if they're done slowly (1Mb/s due to isp > bandwith restriction) wouldn't time out and should complete correctly. It's possible your USB storage device is just broken, or your host controller. Also, did you mention the USB 3.0 host controller is an Express Card? Because Express Cards are limited to only one PCIe lane, so they will restrict your USB 3.0 bandwidth. A PCIe add-in card is better, and integrated PCIe is best for increased bandwidth. > - background : I had this bug since I used 2.6 kernel branch. Under > 2.6, these bug make the whole kernel freeze and the only way to > recover from it was to hard reset the computer. > - link to the bug log for the full story : http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=596528 That bug report says you were running 2.6.32. That was the first kernel to have USB 3.0 support, and everything was beta. A ton of new features were added and bug fixes when into the kernel in the last two and a half years since 2.6.32 has been released, and I'm not sure how much Debian has backported. I would suggest just using the latest stable kernel instead. Let's focus on fixing your problem with 3.2. > Since I'm under 3.2 kernel branch, things are better as now I only > have to unmount the devices under root, shutdown external hdd, remove > the usb 3.0 express card in order to make the xhci_hcd driver reload > into the kernel, restart external hdd and mount them back. Does it work if you don't eject the express card and just plug the USB 3.0 devices back in? Does it work if you just unload the xHCI driver and reload it? If not, I would suspect your host controller is hosed (although the xHCI driver possibly could have done something to wedge it). You shouldn't have to eject the express card. I really needed your dmesg when this happened, not the rsyslog: Feb 27 10:28:00 debian smartd[3607]: Device: /dev/sdb [SAT], Read SMART Self Test Log Failed Feb 27 10:28:00 debian smartd[3607]: Device: /dev/sdb [SAT], Read Summary SMART Error Log failed Feb 27 10:28:00 debian kernel: [57730.360624] xhci_hcd 0000:0d:00.0: WARN: Stalled endpoint Feb 27 10:28:00 debian kernel: [57730.361826] xhci_hcd 0000:0d:00.0: WARN: Stalled endpoint Feb 27 10:28:04 debian kernel: [57734.270107] xhci_hcd 0000:0d:00.0: WARN: Stalled endpoint Feb 27 10:28:04 debian kernel: [57734.270478] xhci_hcd 0000:0d:00.0: WARN: transfer error on endpoint Feb 27 10:28:24 debian kernel: [57754.380054] xhci_hcd 0000:0d:00.0: Timeout while waiting for address device command Feb 27 10:28:44 debian kernel: [57774.584078] xhci_hcd 0000:0d:00.0: Timeout while waiting for address device command Feb 27 10:28:44 debian kernel: [57774.788116] usb 7-1: device not accepting address 3, error -62 Feb 27 10:29:04 debian kernel: [57794.900060] xhci_hcd 0000:0d:00.0: Timeout while waiting for reset device command Feb 27 10:29:04 debian kernel: [57794.900074] usb 7-1: Cannot reset HCD device state Feb 27 10:29:24 debian kernel: [57815.124102] xhci_hcd 0000:0d:00.0: Timeout while waiting for reset device command Feb 27 10:29:24 debian kernel: [57815.124116] usb 7-1: Cannot reset HCD device state Feb 27 10:29:45 debian kernel: [57835.236089] xhci_hcd 0000:0d:00.0: Timeout while waiting for reset device command Feb 27 10:29:45 debian kernel: [57835.236103] usb 7-1: Cannot reset HCD device state Feb 27 10:29:45 debian kernel: [57835.236217] usb 7-1: USB disconnect, device number 3 Feb 27 10:29:45 debian kernel: [57835.236417] sd 5:0:0:0: Device offlined - not ready after error recovery Feb 27 10:29:45 debian kernel: [57835.236901] scsi 5:0:0:0: [sdd] killing request Feb 27 10:29:45 debian kernel: [57835.236911] scsi 5:0:0:0: [sdd] killing request Feb 27 10:29:45 debian kernel: [57835.243312] scsi 5:0:0:0: [sdd] Unhandled error code Feb 27 10:29:45 debian kernel: [57835.243320] scsi 5:0:0:0: [sdd] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK Feb 27 10:29:45 debian kernel: [57835.243329] scsi 5:0:0:0: [sdd] CDB: Write(10): 2a 00 dc b9 be a8 00 00 f0 00 Feb 27 10:29:45 debian kernel: [57835.243349] end_request: I/O error, dev sdd, sector 3703160488 Feb 27 10:29:45 debian kernel: [57835.243417] scsi 5:0:0:0: [sdd] Unhandled error code Feb 27 10:29:45 debian kernel: [57835.243422] scsi 5:0:0:0: [sdd] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK Feb 27 10:29:45 debian kernel: [57835.243429] scsi 5:0:0:0: [sdd] CDB: Write(10): 2a 00 dc b9 bf 98 00 00 f0 00 Feb 27 10:29:45 debian kernel: [57835.243448] end_request: I/O error, dev sdd, sector 3703160728 I'm guessing a bit, but it looks like a transfer error happened, and then the usb-storage driver tried to reset the device. Then it looks like the host controller just stopped responding to commands (including the Set Address command and Reset Device command). That's what the "Timeout...command" messages are about. I can't be sure that your host controller is the thing that's broken unless you rebuild your kernel with CONFIG_USB_DEBUGGING and CONFIG_USB_XHCI_HCD_DEBUGGING turned on, and capture the full dmesg starting just before that transfer error. You'll really want to be running 3.3, since that cleaned up a lot of the xHCI driver debugging, and the log file will be much smaller. It's possible, although not likely, that we're over writing the link TRB on the command ring and causing the host controller to step off into lala land and access bad memory. My other theory is that your express card is just broken and can't handle the throughput. Or perhaps it's an early prototype that made it out into the market without having good transfer error support. Sarah Sharp -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html