Hi again Mathias,

I did a system update yesterday (I use Gentoo, so I build everything with distcc) and managed to get it to crash:

[48292.615897] xhci_hcd 0000:01:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
[48292.615902] xhci_hcd 0000:01:00.0: Looking for event-dma 00000001d91f5050 trb-start 00000001d91f5030 trb-end 00000001d91f5030 seg-start 00000001d91f5000 seg-end 00000001d91f5ff0

I suspect this happens when I use a combination of USB devices: it happened while I was browsing the web during a compile. It also happened _after_ I'd finished compiling all packages, while I was running something over NFS; I don't believe the network throughput was very high at that point.

The trace file compresses to about 30 Meg (800 Meg uncompressed), so I will send you a link directly.

Luciano

On Sun, Mar 4, 2018 at 3:07 PM, <ljoublanc@xxxxxxxxx> wrote:
> Hi Mathias,
>
> Thank you kindly for your feedback; I appreciate the time you took to
> help me out.
>
> Unfortunately, in a rather bizarre turn of events, I'm unable to
> reproduce the issue, and I'm stumped as to why it no longer occurs.
> I've tried a number of things, including the distributed compile,
> NFS and copying data with `dd`, and I can see up to a gigabit going
> over the network, but I still can't get it to crash! I've tried with
> the trace both enabled and disabled. Previously it was consistently
> reproducible, occurring within a minute of starting the compile; now
> it ran for a good 30 minutes without interruption. Something has
> obviously changed, but I can't tell what it is.
>
> Admittedly I only tried this with 4.14.22 (not 4.9), but the problem
> did occur with that version a few days ago (I checked the syslogs).
>
> Now that I know what you need to debug this, I will respond to this
> thread if I'm ever able to reproduce it again.
>
> Thanks again,
>
> Luciano
>
> On Fri, Mar 2, 2018 at 12:41 PM, Mathias Nyman
> <mathias.nyman@xxxxxxxxxxxxxxx> wrote:
>> On 01.03.2018 22:56, ljoublanc@xxxxxxxxx wrote:
>>>
>>> I'm a Gentoo Linux user and would like to raise a bug report. I hope
>>> these are the correct channels; if not, I apologise in advance.
>>>
>>> I'm using a Realtek-based USB3 to RJ45 gigabit adapter. It plugs
>>> directly into my laptop, a Skylake-based Toshiba Radius P20W-C-103,
>>> which has the following controller:
>>>
>>> ```
>>> 00:14.0 USB controller: Intel Corporation Sunrise Point-LP USB 3.0 xHCI Controller (rev 21)
>>> ```
>>>
>>> I am seeing the problem on both 4.9.79-r1 and 4.14.22.
>>>
>>> When I plug the device in, unless I disable power management on USB
>>> hubs 3 and 4, I get errors saying 'root hub lost power or was reset'.
>>> If I disable PM using powertop, the device seems to work well.
>>> However, as soon as I start heavy transfers (in my case a distributed
>>> compile), the network device stops responding.
>>>
>>> The error messages I'm receiving are very similar to a bug described
>>> here: https://bugs.launchpad.net/dell-sputnik/+bug/1729674
>>> but I've been told that it's unrelated.
>>>
>>> First, these are the logs showing the dongle being plugged in.
>>> ```
>>> Feb 26 20:17:09 nizuc kernel: usb usb3: root hub lost power or was reset
>>> Feb 26 20:17:09 nizuc kernel: usb usb4: root hub lost power or was reset
>>> Feb 26 20:17:41 nizuc kernel: usb 4-1: new SuperSpeed USB device number 2 using xhci_hcd
>>> Feb 26 20:17:41 nizuc kernel: usb 4-1: New USB device found, idVendor=0bda, idProduct=8153
>>> Feb 26 20:17:41 nizuc kernel: usb 4-1: New USB device strings: Mfr=1, Product=2, SerialNumber=6
>>> Feb 26 20:17:41 nizuc kernel: usb 4-1: Product: USB 10/100/1000 LAN
>>> Feb 26 20:17:41 nizuc kernel: usb 4-1: Manufacturer: Realtek
>>> Feb 26 20:17:41 nizuc kernel: usb 4-1: SerialNumber: 000001
>>> Feb 26 20:17:41 nizuc kernel: usb 4-1: reset SuperSpeed USB device number 2 using xhci_hcd
>>> Feb 26 20:17:41 nizuc NetworkManager[2049]: <info> [1519676261.9009] manager: (eth0): new Ethernet device (/org/freedesktop/NetworkManager/Devices/5)
>>> Feb 26 20:17:41 nizuc kernel: r8152 4-1:1.0 eth0: v1.09.9
>>> Feb 26 20:17:42 nizuc mtp-probe[3673]: checking bus 4, device 2: "/sys/devices/pci0000:00/0000:00:1c.0/0000:01:00.0/usb4/4-1"
>>> Feb 26 20:17:42 nizuc mtp-probe[3673]: bus: 4, device: 2 was not an MTP device
>>> Feb 26 20:17:42 nizuc upowerd[2168]: unhandled action 'bind' on /sys/devices/pci0000:00/0000:00:1c.0/0000:01:00.0/usb4/4-1
>>> Feb 26 20:17:42 nizuc systemd-udevd[3676]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
>>> Feb 26 20:17:42 nizuc kernel: r8152 4-1:1.0 enp1s0u1: renamed from eth0
>>> Feb 26 20:17:42 nizuc NetworkManager[2049]: <info> [1519676262.2645] device (eth0): interface index 4 renamed iface from 'eth0' to 'enp1s0u1'
>>> Feb 26 20:17:42 nizuc kernel: IPv6: ADDRCONF(NETDEV_UP): enp1s0u1: link is not ready
>>> Feb 26 20:17:42 nizuc NetworkManager[2049]: <info> [1519676262.2829] device (enp1s0u1): state change: unmanaged -> unavailable (reason 'managed', internal state 'external')
>>> Feb 26 20:17:42 nizuc upowerd[2168]: unhandled action 'bind' on /sys/devices/pci0000:00/0000:00:1c.0/0000:01:00.0/usb4/4-1/4-1:1.0
>>> Feb 26 20:17:42 nizuc kernel: IPv6: ADDRCONF(NETDEV_UP): enp1s0u1: link is not ready
>>> Feb 26 20:17:46 nizuc kernel: r8152 4-1:1.0 enp1s0u1: carrier on
>>> Feb 26 20:17:46 nizuc kernel: IPv6: ADDRCONF(NETDEV_CHANGE): enp1s0u1: link becomes ready
>>> ```
>>>
>>> But once I begin a distributed compile, the network drops, with the
>>> following messages:
>>>
>>> ```
>>> Feb 26 20:25:32 nizuc distccd[13928]: (dcc_r_token_int) got ARGV0000000f
>>> Feb 26 20:25:32 nizuc distccd[13928]: (dcc_r_token_string) got '-Wa,-mtune=i686'
>>> Feb 26 20:25:32 nizuc distccd[13928]: (dcc_r_argv) argv[22] = "-Wa,-mtune=i686"
>>> Feb 26 20:25:32 nizuc distccd[13928]: (dcc_r_token_int) got ARGV00000011
>>> Feb 26 20:25:32 nizuc kernel: xhci_hcd 0000:01:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
>>> Feb 26 20:25:32 nizuc kernel: xhci_hcd 0000:01:00.0: Looking for event-dma 000000020c8aea40 trb-start 000000020c8aea20 trb-end 000000020c8aea20 seg-start 000000020c8ae000 s
>>
>>
>> From the xHCI (USB3 host controller) point of view, there are a couple
>> of USB transfer request blocks (TRBs) sitting on the transfer ring that
>> were not handled (at ..ea20 and ..ea30).
>> But we get transfer events for the TRBs that follow them, starting at
>> ..ea40.
>>
>> The driver can't handle TRBs finishing out of order.
>>
>>> Feb 26 20:25:32 nizuc kernel: xhci_hcd 0000:01:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
>>> Feb 26 20:25:32 nizuc kernel: xhci_hcd 0000:01:00.0: Looking for event-dma 000000020c8aea50 trb-start 000000020c8aea20 trb-end
>>
>>
>> This was for the next TRB (at ..ea50), so the xHC hardware seems to
>> proceed normally after the two TRBs are skipped.
>>
>>>
>>> I hope somebody here can direct me as to how best to proceed. Should I
>>> try modifying the aforementioned patch and see if it helps (note the
>>> patch only activates for certain USB vendor/device IDs)? Also, the
>>> Ubuntu link suggests that this may be the result of "offloading
>>> checksumming", which can be disabled with ethtool. Should I try
>>> disabling it and see if it makes a difference?
>>> Many thanks for your patience. Please let me know if there is any more
>>> information I can provide.
>>
>>
>> To really know what's happening to those two TRBs, I would need xhci
>> traces. The trace file will be huge; use a 4.14 or later kernel:
>>
>> mount -t debugfs none /sys/kernel/debug
>> echo 81920 > /sys/kernel/debug/tracing/buffer_size_kb
>> echo 1 > /sys/kernel/debug/tracing/events/xhci-hcd/enable
>>
>> Run your distributed compile until you see the
>> "ERROR Transfer event TRB DMA ptr not part of current TD" message,
>> and then send me the contents of /sys/kernel/debug/tracing/trace
>>
>> Thanks
>> -Mathias
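Mathias's reading of the "Looking for event-dma ..." lines comes down to simple address arithmetic: subtract trb-start (where the TD the driver expects begins) from event-dma (the TRB the controller actually reported) and divide by the 16-byte size of an xHCI TRB. Below is a minimal POSIX shell sketch of that arithmetic. The helper names `hex_after`, `trb_gap` and `skipped_trbs` are hypothetical and not part of any kernel tool, and the sketch assumes both addresses lie in the same ring segment, as they do in every log line quoted in this thread.

```shell
#!/bin/sh
# Hypothetical helpers (not part of any kernel tool) that redo the address
# arithmetic from xhci_hcd "Looking for event-dma ... trb-start ..." lines.
# Assumption: event-dma and trb-start lie in the same ring segment, so a
# plain subtraction is meaningful (no link TRB / segment wrap handling).

TRB_SIZE=16 # an xHCI TRB is four 32-bit dwords = 16 bytes

# hex_after KEYWORD LINE: print the hex number that follows "KEYWORD " in LINE
hex_after() {
    printf '%s\n' "$2" | sed -n "s/.*$1 \([0-9a-fA-F][0-9a-fA-F]*\).*/\1/p"
}

# trb_gap LINE: print how many TRBs the reported event-dma lies past the
# trb-start of the TD the driver was expecting to complete.
trb_gap() {
    event=$(hex_after event-dma "$1")
    start=$(hex_after trb-start "$1")
    if [ -z "$event" ] || [ -z "$start" ]; then
        echo "trb_gap: no event-dma/trb-start in input" >&2
        return 1
    fi
    echo $(( (0x$event - 0x$start) / TRB_SIZE ))
}

# skipped_trbs LINE: list the TRB addresses from trb-start (inclusive) up to
# event-dma (exclusive), i.e. the TRBs no transfer event was seen for.
skipped_trbs() {
    event=$(hex_after event-dma "$1")
    start=$(hex_after trb-start "$1")
    [ -n "$event" ] && [ -n "$start" ] || return 1
    addr=$((0x$start))
    while [ "$addr" -lt $((0x$event)) ]; do
        printf '%016x\n' "$addr"
        addr=$((addr + TRB_SIZE))
    done
}
```

For the quoted line with event-dma ...ea40 and trb-start ...ea20, `trb_gap` prints 2 and `skipped_trbs` lists ...ea20 and ...ea30, the two unhandled TRBs Mathias points out. The ...ea50 line gives a gap of 3, and the new log at the top of this thread (...5050 against ...5030) again gives a gap of 2.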