On Thu, 25 Oct 2018, Mathias Nyman wrote:

> On 25.10.2018 12:52, Hao Wei Tee wrote:
> > On 25/10/18 4:45 PM, Mathias Nyman wrote:
> >> Reproducing the issue with a recent kernel with xhci traces enabled should show the reason for EPROTO error.
> >>
> >> Add xhci traces before triggering the issue with:
> >>
> >> mount -t debugfs none /sys/kernel/debug
> >> echo 81920 > /sys/kernel/debug/tracing/buffer_size_kb
> >> echo 1 > /sys/kernel/debug/tracing/events/xhci-hcd/enable
> >>
> >> after issue is triggered save and send the trace at /sys/kernel/debug/tracing/trace
> >> Note that it might be huge
> >
> > Thanks for the suggestion.
> >
> > Here[1] is (part of) the trace starting about 250 lines before the EPROTO happens.
> >
> > [1]: https://gist.githubusercontent.com/angelsl/fdd04d2bded3a41029122b0536c00944/raw/b8e9f7d2695ac030b7f3dd53a1a9c3f37da7b7a0/trace
> >
> > The first error happens at line 243 (timestamp 8144.248398) coinciding with the start of errors spewed into dmesg:
> >
> > [ 8144.245359] r8152 2-2:1.0 enp0s20f0u2: Rx status -71
> > [ 8144.248837] r8152 2-2:1.0 enp0s20f0u2: Rx status -71
> > [ 8144.252392] r8152 2-2:1.0 enp0s20f0u2: Rx status -71
> > [ 8144.255987] r8152 2-2:1.0 enp0s20f0u2: Stop submitting intr, status -71
>
> Thanks,
> xHC controller reports that there was a transaction error on one of the bulk TRBs.
>
> The transaction error causes the endpoint to halt (host side halt only).
> Xhci driver resets the host side endpoint to recover from the halt,
> then returns the broken URB (TRB) with -EPROTO status, and then moves past this TRB.

The host side of the endpoint should remain stopped until after the
URB's completion routine has had a chance to carry out error recovery.
Doesn't this imply the xHCI driver shouldn't reset the host-side
endpoint until after the giveback call returns?

> Interesting thing here is that each TRB in the queue after the transaction error
> also triggers a transaction error.
>
> This might be a data toggle/sequence number sync issue.

It's more likely to be a problem on the device side.  Data toggle or
sequence number issues tend to be self-repairing (albeit with some data
loss) after a little while.

> The host side endpoint reset clears the host side sequence number,
> and host expects device side endpoint to be reset and sequence to be cleared as well
> as a result of returning -EPROTO.
> If I remember correctly xhci driver does not wait for device side endpoint to be reset,
> so if there are TRBs in the queue they will be transferred, with a cleared sequence number
> out of sync with the device side.

That's why it's important to wait until after the higher-layer driver
has had a chance to unlink the URBs that may be in the endpoint queue.
The driver may even want to reset the device.

> There is a patch in usb-next that might help.
> f8f80be xhci: Use soft retry to recover faster from transaction errors
>
> It soft resets the halted host side endpoint, clears the halt without clearing the sequence number.
>
> -Mathias

Alan Stern