Re: EPROTO when USB 3 GbE adapters are under load

Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> · Thu, 25 Oct 2018 13:28:20 -0400 (EDT)

On Thu, 25 Oct 2018, Mathias Nyman wrote:

> On 25.10.2018 12:52, Hao Wei Tee wrote:
> > On 25/10/18 4:45 PM, Mathias Nyman wrote:
> >> Reproducing the issue with a recent kernel with xhci traces enabled should show the reason for EPROTO error.
> >>
> >> Add xhci traces before triggering the issue with:
> >>
> >> mount -t debugfs none /sys/kernel/debug
> >> echo 81920 > /sys/kernel/debug/tracing/buffer_size_kb
> >> echo 1 > /sys/kernel/debug/tracing/events/xhci-hcd/enable
> >>
> >> after issue is triggered save and send the trace at /sys/kernel/debug/tracing/trace
> >> Note that it might be huge
> > 
> > Thanks for the suggestion.
> > 
> > Here[1] is (part of) the trace starting about 250 lines before the EPROTO happens.
> > 
> > [1]: https://gist.githubusercontent.com/angelsl/fdd04d2bded3a41029122b0536c00944/raw/b8e9f7d2695ac030b7f3dd53a1a9c3f37da7b7a0/trace
> > 
> > The first error happens at line 243 (timestamp 8144.248398) coinciding with the start of errors spewed into dmesg:
> > 
> > [ 8144.245359] r8152 2-2:1.0 enp0s20f0u2: Rx status -71
> > [ 8144.248837] r8152 2-2:1.0 enp0s20f0u2: Rx status -71
> > [ 8144.252392] r8152 2-2:1.0 enp0s20f0u2: Rx status -71
> > [ 8144.255987] r8152 2-2:1.0 enp0s20f0u2: Stop submitting intr, status -71
> 
> Thanks,
> xHC controller reports that there was a transaction error on one of the bulk TRBs.
> 
> The transaction error causes the endpoint to halt (host side halt only).
> Xhci driver resets the host side endpoint to recover from the halt,
> then returns the broken URB (TRB) with -EPROTO status, and then moves past this TRB.

The host side of the endpoint should remain stopped until after the
URB's completion routine has had a chance to carry out error recovery.  
Doesn't this imply the xHCI driver shouldn't reset the host-side
endpoint until after the giveback call returns?

> Interesting thing here is that each TRB in the queue after the transaction error
> also triggers a transaction error.
>  
> This might be a data toggle/sequence number sync issue.

It's more likely to be a problem on the device side.  Data toggle or
sequence number issues tend to be self-repairing (albeit with some data
loss) after a little while.
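
To illustrate the self-repairing behaviour, here is a toy model in plain C of the USB 2.0 bulk OUT data toggle case only. It is not kernel code, the struct and function names are made up for this sketch, and SuperSpeed sequence numbers are not modelled.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model of USB 2.0 bulk OUT data toggle handling; not kernel code. */
struct toggle_link {
	bool host_toggle;	/* DATA0/DATA1 the host will send next */
	bool dev_toggle;	/* DATA0/DATA1 the device expects next */
};

/*
 * Send one packet.  Returns true if the device accepted the payload.
 * On a toggle mismatch the device takes the packet for a retransmission:
 * it ACKs, drops the payload and leaves its own toggle unchanged.  The
 * host sees the ACK and advances its toggle either way.
 */
static bool send_packet(struct toggle_link *l)
{
	bool accepted = (l->host_toggle == l->dev_toggle);

	if (accepted)
		l->dev_toggle = !l->dev_toggle;
	l->host_toggle = !l->host_toggle;	/* ACK received */
	return accepted;
}

/* Send n packets and count how many payloads the device dropped. */
static size_t count_dropped(struct toggle_link *l, size_t n)
{
	size_t dropped = 0;

	for (size_t i = 0; i < n; i++)
		if (!send_packet(l))
			dropped++;
	return dropped;
}
```

Starting from mismatched toggles, only the first packet is dropped in this model; after that the two sides agree again. That is quite different from every queued TRB failing in turn.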

> The host side endpoint reset clears the host side sequence number,
> and host expects device side endpoint to be reset and sequence to be cleared as well
> as a result of returning -EPROTO.
> If I remember correctly xhci driver does not wait for device side endpoint to be reset,
> so if there are TRBs in the queue they will be transferred, with a cleared sequence number
> out of sync with the device side.

That's why it's important to wait until after the higher-layer driver 
has had a chance to unlink the URBs that may be in the endpoint queue.  
The driver may even want to reset the device.
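
The two orderings can be compared with a toy model in plain C. This is only a sketch: it is not xHCI or usbcore code, every name in it is hypothetical, and it assumes (per the hypothesis quoted above) that a TRB sent with a sequence number the device does not expect ends in a transaction error.

```c
#include <stdbool.h>

/* Toy endpoint state; field names are made up for this sketch. */
struct toy_ep {
	unsigned int host_seq;	/* sequence number the host will use next */
	unsigned int dev_seq;	/* sequence number the device expects next */
	unsigned int queued;	/* TRBs still queued behind the failed one */
	bool halted;		/* host side of the endpoint is halted */
};

/* Hypothetical recovery done by the higher-layer driver's completion routine. */
static void toy_recover(struct toy_ep *ep)
{
	ep->queued = 0;		/* models unlinking every pending URB */
	ep->dev_seq = 0;	/* models clearing the halt on the device side */
}

/*
 * Reset the host side and restart the queue.  Returns how many queued
 * TRBs went out while host and device disagreed on the sequence number.
 */
static unsigned int toy_restart(struct toy_ep *ep)
{
	unsigned int out_of_sync = 0;

	ep->host_seq = 0;	/* host-side reset clears the sequence number */
	ep->halted = false;
	while (ep->queued) {
		if (ep->host_seq != ep->dev_seq) {
			out_of_sync++;	/* assumed: transaction error */
		} else {
			ep->host_seq = (ep->host_seq + 1) % 32;
			ep->dev_seq = (ep->dev_seq + 1) % 32;
		}
		ep->queued--;
	}
	return out_of_sync;
}

/* Ordering described above for xhci: reset first, give back afterwards. */
static unsigned int reset_then_giveback(struct toy_ep *ep)
{
	unsigned int bad = toy_restart(ep);

	toy_recover(ep);	/* too late, the queue has already run */
	return bad;
}

/* Proposed ordering: stay halted until the completion routine is done. */
static unsigned int giveback_then_reset(struct toy_ep *ep)
{
	toy_recover(ep);
	return toy_restart(ep);
}
```

With four TRBs queued and both sides at sequence number 5, reset_then_giveback() reports four out-of-sync transfers in this model, while giveback_then_reset() reports none.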

> There is a patch in usb-next that might help.
> f8f80be xhci: Use soft retry to recover faster from transaction errors
> 
> It soft resets the halted host side endpoint, clears the halt without clearing the sequence number.
> 
> -Mathias

Alan Stern