Re: EPROTO when USB 3 GbE adapters are under load

Mathias Nyman <mathias.nyman@xxxxxxxxxxxxxxx> · Fri, 26 Oct 2018 16:37:31 +0300

On 25.10.2018 20:28, Alan Stern wrote:
On Thu, 25 Oct 2018, Mathias Nyman wrote:

On 25.10.2018 12:52, Hao Wei Tee wrote:
On 25/10/18 4:45 PM, Mathias Nyman wrote:
Reproducing the issue with a recent kernel with xhci traces enabled should show the reason for EPROTO error.

Add xhci traces before triggering the issue with:

mount -t debugfs none /sys/kernel/debug
echo 81920 > /sys/kernel/debug/tracing/buffer_size_kb
echo 1 > /sys/kernel/debug/tracing/events/xhci-hcd/enable

after issue is triggered save and send the trace at /sys/kernel/debug/tracing/trace
Note that it might be huge

Thanks for the suggestion.

Here[1] is (part of) the trace starting about 250 lines before the EPROTO happens.

[1]: https://gist.githubusercontent.com/angelsl/fdd04d2bded3a41029122b0536c00944/raw/b8e9f7d2695ac030b7f3dd53a1a9c3f37da7b7a0/trace

The first error happens at line 243 (timestamp 8144.248398) coinciding with the start of errors spewed into dmesg:

[ 8144.245359] r8152 2-2:1.0 enp0s20f0u2: Rx status -71
[ 8144.248837] r8152 2-2:1.0 enp0s20f0u2: Rx status -71
[ 8144.252392] r8152 2-2:1.0 enp0s20f0u2: Rx status -71
[ 8144.255987] r8152 2-2:1.0 enp0s20f0u2: Stop submitting intr, status -71

Thanks,
xHC controller reports that there was a transaction error on one of the bulk TRBs.

The transaction error causes the endpoint to halt (host side halt only).
Xhci driver resets the host side endpoint to recover from the halt,
then returns the broken URB (TRB) with -EPROTO status, and then moves past this TRB.

The host side of the endpoint should remain stopped until after the
URB's completion routine has had a chance to carry out error recovery.
Doesn't this imply the xHCI driver shouldn't reset the host-side
endpoint until after the giveback call returns?

True, on xhci side we could probably reset the endpoint, and even move the
dequeue pointer to the next TRB, but make sure the endpoint is not restarted yet.

The URB with -EPROTO status is given back in interrupt context, so this might limit
a bit what the higher-layer drivers can do in giveback.

Now thinking about it, xhci driver calls the URB giveback in the same "Transaction error"
interrupt handler, after first queuing areset endpoint and a set TR Deq pointer command.
The endpoint is only restarted after those commands finish, in the command completion interrupt
handler.

So in that sense the endpoint shouldn't be restarted until the next interrupt is handled,
which shouldn't be possible before the URB giveback call returned in the previous interrupt handler.

Well, at least not as long as we are in hard interrupt.

I think I need to dig a bit more into this.

Interesting thing here is that each TRB in the queue after the transaction error
also triggers a transaction error.

This might be a data toggle/sequence number sync issue.

It's more likely to be a problem on the device side.  Data toggle or
sequence number issues tend to be self-repairing (albeit with some data
loss) after a little while.

Ok, thanks, not spending too much time looking into that then.

The host side endpoint reset clears the host side sequence number,
and host expects device side endpoint to be reset and sequence to be cleared as well
as a result of returning -EPROTO.
If I remember correctly xhci driver does not wait for device side endpoint to be reset,
so if there are  TRBs in the queue they will be transferred, with a cleared sequence number
out of sync with the device side.

That's why it's important to wait until after the higher-layer driver
has had a chance to unlink the URBs that may be in the endpoint queue.
The driver may even want to reset the device.

Would it make sense to prevent endpoint from running until usb core calls
hcd->driver->endpoint_reset?
That is for halted endpoints, that returned URB with -EPROTO status.

-Mathias