Re: [PATCH 0/2] xhci: Fix the NEC stop bug workaround

Mathias Nyman <mathias.nyman@xxxxxxxxxxxxxxx> · Thu, 31 Oct 2024 16:22:14 +0200

On 31.10.2024 13.17, Michał Pecio wrote:
Update:

Your patch prints one dev_dbg() each time, mine spams many of them for
100ms each time. I will remove this one retry limit from your patch to
see if it starts spinning infinitely, but I strongly suspect it will.

Yes, that's exactly what happens.

This time I have killed the ifconfig loop, unplugged the NIC and
started 'rmmod xhci_pci', which is still hanging 10 minutes later.

So business as usual when these things go wrong.

One retry is not enough. This is what I got on the first try with a
random UVC webcam:
[...]

Ok, good to know, then using a flag is not enough.

Using a retry timeout for failed Stop Endpoint commands also sounds good
to me.
It has the slight downside of a possible aggressive 100ms 'Stop Endpoint'
retry loop in cases where the endpoint was stopped earlier for some other reason.

Not sure if that is a problem. If it is, then we can add the flag and only
retry for 100ms if the flag is set (only clear the flag in handle_tx_event()).
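
To make the idea concrete, here is a rough userspace sketch of the decision
logic, not driver code. All names (ep_model, model_should_retry_stop() etc.)
are hypothetical, and it assumes the flag is set when the endpoint doorbell
is rung to restart the endpoint, which is not settled in this thread. The
only part taken from the discussion is the 100ms window and that the flag is
cleared only from the transfer event path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The 100ms retry window discussed above. */
#define STOP_RETRY_WINDOW_MS 100

/* Hypothetical per-endpoint bookkeeping; NOT the real struct xhci_virt_ep. */
struct ep_model {
	bool     restart_pending;    /* the "flag" */
	bool     stop_retrying;      /* a retry window is currently open */
	uint64_t stop_first_fail_ms; /* when the first Stop Endpoint failed */
};

/* Assumption: the flag is set when the doorbell is rung to restart the ep. */
void model_ring_doorbell(struct ep_model *ep)
{
	ep->restart_pending = true;
}

/* Stands in for handle_tx_event(): the only place that clears the flag. */
void model_transfer_event(struct ep_model *ep)
{
	ep->restart_pending = false;
}

/* Call when a Stop Endpoint command finally succeeds or is given up on. */
void model_stop_completed(struct ep_model *ep)
{
	ep->stop_retrying = false;
}

/* Should a failed Stop Endpoint command be queued again? */
bool model_should_retry_stop(struct ep_model *ep, uint64_t now_ms)
{
	assert(ep);

	/* Flag clear: the endpoint was stopped for some other reason. */
	if (!ep->restart_pending)
		return false;

	/* First failure opens the retry window. */
	if (!ep->stop_retrying) {
		ep->stop_retrying = true;
		ep->stop_first_fail_ms = now_ms;
	}

	return now_ms - ep->stop_first_fail_ms < STOP_RETRY_WINDOW_MS;
}
```

With the flag clear this never retries, so the aggressive loop only runs in
the case where a restart is actually pending.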

Another reason for the flag is the additional note in xhci 4.8.3 [1]; we might
need to track the state better in software.

[1] xhci 4.8.3 Endpoint Context state

"There are several cases where the EP State field in the Output Endpoint Context
may not reflect the current state of an endpoint. The xHC should attempt to
keep EP State as current as possible, however it may defer these updates to
perform higher priority references to memory, e.g. Isoch data transfers, etc.
Software should maintain an internal variable that tracks the state of an
endpoint and not depend on EP State to represent the instantaneous state of
an endpoint.
For example, when a Command that affects EP State is issued, the value of EP
State may be updated anytime between when software rings the Command
Ring doorbell for a command and when the associated Command Completion
Event is placed on the Event Ring by the xHC. The update of EP State may also
be delayed relative to a Doorbell ring or error condition (e.g. TRB Error, STALL,
or USB Transaction Error) that causes an EP State change not generated by a
command.

Software should maintain an accurate value for EP State, by tracking it with an
internal variable that is driven by Events and Doorbell accesses associated with
an endpoint using the following method:

• When a command is issued to an endpoint that affects its state, software
should use the Command Completion Event to update its image of EP State
to the appropriate state.
• When a Transfer Event reports a TRB Error, software should update its image
of EP State to Error.
• When a Transfer Event reports a Stall Error or USB Transaction Error,
software should update its image of EP State to Halted.
• When software rings the Doorbell of an endpoint to transition it from the
Stopped to Running state, it should update its image of EP State to Running."
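
The four rules quoted above could be sketched roughly like this. This is a
standalone illustration only; the type and function names are made up and do
not exist in the xhci driver. The enum values follow the EP State encoding in
the Endpoint Context (0 Disabled, 1 Running, 2 Halted, 3 Stopped, 4 Error).

```c
#include <assert.h>

/* Software image of EP State; names are hypothetical. */
enum shadow_ep_state {
	SHADOW_EP_DISABLED = 0,
	SHADOW_EP_RUNNING  = 1,
	SHADOW_EP_HALTED   = 2,
	SHADOW_EP_STOPPED  = 3,
	SHADOW_EP_ERROR    = 4,
};

/* Only the transfer completion codes the quoted text mentions. */
enum xfer_code {
	XFER_SUCCESS,
	XFER_TRB_ERROR,
	XFER_STALL_ERROR,
	XFER_USB_TRANSACTION_ERROR,
};

struct ep_shadow {
	enum shadow_ep_state state;
};

/* Rule 1: a Command Completion Event moves the image to the state the
 * command leads to (e.g. Stopped after a successful Stop Endpoint). */
void shadow_command_completed(struct ep_shadow *ep,
			      enum shadow_ep_state new_state)
{
	assert(ep);
	ep->state = new_state;
}

/* Rules 2 and 3: Transfer Events reporting errors. */
void shadow_transfer_event(struct ep_shadow *ep, enum xfer_code code)
{
	assert(ep);
	switch (code) {
	case XFER_TRB_ERROR:
		ep->state = SHADOW_EP_ERROR;
		break;
	case XFER_STALL_ERROR:
	case XFER_USB_TRANSACTION_ERROR:
		ep->state = SHADOW_EP_HALTED;
		break;
	default:
		break; /* other codes do not change the image in this sketch */
	}
}

/* Rule 4: ringing the doorbell moves Stopped to Running. */
void shadow_ring_doorbell(struct ep_shadow *ep)
{
	assert(ep);
	if (ep->state == SHADOW_EP_STOPPED)
		ep->state = SHADOW_EP_RUNNING;
}
```

Note that in this sketch a doorbell ring on a Halted endpoint leaves the
image unchanged, since the quoted text only covers the Stopped to Running
transition.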

Thanks
-Mathias