Re: [PATCH 0/2] xhci: Fix the NEC stop bug workaround

Michał Pecio <michal.pecio@xxxxxxxxx> · Thu, 31 Oct 2024 09:13:47 +0100

Hi,

I will send out v3 of my own patches soon.

I also plan to research a radically different solution, which is simply
to prevent failed Stop Endpoint in the first place. General idea:

1. If commands pending, "outsource" the work to their handlers.
2. If EP stopped for other reason, invalidate/giveback immediately.
3. If Context State != Stopped, queue Stop Endpoint.
4. If Context State == Stopped but shouldn't, set up a delayed work.
5. The work looks at Context State and goto 3.
6. The work may choose to give up retrying after some time.
7. The work could even act as watchdog and call hc_died() if Set Deq
   or Reset EP get stuck in retry or abort loops (not seen so far).

On Wed, 30 Oct 2024 10:29:12 +0200, Mathias Nyman wrote:
> 1st patch avoids stopping endpoint in urb cancel if Set TR Deq is
> pending
> 2nd patch handles Set TR Deq command ctx error due to running ep.
> 3rd patch tracks doorbell ring with a flag. It's for now only
> used to prevent infinite stop ep retries. Flag is not properly
> cleared in other cases.

Quick comments:

1. To be specific, my suggestion was to make the function work even
   if called under SET_DEQ_PENDING rather than try to avoid this. It
   ends up simpler IMO and solves any risk of calls from other places.

   Then the change in xhci_urb_dequeue() becomes an optimization only,
   which is not required for correct operation, may be combined with
   other similar optimizations, or even reverted if necessary without
   breaking multiple stream cancellations again.

2. Mixed feelings. Adds complexity, obviously. Shouldn't be necessary
   if the retries prove to work on other chips (I have never had a Set
   TR Deq error on NEC with the workaround). Could help otherwise.

3. No need to pollute handle_tx_event() at all, because the flag only
   needs to be guaranteed false when the EP is Stopped, and this can
   only happen by *successful* execution of Stop EP or Reset EP. Easy
   to detect this in handle_cmd_completion().

   But I'm not sure if a new flag is needed at all. Its value will
   simply be negation of the condition in xhci_ring_ep_doorbell(),
   except for cases when we "forget" to ring EP doorbell, which are
   probably bugs that should be fixed.

Bugs need to be handled some other way, because hardware has them too
and it's impossible to predict when they bite. See below.

> Series can be found in my tree in a fix_stop_ep_race branch:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/log/?h=fix_stop_ep_race
> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git
> fix_stop_ep_race branch
> 
> Do these help in your NEC host case?

This looks like it should work on semi-well-behaved HC like NEC, but it
doesn't account for hardware not restarting an EP "just because".

while true; do ifconfig eth0 up; ifconfig eth0 down; done

locks up on ASM3142 with AX88179 adapter as expected, and when the NIC
is disconnected I get those 'ep ctx error, ep still running' every few
seconds. It seems lots of network code got locked up and I can't even
ssh into the box anymore to copy exact dmesg output.

> I'll see if I can set up some system to trigger this myself

Good idea, lot's of fun ;)

Michal