Hi, I will send out v3 of my own patches soon. I also plan to research a radically different solution, which is simply to prevent failed Stop Endpoint in the first place. General idea: 1. If commands pending, "outsource" the work to their handlers. 2. If EP stopped for other reason, invalidate/giveback immediately. 3. If Context State != Stopped, queue Stop Endpoint. 4. If Context State == Stopped but shouldn't, set up a delayed work. 5. The work looks at Context State and goto 3. 6. The work may choose to give up retrying after some time. 7. The work could even act as watchdog and call hc_died() if Set Deq or Reset EP get stuck in retry or abort loops (not seen so far). On Wed, 30 Oct 2024 10:29:12 +0200, Mathias Nyman wrote: > 1st patch avoids stopping endpoint in urb cancel if Set TR Deq is > pending > 2nd patch handles Set TR Deq command ctx error due to running ep. > 3rd patch tracks doorbell ring with a flag. It's for now only > used to prevent infinite stop ep retries. Flag is not properly > cleared in other cases. Quick comments: 1. To be specific, my suggestion was to make the function work even if called under SET_DEQ_PENDING rather than try to avoid this. It ends up simpler IMO and solves any risk of calls from other places. Then the change in xhci_urb_dequeue() becomes an optimization only, which is not required for correct operation, may be combined with other similar optimizations, or even reverted if necessary without breaking multiple stream cancellations again. 2. Mixed feelings. Adds complexity, obviously. Shouldn't be necessary if the retries prove to work on other chips (I have never had a Set TR Deq error on NEC with the workaround). Could help otherwise. 3. No need to pollute handle_tx_event() at all, because the flag only needs to be guaranteed false when the EP is Stopped, and this can only happen by *successful* execution of Stop EP or Reset EP. Easy to detect this in handle_cmd_completion(). But I'm not sure if a new flag is needed at all. Its value will simply be negation of the condition in xhci_ring_ep_doorbell(), except for cases when we "forget" to ring EP doorbell, which are probably bugs that should be fixed. Bugs need to be handled some other way, because hardware has them too and it's impossible to predict when they bite. See below. > Series can be found in my tree in a fix_stop_ep_race branch: > > https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/log/?h=fix_stop_ep_race > git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git > fix_stop_ep_race branch > > Do these help in your NEC host case? This looks like it should work on semi-well-behaved HC like NEC, but it doesn't account for hardware not restarting an EP "just because". while true; do ifconfig eth0 up; ifconfig eth0 down; done locks up on ASM3142 with AX88179 adapter as expected, and when the NIC is disconnected I get those 'ep ctx error, ep still running' every few seconds. It seems lots of network code got locked up and I can't even ssh into the box anymore to copy exact dmesg output. > I'll see if I can set up some system to trigger this myself Good idea, lot's of fun ;) Michal