On 30.6.2020 19.58, Fabian Melzow wrote: > Hi! > > Am Mon, 29 Jun 2020 20:47:24 +0300 > schrieb Mathias Nyman <mathias.nyman@xxxxxxxxxxxxxxx>: > >> First issue I see is that the attempt to recover from a transaction >> error with a soft retry isn't working. We expect the hardware to >> retry the transfer but nothing seems to happen. Soft retry is >> described in xhci specs 4.6.8.1 and is basically a reset endpoint >> command with TSP set, followed by ringing the endpoint doorbell. >> Traces indicate driver does this correctly but hardware isn't >> retrying. We get don't get any event, no error, success or stall. >> >> This could be hardware flaw. >> Any chance you could try this on a xHC from some other vendor? > > There is no other xHC hardware available to me. > >> Second issue is a driver flaw, when nothing happened for 20 seconds >> we see the URB is canceled. xhci driver needs to stop then endpoint >> to cancel the URB, but there is a hw race and endpoint ends up halted >> instead of stopped. The xhci driver can't handle a halted endpoint in >> its stop endpoint handler properly, and the URB is never actually >> removed from the ring. >> >> The reason you see the IO_PAGE_FAULT is probably because once the >> ring starts running the driver will handle the cancelled URB, and >> touch already freed memory: AMD-Vi: Event logged [IO_PAGE_FAULT >> domain=0x000d address=0xdc707028 flags=0x0020] >> >> I have a patch for this second case, I haven't upstreamed it as it >> got some conflicting feedback earlier. It won't solve the 20 second >> delay, but should solve the the IO_PAGE_FAULT and the "WARN Set TR >> Deq Ptr cmd failed due to incorrect slot or ep state" message >> >> Can you try it out? > > I successful applied the patch against Linux 5.7.4, but get this error when > compiling drivers/usb/host/xhci-ring.c: > > CC [M] drivers/usb/host/xhci-ring.o > drivers/usb/host/xhci-ring.c: In function ‘xhci_handle_cmd_stop_ep’: > drivers/usb/host/xhci-ring.c:857:3: error: implicit declaration of function ‘xhci_reset_halted_ep’ [-Werror=implicit-function-declaration] > 857 | xhci_reset_halted_ep(xhci, slot_id, ep_index, reset_type); > | ^~~~~~~~~~~~~~~~~~~~ > Right, forgot that you need another patch before this. both patches attached, also applied to 5.8-rc1 in branch "fix_invalid_context_at_stop_endpoint" git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git fix_invalid_context_at_stop_endpoint -Mathias
From ddb2004cb51fcf5acd668590c56fbf571ca66071 Mon Sep 17 00:00:00 2001 From: Mathias Nyman <mathias.nyman@xxxxxxxxxxxxxxx> Date: Tue, 17 Dec 2019 14:03:34 +0200 Subject: [PATCH 1/2] xhci: crete xhci_reset_halted_ep() helper function Create a separate helper function to only issue reset endpont commands to clear halted endpoints. This will be useful for cases where the a halted endpoint is discovered while completing some other command, and needs to be cleared before continuing. No functional changes Signed-off-by: Mathias Nyman <mathias.nyman@xxxxxxxxxxxxxxx> --- drivers/usb/host/xhci-ring.c | 30 ++++++++++++++++++++++++------ 1 file changed, 24 insertions(+), 6 deletions(-) diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c index 2c255d0620b0..5c223e92b8db 100644 --- a/drivers/usb/host/xhci-ring.c +++ b/drivers/usb/host/xhci-ring.c @@ -708,6 +708,26 @@ static void xhci_unmap_td_bounce_buffer(struct xhci_hcd *xhci, seg->bounce_offs = 0; } +static int xhci_reset_halted_ep(struct xhci_hcd *xhci, unsigned int slot_id, + unsigned int ep_index, enum xhci_ep_reset_type reset_type) +{ + struct xhci_command *command; + int ret = 0; + + command = xhci_alloc_command(xhci, false, GFP_ATOMIC); + if (!command) { + ret = -ENOMEM; + goto done; + } + + ret = xhci_queue_reset_ep(xhci, command, slot_id, ep_index, reset_type); +done: + if (ret) + xhci_err(xhci, "ERROR queuing reset endpoint for slot %d ep_index %d, %d\n", + slot_id, ep_index, ret); + return ret; +} + /* * When we get a command completion for a Stop Endpoint Command, we need to * unlink any cancelled TDs from the ring. There are two ways to do that: @@ -1855,7 +1875,7 @@ static void xhci_cleanup_halted_endpoint(struct xhci_hcd *xhci, enum xhci_ep_reset_type reset_type) { struct xhci_virt_ep *ep = &xhci->devs[slot_id]->eps[ep_index]; - struct xhci_command *command; + int err; /* * Avoid resetting endpoint if link is inactive. Can cause host hang. @@ -1864,13 +1884,11 @@ static void xhci_cleanup_halted_endpoint(struct xhci_hcd *xhci, if (xhci->devs[slot_id]->flags & VDEV_PORT_ERROR) return; - command = xhci_alloc_command(xhci, false, GFP_ATOMIC); - if (!command) - return; - ep->ep_state |= EP_HALTED; - xhci_queue_reset_ep(xhci, command, slot_id, ep_index, reset_type); + err = xhci_reset_halted_ep(xhci, slot_id, ep_index, reset_type); + if (err) + return; if (reset_type == EP_HARD_RESET) { ep->ep_state |= EP_HARD_CLEAR_TOGGLE; -- 2.17.1
From 9483b48fd7167358d76d86d48790df4301ec7b43 Mon Sep 17 00:00:00 2001 From: Mathias Nyman <mathias.nyman@xxxxxxxxxxxxxxx> Date: Tue, 7 Jan 2020 16:12:17 +0200 Subject: [PATCH 2/2] xhci: fix halted endpoint at stop endpoint command completion xhci 4.6.9: A Busy endpoint may asynchronously transition from the Running to the Halted or Error state due to error conditions detected while processing TRBs. A possible race condition may occur if software, thinking an endpoint is in the Running state, issues a Stop Endpoint Command however at the same time the xHC asynchronously transitions the endpoint to the Halted or Error state. In this case, a Context State Error may be generated for the command completion. Software may verify that this case occurred by inspecting the EP State for Halted or Error when a Stop Endpoint Command results in a Context State Error. Fix this case by resetting the halted endpoint after cleaning up the calcelled trbs from the ring. If the TRB we halted on was cancelled then queue a new set TR dequeue pointer command as usually. If it wasn't cancelled then move past it with a set TR dequeue pointer and give it back with -EPIPE status as in a normal halted endpoint case. Signed-off-by: Mathias Nyman <mathias.nyman@xxxxxxxxxxxxxxx> --- drivers/usb/host/xhci-ring.c | 56 ++++++++++++++++++++++++++++++++++++ 1 file changed, 56 insertions(+) diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c index 5c223e92b8db..ceb3fac3f1c9 100644 --- a/drivers/usb/host/xhci-ring.c +++ b/drivers/usb/host/xhci-ring.c @@ -745,11 +745,13 @@ static void xhci_handle_cmd_stop_ep(struct xhci_hcd *xhci, int slot_id, struct xhci_ring *ep_ring; struct xhci_virt_ep *ep; struct xhci_td *cur_td = NULL; + struct xhci_td *halted_td = NULL; struct xhci_td *last_unlinked_td; struct xhci_ep_ctx *ep_ctx; struct xhci_virt_device *vdev; u64 hw_deq; struct xhci_dequeue_state deq_state; + u32 comp_code; if (unlikely(TRB_TO_SUSPEND_PORT(le32_to_cpu(trb->generic.field[3])))) { if (!xhci->devs[slot_id]) @@ -764,9 +766,19 @@ static void xhci_handle_cmd_stop_ep(struct xhci_hcd *xhci, int slot_id, vdev = xhci->devs[slot_id]; ep_ctx = xhci_get_ep_ctx(xhci, vdev->out_ctx, ep_index); + trace_xhci_handle_cmd_stop_ep(ep_ctx); ep = &xhci->devs[slot_id]->eps[ep_index]; + comp_code = GET_COMP_CODE(le32_to_cpu(event->status)); + + if (comp_code == COMP_CONTEXT_STATE_ERROR) { + /* endpoint is halted and needs to be reset */ + if (GET_EP_CTX_STATE(ep_ctx) == EP_STATE_HALTED) { + ep->ep_state |= EP_HALTED; + } + } + last_unlinked_td = list_last_entry(&ep->cancelled_td_list, struct xhci_td, cancelled_td_list); @@ -833,16 +845,60 @@ static void xhci_handle_cmd_stop_ep(struct xhci_hcd *xhci, int slot_id, xhci_stop_watchdog_timer_in_irq(xhci, ep); + /* + * If stop endpoint command raced with a halting endpoint we need to + * reset the endpoint first. If the TD we halted on isn't cancelled we + * must give it back with -EPIPE status, and move ring dequeue past it. + * If we can't find hw_deq, or the TD we halted on, do a soft reset + */ + /* FIXME, is there a risk EP_HALTED is set from other cases */ + if (ep->ep_state & EP_HALTED) { + enum xhci_ep_reset_type reset_type = EP_SOFT_RESET; + struct xhci_td *td; + + if (deq_state.new_deq_ptr && deq_state.new_deq_seg) { + reset_type = EP_HARD_RESET; + } else if (ep->ep_state & EP_HAS_STREAMS) { + /* soft reset, nothing else */ + } else if (!list_empty(&ep->ring->td_list)) { + hw_deq = xhci_get_hw_deq(xhci, vdev, ep_index, 0); + hw_deq &= ~0xf; + td = list_first_entry(&ep->ring->td_list, + struct xhci_td, td_list); + if (trb_in_td(xhci, td->start_seg, td->first_trb, + td->last_trb, hw_deq, false)) { + halted_td = td; + reset_type = EP_HARD_RESET; + xhci_find_new_dequeue_state(xhci, slot_id, + ep_index, 0, td, + &deq_state); + } + } + xhci_reset_halted_ep(xhci, slot_id, ep_index, reset_type); + /* FIXME xhci_clear_hub_tt_buffer(xhci, td, ep); */ + } + /* If necessary, queue a Set Transfer Ring Dequeue Pointer command */ if (deq_state.new_deq_ptr && deq_state.new_deq_seg) { xhci_queue_new_dequeue_state(xhci, slot_id, ep_index, &deq_state); xhci_ring_cmd_db(xhci); + } else if (ep->ep_state & EP_HALTED) { + xhci_ring_cmd_db(xhci); /* for endpoint soft reset command */ } else { /* Otherwise ring the doorbell(s) to restart queued transfers */ ring_doorbell_for_active_rings(xhci, slot_id, ep_index); } + /* If TD we halted on wasn't cancelled give it back with -EPIPE */ + if (halted_td) { + xhci_unmap_td_bounce_buffer(xhci, ep->ring, halted_td); + list_del_init(&halted_td->td_list); + inc_td_cnt(halted_td->urb); + if (last_td_in_urb(halted_td)) + xhci_giveback_urb_in_irq(xhci, halted_td, -EPIPE); + } + /* * Drop the lock and complete the URBs in the cancelled TD list. * New TDs to be cancelled might be added to the end of the list before -- 2.17.1