The TLDR version: we need to turn command TRBs into a no-op command, even if the command dequeue pointer points to them on a command ring stop completion event. When the xHCI host controller is in the middle of a command, and software writes the command abort bit in the CRCR register, the host will do four things: 1. Abort the command and stop processing the command ring 2. Queue a completion event with the completion code set to Command Abort to the event ring 3. Move the command ring dequeue pointer to the next TRB 4. Queue a completion event with the completion code set to Command Ring Stop to the event ring. However, if the host was in between processing TRBs, it will not do step two. The requirement in step three is ambiguous. If the host was between command TRB processing, it may stop on the command we wished to cancel, or it may move past that command, depending on the internal host controller management of the dequeue pointer. The current code does not handle this corner case very well: static void handle_cmd_completion(struct xhci_hcd *xhci, struct xhci_event_cmd *event) { ... if ((GET_COMP_CODE(le32_to_cpu(event->status)) == COMP_CMD_ABORT) || (GET_COMP_CODE(le32_to_cpu(event->status)) == COMP_CMD_STOP)) { /* If the return value is 0, we think the trb pointed by * command ring dequeue pointer is a good trb. The good * trb means we don't want to cancel the trb, but it have * been stopped by host. So we should handle it normally. * Otherwise, driver should invoke inc_deq() and return. */ if (handle_stopped_cmd_ring(xhci, GET_COMP_CODE(le32_to_cpu(event->status)))) { inc_deq(xhci, xhci->cmd_ring); return; } } ... } static int handle_stopped_cmd_ring(struct xhci_hcd *xhci, int cmd_trb_comp_code) { int cur_trb_is_good = 0; /* Searching the cmd trb pointed by the command ring dequeue * pointer in command descriptor list. If it is found, free it. */ cur_trb_is_good = xhci_search_cmd_trb_in_cd_list(xhci, xhci->cmd_ring->dequeue); if (cmd_trb_comp_code == COMP_CMD_ABORT) xhci->cmd_ring_state = CMD_RING_STATE_STOPPED; else if (cmd_trb_comp_code == COMP_CMD_STOP) { ... if (xhci->cmd_ring->dequeue != xhci->cmd_ring->enqueue) xhci_ring_cmd_db(xhci); } return cur_trb_is_good; } In this corner case, the first completion event we get has the command completion code set to COMP_CMD_STOP. The command ring dequeue pointer in the hardware points to the command we wished to cancel (that the hardware has not processed yet). handle_stopped_cmd_ring() will return true, since xhci_search_cmd_trb_in_cd_list() just canceled the command at the command ring dequeue pointer. However, xhci_search_cmd_trb_in_cd_list() does not turn the command into a no-op command. handle_stopped_cmd_ring() goes on to ring the doorbell, which will cause the host to start processing commands. When that function returns, we'll increment the software's dequeue pointer, which is no longer in sync with the hardware dequeue pointer. This could produce some pretty disastrous results, if the hardware tries to re-fetch DMA memory that has been freed. As is, with a few modifications to the xHCI driver to not ring the command doorbell after a Set Address command, the host hardware re-executes the canceled Set Address, causing the next Set Address command to fail: Mar 7 12:01:47 xanatos kernel: [ 126.442538] xhci_hcd 0000:00:14.0: TEST, TEST! Stalling Set Address for 60 seconds. Mar 7 12:01:47 xanatos kernel: [ 126.498291] xhci_hcd 0000:00:14.0: xhci_hub_status_data: stopping port polling. Mar 7 12:01:49 xanatos kernel: [ 128.263795] Initializing USB Mass Storage driver... Mar 7 12:01:49 xanatos kernel: [ 128.263874] usb-storage 2-1:1.0: usb_probe_interface Mar 7 12:01:49 xanatos kernel: [ 128.263881] usb-storage 2-1:1.0: usb_probe_interface - got id Mar 7 12:01:49 xanatos kernel: [ 128.263987] xhci_hcd 0000:00:14.0: Set up evaluate context for LPM MEL change. ... Mar 7 12:01:54 xanatos kernel: [ 133.253622] xhci_hcd 0000:00:14.0: Timeout while waiting for evaluate context command, retry ... Mar 7 12:02:21 xanatos kernel: [ 160.027546] xhci_hcd 0000:00:14.0: Command ring: Mar 7 12:02:21 xanatos kernel: [ 160.027547] xhci_hcd 0000:00:14.0: @00000000ab5ac000 00000000 00000000 00000000 00002401 Mar 7 12:02:21 xanatos kernel: [ 160.027549] xhci_hcd 0000:00:14.0: @00000000ab5ac010 b7796000 00000000 00000000 01002c01 Mar 7 12:02:21 xanatos kernel: [ 160.027551] xhci_hcd 0000:00:14.0: @00000000ab5ac020 b7796000 00000000 00000000 01003001 Mar 7 12:02:21 xanatos kernel: [ 160.027553] xhci_hcd 0000:00:14.0: @00000000ab5ac030 ab5ab000 00000000 00000000 01003401 Mar 7 12:02:21 xanatos kernel: [ 160.027554] xhci_hcd 0000:00:14.0: @00000000ab5ac040 ab5ab000 00000000 00000000 01003401 Mar 7 12:02:21 xanatos kernel: [ 160.027556] xhci_hcd 0000:00:14.0: @00000000ab5ac050 00000000 00000000 00000000 00002401 Mar 7 12:02:21 xanatos kernel: [ 160.027558] xhci_hcd 0000:00:14.0: @00000000ab5ac060 b7636000 00000000 00000000 02002c01 <--- set address command Mar 7 12:02:21 xanatos kernel: [ 160.027559] xhci_hcd 0000:00:14.0: @00000000ab5ac070 ab5ab000 00000000 00000000 01003401 <--- eval context command ... Mar 7 12:02:47 xanatos kernel: [ 186.329990] xhci_hcd 0000:00:14.0: Timeout while waiting for address device command Mar 7 12:02:47 xanatos kernel: [ 186.329996] xhci_hcd 0000:00:14.0: Abort command ring Mar 7 12:02:47 xanatos kernel: [ 186.330013] xhci_hcd 0000:00:14.0: Handle stopped command ring, dequeue = 00000000ab5ac060. Mar 7 12:02:47 xanatos kernel: [ 186.330016] xhci_hcd 0000:00:14.0: xhci_search_cmd_trb_in_cd_list canceling command 00000000ab5ac060 Mar 7 12:02:47 xanatos kernel: [ 186.330018] xhci_hcd 0000:00:14.0: No more commands to cancel Mar 7 12:02:47 xanatos kernel: [ 186.330019] xhci_hcd 0000:00:14.0: // Ding dong! Mar 7 12:02:47 xanatos kernel: [ 186.330086] xhci_hcd 0000:00:14.0: Bad command completion for 00000000ab5ac060, dequeue at 00000000ab5ac070. The last line is where we got a completion event for the command we were trying to cancel. Next, the host starts failing subsequent Set Address commands because its internal state shows the device is already addressed: Mar 7 12:02:47 xanatos kernel: [ 186.533706] xhci_hcd 0000:00:14.0: // Ding dong! Mar 7 12:02:47 xanatos kernel: [ 186.533719] xhci_hcd 0000:00:14.0: Set Address command completion for 00000000ab5ac0a0, comp code 19. Mar 7 12:02:47 xanatos kernel: [ 186.533734] xhci_hcd 0000:00:14.0: Setup ERROR: address device command for slot 2. Fix this by turning commands into no-ops when we cancel them in xhci_search_cmd_trb_in_cd_list(). It's okay if software moves its internal dequeue pointer past the command, but the hardware dequeue pointer is on that no-op'ed command, because we'll just ignore the no-op command completion. This patch should be backported to kernels as old as 3.0, that contain the commit b63f4053cc8aa22a98e3f9a97845afe6c15d0a0d "xHCI: handle command after aborting the command ring" Signed-off-by: Sarah Sharp <sarah.a.sharp@xxxxxxxxxxxxxxx> Cc: stable@xxxxxxxxxxxxxxx --- drivers/usb/host/xhci-ring.c | 1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c index 79ec0be..d59eef6 100644 --- a/drivers/usb/host/xhci-ring.c +++ b/drivers/usb/host/xhci-ring.c @@ -1313,6 +1313,7 @@ static int xhci_search_cmd_trb_in_cd_list(struct xhci_hcd *xhci, list_for_each_entry_safe(cur_cd, next_cd, &xhci->cancel_cmd_list, cancel_cmd_list) { if (cur_cd->cmd_trb == cmd_trb) { + xhci_cmd_to_noop(xhci, cur_cd); if (cur_cd->command) xhci_complete_cmd_in_cmd_wait_list(xhci, cur_cd->command, COMP_CMD_STOP); -- 1.7.9 -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html