Hello Mathias, In c311e391a7ef "xhci: rework command timeout and cancellation," xHCI command timeouts were refactored to flow through xhci_handle_command_timeout. We've seen a few instances of hangs/crashes with upstream and RHEL7.2 kernels where someone gets stuck on wait_for_completion(cmd->completion) when the underlying host controller has been hotplug removed. Without wait_for_completion_interruptible_timeout, I was wondering how xhci_handle_command_timeout would handle scenarios that avoid the code inside the ESHUTDOWN case below: void xhci_handle_command_timeout(unsigned long data) { struct xhci_hcd *xhci; int ret; unsigned long flags; u64 hw_ring_state; struct xhci_command *cur_cmd = NULL; xhci = (struct xhci_hcd *) data; /* mark this command to be cancelled */ spin_lock_irqsave(&xhci->lock, flags); if (xhci->current_cmd) { cur_cmd = xhci->current_cmd; cur_cmd->status = COMP_CMD_ABORT; } /* Make sure command ring is running before aborting it */ hw_ring_state = xhci_read_64(xhci, &xhci->op_regs->cmd_ring); if ((xhci->cmd_ring_state & CMD_RING_STATE_RUNNING) && (hw_ring_state & CMD_RING_RUNNING)) { spin_unlock_irqrestore(&xhci->lock, flags); xhci_dbg(xhci, "Command timeout\n"); ret = xhci_abort_cmd_ring(xhci); if (unlikely(ret == -ESHUTDOWN)) { >> xhci_err(xhci, "Abort command ring failed\n"); >> xhci_cleanup_command_queue(xhci); >> usb_hc_died(xhci_to_hcd(xhci)->primary_hcd); >> xhci_dbg(xhci, "xHCI host controller is dead.\n"); } return; } /* command timeout on stopped ring, ring can't be aborted */ xhci_dbg(xhci, "Command timeout on stopped ring\n"); xhci_handle_stopped_cmd_ring(xhci, xhci->current_cmd); spin_unlock_irqrestore(&xhci->lock, flags); return; } When xhci_abort_cmd_ring returns -ESHUTDOWN, it's clear that xhci_cleanup_command_queue will iterate through the cmd_list, calling complete on the cmd->completion variable when necessary. I was uncertain about cases that don't fall into the -ESHUTDOWN path. By what mechanism will they (re)timeout or hit handle_cmd_completion? What happens if the device were to be removed before then? To give you an idea of the kinds of lockups we're seeing, here's an example from our RHEL7.2 machine. PID 4892 wants the dev->mutex that PID 44765 has. The latter is waiting for a command completion and the former grows tired and emits a hung task warning: INFO: task java:4892 blocked for more than 120 seconds. ... PID: 4892 TASK: ffff880852305080 CPU: 42 COMMAND: "java" #0 [ffff880fed2078d0] __schedule at ffffffff8163a78d #1 [ffff880fed207938] schedule_preempt_disabled at ffffffff8163bf09 #2 [ffff880fed207948] __mutex_lock_slowpath at ffffffff81639c05 #3 [ffff880fed2079a8] mutex_lock at ffffffff8163906f #4 [ffff880fed2079c0] usb_disconnect at ffffffff81438957 #5 [ffff880fed207a08] usb_remove_hcd at ffffffff8143e3e1 #6 [ffff880fed207a38] usb_hcd_pci_remove at ffffffff814509ff #7 [ffff880fed207a60] xhci_pci_remove at ffffffff81456901 #8 [ffff880fed207a80] pci_device_remove at ffffffff81328d9b #9 [ffff880fed207aa8] __device_release_driver at ffffffff813f60ef #10 [ffff880fed207ac8] device_release_driver at ffffffff813f6183 #11 [ffff880fed207ae8] pci_stop_bus_device at ffffffff81322104 #12 [ffff880fed207b10] pci_stop_bus_device at ffffffff813220ab #13 [ffff880fed207b38] pci_stop_bus_device at ffffffff813220ab #14 [ffff880fed207b60] pci_stop_and_remove_bus_device at ffffffff813221f2 #15 [ffff880fed207b78] ioboard_bringdown at ffffffffa05eb4c7 [ftmod] #16 [ffff880fed207ba0] ftmod_ioboard_event at ffffffffa05eb7c6 [ftmod] #17 [ffff880fed207c28] io_state_change at ffffffffa05e92d1 [ftmod] #18 [ffff880fed207c78] OsIoStateChange at ffffffffa06ba2e4 [ccmod] #19 [ffff880fed207c90] IoStateChange at ffffffffa06bad18 [ccmod] #20 [ffff880fed207dc0] CcIoBringDownEx at ffffffffa06baee0 [ccmod] #21 [ffff880fed207e20] CcIoBringDownEx at ffffffffa05e54db [ftmod] #22 [ffff880fed207e48] cmd_bringdown at ffffffffa05e68b8 [ftmod] #23 [ffff880fed207e60] ctl_ioctl at ffffffffa05e6c22 [ftmod] #24 [ffff880fed207eb8] do_vfs_ioctl at ffffffff811f1f15 #25 [ffff880fed207f30] sys_ioctl at ffffffff811f2191 #26 [ffff880fed207f80] system_call_fastpath at ffffffff81645e89 RIP: 00007f88860cd257 RSP: 00007f87a0d323d8 RFLAGS: 00010206 RAX: 0000000000000010 RBX: ffffffff81645e89 RCX: 0000000000000000 RDX: 00007f87a0d32480 RSI: 0000000000000000 RDI: 0000000000000021 RBP: 00007f87a0d323d0 R8: 0000000000000031 R9: 0000000000000002 R10: 00007f8728000b72 R11: 0000000000000217 R12: 00007f8758073000 R13: 00007f87a0d369c0 R14: 00000000feac7a48 R15: 0000000000000000 ORIG_RAX: 0000000000000010 CS: 0033 SS: 002b PID: 44765 TASK: ffff8807e215d080 CPU: 3 COMMAND: "kworker/3:0" #0 [ffff880753e639e0] __schedule at ffffffff8163a78d #1 [ffff880753e63a48] schedule at ffffffff8163ae29 #2 [ffff880753e63a58] schedule_timeout at ffffffff81638b19 #3 [ffff880753e63b08] wait_for_completion at ffffffff8163b1f6 #4 [ffff880753e63b68] xhci_configure_endpoint at ffffffff814754ad #5 [ffff880753e63bc0] xhci_check_bandwidth at ffffffff81476106 #6 [ffff880753e63c08] usb_hcd_alloc_bandwidth at ffffffff814401b5 #7 [ffff880753e63c60] usb_disable_device at ffffffff8144345a #8 [ffff880753e63ca0] usb_disconnect at ffffffff814389ad #9 [ffff880753e63ce8] hub_port_connect at ffffffff8143a4e1 #10 [ffff880753e63d68] hub_event at ffffffff8143b5af #11 [ffff880753e63e20] process_one_work at ffffffff8109d5db #12 [ffff880753e63e68] worker_thread at ffffffff8109e3ab #13 [ffff880753e63ec8] kthread at ffffffff810a5acf #14 [ffff880753e63f50] ret_from_fork at ffffffff81645dd8 crash utility claims that 44765 hasn't run in ~2.5 minutes: crash> ps -m 44765 [0 00:02:30.754] [UN] PID: 44765 TASK: ffff8807e215d080 CPU: 3 COMMAND: "kworker/3:0" And xhci_handle_command_timeout is still 4412 jiffies out: crash> timer ... TVEC_BASES[3]: ffff880853884000 JIFFIES 4418029572 EXPIRES TIMER_LIST FUNCTION ... 4418033984 ffff880813a6e0a8 ffffffff8147e0f0 <xhci_handle_command_timeout> ... Given that the default command timeout is 5 seconds, it seems strange to hit a 120 second hung task warning in this instance. I can only think that maybe something goofy is going on with xhci_handle_command_timeout and an unfortunately timed host controller removal. Regards, -- Joe -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html