Re: xhci_handle_command_timeout and wait_for_completion

Joe Lawrence <joe.lawrence@xxxxxxxxxxx> · Tue, 10 May 2016 10:04:44 -0400

On 05/09/2016 06:18 AM, Mathias Nyman wrote:
> On 06.05.2016 23:32, Joe Lawrence wrote:
>> ...snip...
>>
>> Given that the default command timeout is 5 seconds, it seems strange to
>> hit a 120 second hung task warning in this instance.  I can only think
>> that maybe something goofy is going on with xhci_handle_command_timeout
>> and an unfortunately timed host controller removal.
>>
>>
> 
> The idea is that when xhci is removed, xhci_mem_cleanup() will take
> care of the pending completions.
> 
> xhci_mem_cleanup()
>   xhci_cleanup_command_queue()
>      list_for_each_entry_safe(cur_cmd, tmp_cmd, &xhci->cmd_list, cmd_list)
>         xhci_complete_del_and_free_cmd(cur_cmd, COMP_CMD_ABORT);
>            if (cmd->completion) {
>               complete(cmd->completion);

xhci_stop will want the xhci->mutex when it calls xhci_mem_cleanup.
What happens if someone like xhci_alloc_dev has that mutex while it's
waiting for TRB_ENABLE_SLOT to complete?
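
To make the concern concrete, here is a minimal userspace sketch of that
lock ordering. It is not driver code, and it simply assumes the scenario
I am asking about: a waiter that holds the mutex across its wait. The
names host_mutex, model_cmd, model_stop() and stop_completes_command()
are made-up stand-ins for xhci->mutex, the queued command, the xhci_stop
path and the scenario driver. The stop path uses pthread_mutex_trylock()
instead of a blocking lock purely so the demo returns rather than hangs.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for a queued command with a pending completion. */
struct model_cmd {
	bool completed;	/* true once the cleanup path has "completed" it */
};

/* Hypothetical stand-in for xhci->mutex. */
static pthread_mutex_t host_mutex = PTHREAD_MUTEX_INITIALIZER;

/*
 * Models the stop path: it needs host_mutex before cleanup can complete
 * the pending command. trylock replaces a blocking lock so that the
 * demo terminates instead of deadlocking.
 */
static void *model_stop(void *arg)
{
	struct model_cmd *cmd = arg;

	if (pthread_mutex_trylock(&host_mutex) != 0)
		return NULL;		/* mutex busy: cleanup never runs */

	cmd->completed = true;		/* models the cleanup completing the command */
	pthread_mutex_unlock(&host_mutex);
	return NULL;
}

/*
 * Returns true if the stop path managed to complete the pending command.
 * waiter_holds_mutex models a caller that sits in its wait with
 * host_mutex held (the assumed scenario, not a confirmed code path).
 */
bool stop_completes_command(bool waiter_holds_mutex)
{
	struct model_cmd cmd = { .completed = false };
	pthread_t stopper;
	int rc;

	if (waiter_holds_mutex)
		pthread_mutex_lock(&host_mutex);

	rc = pthread_create(&stopper, NULL, model_stop, &cmd);
	if (rc == 0) {
		rc = pthread_join(stopper, NULL);
		assert(rc == 0);
		(void)rc;
	}

	if (waiter_holds_mutex)
		pthread_mutex_unlock(&host_mutex);

	/* Also false if the thread could not be created at all. */
	return cmd.completed;
}
```

In this model stop_completes_command(false) returns true, while
stop_completes_command(true) returns false. With a blocking lock in place
of the trylock, the second case would leave the stop path waiting for the
mutex and the waiter waiting for a completion that only the stop path
can deliver.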

> The issue you see might be related to two things. First, as an
> optimization, xhci_mem_cleanup() was called after removing the first hcd.
> Second, when we are removing the secondary hcd we still issue configure
> endpoint commands when endpoints are dropped and xhci_check_bandwidth()
> is called. The configure endpoint commands are not needed at this stage
> if we know that the whole host is being removed.
> 
> Do the following patches resolve the issue?
> 
> (will be in 4.6, removes extra check_bandwidth() call if host is being removed)
> 98d74f9ceaefc2b6c4a6440050163a83be0abede
>     xhci: fix 10 second timeout on removal of PCI hotpluggable xhci controllers
> 
> (will be sent after 4.7-rc1 is out, call xhci_mem_cleanup after second hcd)
> http://marc.info/?l=linux-usb&m=146134802313366&w=2

We had issues before those patches, but none since applying them (the
one destined for 4.7-rc1 solves a very strange MSI cleanup bug for us).
That said, most of our testing is focused on the RHEL7 kernel, so our
upstream testing is limited.

When we hit the lockup from my first mail in RHEL7.2, I figured I would
report what we've been seeing in case the same scenario exists upstream.

Regards,

-- Joe