Re: [PATCH] usb: xhci: Use non-interruptible sleep for XHCI commands

Sarah Sharp <sarah.a.sharp@xxxxxxxxxxxxxxx> · Wed, 29 May 2013 11:06:52 -0700

I really wish you had contacted me about this issue before writing code
to fix it.  Now I don't know the backstory behind this issue, which
makes it hard to evaluate whether this fix is correct.

On Fri, May 24, 2013 at 06:39:37PM -0700, Julius Werner wrote:
> The XHCI stack usually uses wait_for_completion_interruptible_timeout()
> to wait for the completion of an XHCI command, and treats both timeouts
> and interruptions as errors. This is a bad idea, since these commands
> are often essential for the correct operation of the USB stack, and
> their failure can leave devices in a misconfigured and/or unusable
> state.

What causes the devices to be "unusable"?  If a Configure Endpoint
command fails, the USB core is supposed to try to put the device back
into its original state.  Is there a mismatch caused by the command
being interrupted, and if so, how can we fix it, rather than allowing
the kernel to go into an uninterruptible sleep?

> While a timeout probably means a real problem that needs to be
> dealt with, a random signal to the controlling user-space process should
> not cause such harsh consequences.

We should deal with the canceled command gracefully, rather than
papering over the issue.

> This patch changes that behavior to use uninterruptible waits instead.
> It fixes an easy to reproduce bug that occurs when you kill -9 a
> process that reads from a UVC camera. The UVC driver's release code
> tries to set the camera's alternate setting back to 0, but the lingering
> SIGKILL instantly aborts any Stop Endpoint or Configure Endpoint command
> sent to the xHC. The code dies halfway through the bandwidth allocation
> process, leaving the device in an active configuration and preventing
> future access to it due to the now out-of-sync bandwidth calculation.

Obviously the command handling needs to be re-worked so this bandwidth
mismatch doesn't happen.  What I would like to see instead (but have not
had time to implement) is a global command queue manager.  It would have
a queue of commands, similar to what we do for the TD lists, but only
one list per xhci_hcd.  The command queue manager would handle
cancellation requests, and properly keep track of whether each command
completed.

Functions submitting commands would all have a unique completion, and
wait (interruptibly) on that completion.  Commands that are interrupted
with a signal should attempt to cancel the command, and wait on the
completion to see if their command was canceled or successfully
completed.  If it was successfully completed, it should return success,
rather than -ETIMEOUT.

I think that change will fix the case where the bandwidth allocation
gets out-of-sync with the USB core, but since you haven't shared the
details of how the code handles being interrupted, I can't tell whether
this is actually a good solution.

Sarah Sharp
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html