On Wed, May 29, 2013 at 11:06 AM, Sarah Sharp <sarah.a.sharp@xxxxxxxxxxxxxxx> wrote: > I really wish you had contacted me about this issue before writing code > to fix it. Now I don't know the backstory behind this issue, which > makes it hard to evaluate whether this fix is correct. No worries... this was a very trivial patch and didn't take long. I won't mind if we settle on a different solution here eventually (even though I think this is the way that makes sense). The issue is very easy to reproduce: just read of a UVC camera (I use Python with "import cv2; dev = cv2.VideoCapture(0); dev.read()") and kill -9 the process. However, you will have to pull in my other patch first ("usb: xhci: Fix Command Ring Stopped Event handling"), or you may run into a follow-up problem with cancelled commands that kills your whole HCD. > On Fri, May 24, 2013 at 06:39:37PM -0700, Julius Werner wrote: >> The XHCI stack usually uses wait_for_completion_interruptible_timeout() >> to wait for the completion of an XHCI command, and treats both timeouts >> and interruptions as errors. This is a bad idea, since these commands >> are often essential for the correct operation of the USB stack, and >> their failure can leave devices in a misconfigured and/or unusable >> state. > > What causes the devices to be "unusable"? If a Configure Endpoint > command fails, the USB core is supposed to try to put the device back > into its original state. Is there a mismatch caused by the command > being interrupted, and if so, how can we fix it, rather than allowing > the kernel to go into an uninterruptible sleep? The problem is that this happens while the USB core is already trying to put the device back into its default/inactive state: the process gets killed, its /dev/video0 file descriptor gets closed, the uvc_v4l2_release() handler runs and eventually calls usb_set_interface() to return the device to alternate setting 0 (back from the "active" alternate setting that it was in while transmitting video). But that alternate setting change requires an XHCI command, which gets immediately cancelled because the SIGKILL signal keeps lingering on the process until it is dead. (To be honest I am not quite sure why the device stays unusable after that... I just get bandwidth exceeded errors when I try to read it again from a new process. Maybe there is another error in the bandwidth management code there, but the problem remains that the UVC driver should be allowed to correctly return the device to its default state even during a SIGKILL.) The other thing to note is that we already use uninterruptible sleeps in many other places -- the USB core does it all the time: most prominently in usb_start_wait_urb(), which is used for device communication (SET_ADDRESS, SET_CONFIGURATION, etc.) during enumeration (presumably to avoid similar problems as this patch). This function is even used for the actual SET_INTERFACE message that is sent to the device when changing alternate settings... so does it make sense to make the (usually very fast and reliable, unless the hardware is completely broken) communication with the xHC interruptible, when for the same operation the much more unreliable communication with the device is not? >> This patch changes that behavior to use uninterruptible waits instead. >> It fixes an easy to reproduce bug that occurs when you kill -9 a >> process that reads from a UVC camera. The UVC driver's release code >> tries to set the camera's alternate setting back to 0, but the lingering >> SIGKILL instantly aborts any Stop Endpoint or Configure Endpoint command >> sent to the xHC. The code dies halfway through the bandwidth allocation >> process, leaving the device in an active configuration and preventing >> future access to it due to the now out-of-sync bandwidth calculation. > > Obviously the command handling needs to be re-worked so this bandwidth > mismatch doesn't happen. What I would like to see instead (but have not > had time to implement) is a global command queue manager. It would have > a queue of commands, similar to what we do for the TD lists, but only > one list per xhci_hcd. The command queue manager would handle > cancellation requests, and properly keep track of whether each command > completed. > > Functions submitting commands would all have a unique completion, and > wait (interruptibly) on that completion. Commands that are interrupted > with a signal should attempt to cancel the command, and wait on the > completion to see if their command was canceled or successfully > completed. If it was successfully completed, it should return success, > rather than -ETIMEOUT. > > I think that change will fix the case where the bandwidth allocation > gets out-of-sync with the USB core, but since you haven't shared the > details of how the code handles being interrupted, I can't tell whether > this is actually a good solution. This sounds like a good idea in general, but I don't think it will fix this problem. The issue is not that command cancellation doesn't work (at least with my other patch, it does), it's just that the USB core cannot do its job when every command gets cancelled immediately due to a lingering signal. The only thing it can do when an add_endpoint() or drop_endpoint() fails is to assume the device/slot is really hosed and leave it in whatever state it is (and even if we solved the bandwidth code so that it could recover from there later, I think leaving it there in the first place is not good behavior). Therefore I think cancelling commands (and thus failing those low-level operations) due to something as trivial as a signal is not a good idea in general (and the rest of the USB core seems to be implemented according to that philosophy as well). -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html