Re: Kernel panic when unbinding UVC gadget function

Avichal Rakesh <arakesh@xxxxxxxxxx> · Tue, 1 Aug 2023 00:28:03 +0530

On Sat, Jul 29, 2023 at 2:33 AM Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> wrote:
>
> On Fri, Jul 28, 2023 at 11:41:19PM +0530, Avichal Rakesh wrote:
> > On Fri, Jul 28, 2023 at 9:07 PM Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> wrote:
> > > It's not clear that the patch will fix the problem entirely.  Your
> > > original analysis of the bug stated:
> > >
> > > > This means that attempting to unregister the UVC Gadget Driver results in the
> > > > V4L2 resource cleanup being stuck behind udc->connect_lock, which will only
> > > > be released after uvc_function_unbind finishes. This results in either the
> > > > gadget deactivating after the unbind process has finished, or in a Kernel Panic
> > > > as it tries to cleanup a V4L2 node that has been purged.
> > >
> > > My patch removes the locking issue.  But if an execution path can
> > > occur with a lock present, it can also occur when the lock has been
> > > removed.  That means it may still be possible for the UVC gadget driver
> > > to try deactivating the UDC after the unbind process has finished or for
> > > it to try cleaning up a V4L2 node that has been purged.
> > >
> > > If either of those really could have happened (as opposed to just
> > > getting stuck in a deadlock, waiting for a mutex that would never be
> > > released), then it can still happen with the patch.  Fixing them would
> > > require changes to the UVC gadget driver.  So the problem may not be
> > > gone entirely.
> > >
> > The current situation can theoretically happen without the deadlock,
> > yes, but shouldn't happen in practice. UVC's disable/unbind code flow
> > looks as follows:
> >
> > 1. When disable callback is called, the gadget driver signals the
> > userspace application to close the V4L2 node.
> > 2. Closing the V4L2 node calls the release callback to clean up
> > resources. It is this callback that calls into gadget_deactivate and
> > gets blocked currently (without your patch).
> > 3. Separately, the unbind callback waits on release callback to finish
> > for 500ms, assuming the userspace application to behave well and close
> > the node in a reasonable amount of time.
> > 4. If the release callback still hasn't been called, the V4L2 node is
> > forcefully removed and UVC driver waits for another 1000ms for the
> > release callback to clean up any pending resources.
> > 5. The unbind process continues regardless of the status of release
> > callback after waiting at most 1.5s for release.
> >
> > So the only way to run into the current issue is if the release
> > callback fails to finish in both step 3 and step 4 (for example, due
> > to a deadlock) in the span of 1.5s. It is possible, but fairly
> > unlikely (at least in my limited understanding) for the release
> > callback to be delayed for quite that long.
> >
> > Hope that makes some sense!
>
> Yes, and it shows that there really is a bug in the UVC driver.  In
> kernel programming, fairly unlikely == not impossible == bound to happen
> eventually!
>
> I don't know enough about the driver to make any detailed
> recommendations.  But you might consider, for example, that if the
> unbind routine can get along with forcibly removing the V4L2 node and
> not waiting for the release callback to finish, then why not have it
> behave that way all the time?  In other words, shorten the timeouts from
> 500 ms and 1000 ms to 0 ms.

Forcibly removing the V4L2 node doesn't clean up the associated
resources (for example, the fd held by the userspace application),
so we risk running into kernel panics if the V4L2 node is forcibly
removed without a clean release from the userspace application.

I don't see an easy way to reduce or remove the timeouts entirely,
but I might be missing something simple again. Dan and Laurent,
if you have ideas around this, I am happy to test stuff out!

>
> Whether you do that or not, someone definitely should fix up the release
> routine so that it won't get into trouble if it is called after (or
> concurrently with) all of the cleanup operations -- which is quite
> likely to happen if those timeouts are eliminated!  In particular, it
> shouldn't call gadget_deactivate unless it knows that an unbind hasn't
> happened yet.  And if that is the case, it should block the unbind
> routine until gadget_deactivate returns.  Basically, it's a bug for a
> function driver to call any gadget core routine after its unbind
> callback has returned.
>

This seems like a reasonable check to have. I am out until next week,
but I can test and put up a patch once I get back!

Thank you!

- Avi.