Re: Kernel panic when unbinding UVC gadget function

Avichal Rakesh <arakesh@xxxxxxxxxx> · Fri, 28 Jul 2023 23:41:19 +0530

On Fri, Jul 28, 2023 at 9:07 PM Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> wrote:
>
> On Fri, Jul 28, 2023 at 10:05:39AM -0400, Alan Stern wrote:
> > On Fri, Jul 28, 2023 at 01:57:15PM +0530, Avichal Rakesh wrote:
> > > On Sat, Jul 22, 2023 at 8:57 PM Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> wrote:
> >
> > > > Here's a proposal, along the lines of your first suggestion above.  The
> > > > idea is to avoid holding the connect_lock mutex while invoking a gadget
> > > > driver's callbacks.
> > > >
> > > > Unfortunately, this is unavoidable in the case of the ->disconnect()
> > > > callback, but that's okay because the kerneldoc already says that
> > > > ->disconnect() may be called in_interrupt, so it's not allowed to call
> > > > any core routines that may sleep.  Just to make this perfectly clear,
> > > > the patch adds a couple of comments along these lines.
> > > >
> > > > As far as I can tell, there is no real reason to hold connect_lock
> > > > during the ->unbind() callback.  Not doing so should fix the problem in
> > > > the UVC function driver.
> > >
> > > Thank you for the patch (and apologies for the delay)! This is a neat
> > > fix I completely glossed over. Looked around at other places where
> > > unbind is called, and it looks like the lock isn't held anywhere else
> > > either, so dropping the lock before calling unbind shouldn't break any
> > > existing assumptions around the callback.
> >
> > > Tried the patch, and it fixes the issue for UVC Gadget Driver! UVC
> > > driver can now be unbound without locking up with the V4L2 release
> > > callback. Logs confirm that the calls are still being interleaved, but
> > > don't result in a deadlock now.
> >
> > Thanks for trying it out.  Is it okay for me to add your "Tested-by:"
> > tag when the patch is submitted?

Oh, yes of course!

>
> Another thing...
>
> It's not clear that the patch will fix the problem entirely.  Your
> original analysis of the bug stated:
>
> > This means that attempting to unregister the UVC Gadget Driver results in the
> > V4L2 resource cleanup being stuck behind udc->connect_lock, which will only
> > be released after uvc_function_unbind finishes. This results in either the
> > gadget deactivating after the unbind process has finished, or in a Kernel Panic
> > as it tries to cleanup a V4L2 node that has been purged.
>
> My patch removes the locking issue.  But if an execution path can
> occur with a lock present, it can also occur when the lock has been
> removed.  That means it may still be possible for the UVC gadget driver
> to try deactivating the UDC after the unbind process has finished or for
> it to try cleaning up a V4L2 node that has been purged.
>
> If either of those really could have happened (as opposed to just
> getting stuck in a deadlock, waiting for a mutex that would never be
> released), then it can still happen with the patch.  Fixing them would
> require changes to the UVC gadget driver.  So the problem may not be
> gone entirely.
>
The current situation can theoretically happen without the deadlock,
yes, but shouldn't happen in practice. UVC's disable/unbind code flow
looks as follows:

1. When disable callback is called, the gadget driver signals the
userspace application to close the V4L2 node.
2. Closing the V4L2 node calls the release callback to clean up
resources. It is this callback that calls into gadget_deactivate and
gets blocked currently (without your patch).
3. Separately, the unbind callback waits on release callback to finish
for 500ms, assuming the userspace application to behave well and close
the node in a reasonable amount of time.
4. If the release callback still hasn't been called, the V4L2 node is
forcefully removed and UVC driver waits for another 1000ms for the
release callback to clean up any pending resources.
5. The unbind process continues regardless of the status of release
callback after waiting at most 1.5s for release.

So the only way to run into the current issue is if the release
callback fails to finish in both step 3 and step 4 (for example, due
to a deadlock) in the span of 1.5s. It is possible, but fairly
unlikely (at least in my limited understanding) for the release
callback to be delayed for quite that long.

Hope that makes some sense!

- Avi.