Re: [PATCH v2 2/4] media: uvcvideo: Do not set an async control owned by other fh

Ricardo Ribalda <ribalda@xxxxxxxxxxxx> · Mon, 2 Dec 2024 11:11:22 +0100

On Mon, 2 Dec 2024 at 09:38, Laurent Pinchart
<laurent.pinchart@xxxxxxxxxxxxxxxx> wrote:
>
> On Mon, Dec 02, 2024 at 08:28:48AM +0100, Ricardo Ribalda wrote:
> > On Mon, 2 Dec 2024 at 01:19, Laurent Pinchart
> > <laurent.pinchart@xxxxxxxxxxxxxxxx> wrote:
> > >
> > > On Fri, Nov 29, 2024 at 11:18:54PM +0100, Ricardo Ribalda wrote:
> > > > On Fri, 29 Nov 2024 at 23:03, Laurent Pinchart wrote:
> > > > > On Fri, Nov 29, 2024 at 07:47:31PM +0100, Ricardo Ribalda wrote:
> > > > > > Before we all go on a well deserved weekend, let me recap what we
> > > > > > know. If I did not get something correctly, let me know.
> > > > > >
> > > > > > 1) Well behaved devices do not allow to set or get an incomplete async
> > > > > > control. They will stall instead (ref: Figure 2-21 in UVC 1.5 )
> > > > > > 2) Both Laurent and Ricardo consider that there is a big chance that
> > > > > > some camera modules do not implement this properly. (ref: years of
> > > > > > crying over broken module firmware :) )
> > > > > >
> > > > > > 3) ctrl->handle is designed to point to the fh that originated the
> > > > > > control. So the logic can decide if the originator needs to be
> > > > > > notified or not. (ref: uvc_ctrl_send_event() )
> > > > > > 4) Right now we replace the originator in ctrl->handle for unfinished
> > > > > > async controls.  (ref:
> > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/media/usb/uvc/uvc_ctrl.c#n2050)
> > > > > >
> > > > > > My interpretation is that:
> > > > > > A) We need to change 4). We shall not change the originator of
> > > > > > unfinished ctrl->handle.
> > > > > > B) Well behaved cameras do not need the patch "Do not set an async
> > > > > > control owned by another fh"
> > > > > > C) For badly behaved cameras, it is fine if we slightly break the
> > > > > > v4l2-compliance in corner cases, if we do not break any internal data
> > > > > > structure.
> > > > >
> > > > > The fact that some devices may not implement the documented behaviour
> > > > > correctly may not be a problem. Well-behaved devices will stall, which
> > > > > means we shouldn't query the device while as async update is in
> > > > > progress. Badly-behaved devices, whatever they do when queried, should
> > > > > not cause any issue if we don't query them.
> > > >
> > > > I thought we could detect the stall and return safely. Isn't that the case?
> > >
> > > We could, but if we know the device will stall anyway, is there a reason
> > > not to avoid issuing the request in the first place ?
> >
> > Because there are always going to be devices that do not send the
> > event when the control has finished.
> >
> > It is impossible to know the state of the device: Is the zoom still
> > moving or the device is not compliant?
>
> Another type of broken behaviour would be devices that don't correctly
> handle UVC_GET and UVC_CUR requests while an async update is in
> progress. If you issue those requests, those devices will break. Are
> they more or less important than devices that don't send an event ?
>
> All of those are partly theoretical problems. I'd rather keep it simple
> for now until we get feedback about broken devices.
>
> > > > Why we have not seen issues with this?
> > >
> > > I haven't tested a PTZ device for a very long time, and you would need
> > > to hit a small time window to see the issue.
> >
> > Not that small, some devices take up to 10 seconds to go from the
> > smallest zoom to the biggest zoom.
> > I'd say that v4l2-ctl has a very big chance to hit any error.
> >
> > BTW, homing can take an even longer time.
> >
> > > > > We should not send GET_CUR and SET_CUR requests to the device while an
> > > > > async update is in progress, and use cached values instead. When we
> > > > > receive the async update event, we should clear the cache. This will be
> > > > > the same for both well-behaved and badly-behaved devices, so we can
> > > > > expose the same behaviour towards userspace.
> > > >
> > > > seting ctrl->loaded = 0 when we get an event sounds like a good idea
> > > > and something we can implement right away.
> > > > If I have to resend the set I will add it to the end.
> > > >
> > > > > We possibly also need some kind of timeout mechanism to cope with the
> > > > > async update event not being delivered by the device.
> > > >
> > > > This is the part that worries me the most:
> > > > - timeouts make the code fragile
> > > > - What is a good value for timeout? 1 second, 30, 300? I do not think
> > > > that we can find a value.
> > >
> > > I've been thinking about the implementation of uvc_fh cleanup over the
> > > weekend, and having a timeout would have the nice advantage that we
> > > could reference-count uvc_fh instead of implementing a cleanup that
> > > walks over all controls when closing a file handle. I think it would
> > > make the code simpler, and possibly safer too.
> >
> > The problem with a timeout is:
> > - We do not know what is the right value for the timeout.
> > - We need to have one timeout per control, or implement a timeout
> > dispatcher mechanism, and that is racy by nature
> > - It will require introducing a new locking mechanism to avoid races
> > between the timeout handler and the event completion
>
> Timeouts don't come for free, that's for sure.
>
> > - It introduces a new lifetime in the driver... I'd argue that it is
> > best to have less, not more.
>
> That I disagree with, my experience with V4L2 (as well as with DRM, or
> actually the kernel in general) is that we would be in a much better
> place today if object lifetimes were handled with reference counting
> more widely.
>
> > I do not see many benefits....
>
> The main benefit is simplifying the close() implementation and
> complexity, as well as lowering lock contention (whether the latter
> brings a real advantage in practice, I haven't checked)..
>
> > What we can introduce on top of my set is something like this (just
> > pseudocode, do not scream at me :P)  That code can prevent stalls and
> > will work with misbehaving hardware.... (But I still do not know a
> > good value for timeout) and solve some of the issues that I mention.
> >
> > diff --git a/drivers/media/usb/uvc/uvc_ctrl.c b/drivers/media/usb/uvc/uvc_ctrl.c
> > index f0e8a436a306..a55bd4b3ac07 100644
> > --- a/drivers/media/usb/uvc/uvc_ctrl.c
> > +++ b/drivers/media/usb/uvc/uvc_ctrl.c
> > @@ -1132,6 +1132,9 @@ static int __uvc_ctrl_get(struct uvc_video_chain *chain,
> >         if ((ctrl->info.flags & UVC_CTRL_FLAG_GET_CUR) == 0)
> >                 return -EACCES;
> >
> > +       if (ctrl->handle && ( NOW - ctrl->async_start_time ) < TIMEOUT)
> > +               return -EBUSY;
> > +
> >         ret = __uvc_ctrl_load_cur(chain, ctrl);
> >         if (ret < 0)
> >                 return ret;
> > @@ -1591,6 +1594,7 @@ static int uvc_ctrl_get_handle(struct uvc_fh
> > *handle, struct uvc_control *ctrl)
> >                 return -EBUSY;
> >
> >         ctrl->handle = handle;
> > +       ctrl->async_start_time = NOW;
> >         handle->pending_async_ctrls++;
> >         return 0;
> >  }
> > @@ -1982,6 +1986,9 @@ int uvc_ctrl_set(struct uvc_fh *handle,
> >         if (!(ctrl->info.flags & UVC_CTRL_FLAG_SET_CUR))
> >                 return -EACCES;
> >
> > +       if (ctrl->handle && ( NOW - ctrl->async_start_time ) < TIMEOUT)
> > +               return -EBUSY;
>
> If I understand you correctly, this is meant to prevent accessing the
> device for an initial TIMEOUT period, based on the knowledge it will
> very likely STALL. After the TIMEOUT, if the async set is still not
> complete, SET_CUR will be issued, and a STALL will likely occur.
>
> If we device to standardize on -EBUSY while the async set is in
> progress, we need to do so before and after the timeout.
> uvc_query_ctrl() will return -EBUSY in case of a STALL if the device
> reports the "Not ready" error. This is what I expect a device to report
> in this case. Of course the UVC specification, in section 2.4.4,
> documents that the device will update the request error code control,
> but doesn't tell which value should be used :-S I suppose we'll handle
> broken devices if the need arise.
>
> > +
> >         /* Clamp out of range values. */
> >         switch (mapping->v4l2_type) {
> >         case V4L2_CTRL_TYPE_INTEGER:
> > diff --git a/drivers/media/usb/uvc/uvcvideo.h b/drivers/media/usb/uvc/uvcvideo.h
> > index e0e4f099a210..5c82fae94dff 100644
> > --- a/drivers/media/usb/uvc/uvcvideo.h
> > +++ b/drivers/media/usb/uvc/uvcvideo.h
> > @@ -154,6 +154,8 @@ struct uvc_control {
> >                                  * File handle that initially changed the
> >                                  * async control.
> >                                  */
> > +
> > +       TIME async_start_time;
> >  };
> >
> >
> > In any case. Can we solve this in incremental steps? I think the
> > current patch seems simple and correct. We can introduce the timeout
> > later.
>
> The main reason for a timeout was to replace walking over all controls
> at close() time with reference counting of uvc_fh, which is an
> alternative approach to the current patch. Given that picking a good
> timeout value is difficult, and that the current implementation already
> skips the walk in the most common case when no async set is pending, I
> suppose we could start with that.

Cool. I am going to send a new version of:
https://patchwork.linuxtv.org/project/linux-media/patch/20241129-uvc-fix-async-v4-1-f23784dba80f@xxxxxxxxxxxx/

using  uvc_ctrl_get_handle / uvc_ctrl_put_handle ad lockdep annotations.

Hopefully in the future we can get rid of: uvc_ctrl_cleanup_fh()

Thanks!
>
> > > > > Regarding the userspace behaviour during an auto-update, we have
> > > > > multiple options:
> > > > >
> > > > > For control get,
> > > > >
> > > > > - We can return -EBUSY
> > > > > - We can return the old value from the cache
> > > > > - We can return the new value fromt he cache
> > > > >
> > > > > Returning -EBUSY would be simpler to implement.
> > > >
> > > > Not only easy, I think it is the most correct,
> > > >
> > > > > I don't think the behaviour should depend on whether the control is read
> > > > > on the file handle that initiated the async operation or on a different
> > > > > file handle.
> > > > >
> > > > > For control set, I don't think we can do much else than returning
> > > > > -EBUSY, regardless of which file handle the control is set on.
> > > >
> > > > ACK.
> > > >
> > > > > > I will send a new version with my interpretation.
> > > > > >
> > > > > > Thanks for a great discussion
> > > >
> > > > Looking with some perspective... I believe that we should look into
> > > > the "userspace behaviour for auto controls" in a different patchset.
> > > > It is slightly unrelated to this discussion.
>
> --
> Regards,
>
> Laurent Pinchart

-- 
Ricardo Ribalda