Hi Hans,

On Tue, Oct 29, 2024 at 10:52:39AM +0100, Hans Verkuil wrote:
> On 10/29/24 09:52, Jacopo Mondi wrote:
> > On Tue, Oct 29, 2024 at 09:17:57AM +0100, Hans Verkuil wrote:
> >> On 28/10/2024 16:52, Laurent Pinchart wrote:
> >>> Hi Hans,
> >>>
> >>> On Mon, Oct 28, 2024 at 12:10:22PM +0100, Hans Verkuil wrote:
> >>>> Hi all,
> >>>>
> >>>> This mail thread uncovered some corner cases around how many buffers should be allocated if VIDIOC_REQBUFS with count = 1 is called:
> >>>>
> >>>> https://lore.kernel.org/linux-media/20241003-rp1-cfe-v6-0-d6762edd98a8@xxxxxxxxxxxxxxxx/T/#mc2210597d92b5a0f09fabdac2f7307128aaa9bd8
> >>>
> >>> I'll repeat below some comments I've made in that thread, as they're better discussed in the context of this RFC.
> >>>
> >>>> When it comes to the minimum number of buffers there are a number of limitations:
> >>>>
> >>>> 1) The DMA engine needs at least N buffers to be queued before it can start. Typically this is 0, 1 or 2, and a driver sets this via the vb2_queue min_queued_buffers field. So if min_queued_buffers = 1, then the DMA engine needs one buffer at all times to DMA to. Allocating just one buffer would mean the DMA engine can never return that buffer to userspace (it would just keep recycling the same buffer over and over), so the minimum must be min_queued_buffers + 1.
> >>>
> >>> I think you're mixing hardware and driver constraints here. Drivers can use scratch buffers to relax the hardware requirements, and allow userspace operation with fewer buffers than strictly required by the hardware.
> >>>
> >>> The cost of allocating such scratch buffers varies depending on the device. When an IOMMU is available, or when the device has a line stride that can be set to 0 and supports race-free programming of the stride and buffer addresses, the scratch buffer can be as small as a single page or a single line. In other cases, a full-frame scratch buffer is required, which is costly, and the decision on whether or not to allocate such a scratch buffer should probably be taken with userspace involved.
> >>
> >> I honestly don't see why you would want to spend a lot of time on adding scratch buffer support just to save a bit of memory. Is the use-case of capturing just a single buffer so common? To me it seems that it only makes sense to spend effort on this if you only need to capture a single buffer and never need to stream more buffers.
> >
> > I can give you two examples I'm currently working with:
> >
> > - A device with a "viewfinder" capture device node and a "still capture" capture device node. We want to queue only one buffer to the "still capture" device when the user requests it (the usual "tap to capture"). Adam, who was in cc on my patch for RkISP1 that removes min_queued_buffers, was struggling to implement "tap to capture" support in his application, as he had to queue 3 buffers before he could capture an image from the "still capture" pipe.
> >
> > - A resource-constrained device that captures only one frame sporadically, because it needs to reduce memory pressure and can't allocate enough buffers to keep the queue populated and sustain the high frame rates produced by the sensor.

A Pi Zero capturing at 1 fps would be an example of this, and it's relatively common.
> > - In libcamera we want the image pipeline to be running even if no buffers are queued to the capture devices at its end. This means that we want statistics to be produced by the ISP and parameters to be consumed even if the frames produced by the ISP are actually discarded to the scratch buffers.
> >
> > We want this because we want the algorithms to keep running even if users are queuing capture buffers sporadically, to keep the recovery period of the 3A algorithms as short as possible (ideally there shouldn't be anything to recover from, as the system is 'live' all the time), so we want stats to be generated for every frame produced by the sensor. And we ideally want this from frame #0, without waiting for users to queue a minimum number of buffers to start the pipeline.
> >
> > I'm sure in robotics/machine vision there are even more advanced use cases for capturing single buffers in response to events from the external world.

Another advantage of scratch buffers managed by the kernel is faster recovery from buffer underruns. Paul has implemented this recently in the imx7-media-csi driver. The DMA engine has two registers that store buffer addresses (let's name them A and B), and automatically switches between them. Both have to be programmed with valid addresses. When an underrun occurs, the kernel uses a single scratch buffer that it programs in both registers. When a new buffer is queued by userspace, the driver will try to program it in the register of the inactive side. For instance, assuming the hardware is capturing to A, the driver programs B. This is racy, as the DMA engine can switch to B at any point. The driver checks if it has lost the race, and if it has, programs A with the same address, discards B when it completes, and returns A to userspace. If it has won the race, it returns B to userspace, avoiding a delay of one frame.

If we were to manage scratch buffers in userspace, this wouldn't be possible. Userspace would need to queue two scratch buffers (likely using the same dmabuf fds), and there would be no fast track option. We could possibly introduce a flag to QBUF to tell the kernel if a buffer should be fast-tracked, flushing the queue as much as possible. That may be better than managing the scratch buffers in the kernel, but the complexity of fast-tracking would still be for the driver to handle.
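In rough pseudo-C, the fast-track idea looks like this. This is only a sketch, not the actual imx7-media-csi code: the csi_dev structure, the reg_read()/reg_write() helpers and the register names are invented, and restoring the scratch address once the buffer completes is omitted.

static void csi_queue_buffer(struct csi_dev *csi, struct vb2_buffer *vb)
{
        dma_addr_t addr = vb2_dma_contig_plane_dma_addr(vb, 0);
        /* The side the DMA engine is currently *not* capturing to. */
        unsigned int idle = reg_read(csi, CSI_ACTIVE_SIDE) ^ 1;

        /* Fast track: program the new buffer on the inactive side. */
        reg_write(csi, CSI_DMA_ADDR(idle), addr);

        /*
         * The DMA engine may have switched to the idle side while we
         * were programming it. If it did, it latched the scratch
         * address for that frame, so program the other side with the
         * buffer as well, and discard the in-flight frame when it
         * completes. Otherwise the buffer completes one frame earlier.
         */
        if (reg_read(csi, CSI_ACTIVE_SIDE) == idle) {
                reg_write(csi, CSI_DMA_ADDR(idle ^ 1), addr);
                csi->discard_next_frame = true;
        }
}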
> Just for the record: I have no problem with drivers implementing scratch buffers so you can leave min_queued_buffers at 0. But I don't think I would want to enforce it for non-ISP drivers, and it certainly won't help existing drivers that set min_queued_buffers to a non-zero value, since those are old and nobody will change those drivers to support scratch buffers.

Don't worry, I won't ask for bttv to implement min_queued_buffers == 0 :-D

Jokes aside, I'm not calling for enforcing min_queued_buffers == 0 globally across all new drivers. It may become a hard requirement for devices that work with libcamera, but certainly not for other devices (even if some would benefit from it and should probably do so, but that's a separate question).

> >> Can you describe the use-case of capturing just a single buffer? Is that just for testing libcamera? Or is it something that happens all the time during normal libcamera operation?
> >>
> >> Supporting scratch buffers is a lot of effort for something that is not needed for normal streaming.
> >>
> >>> min_queued_buffers describes how the device operates from a userspace point of view, so I don't think it should be considered or documented as being a hardware requirement, but a driver requirement.
> >>
> >> It's a hardware and/or driver requirement. It is absolutely not a userspace requirement. Normal userspace applications that use VIDIOC_REQBUFS and just stream video will never notice this.
> >>
> >>>> 2) Historically VIDIOC_REQBUFS is expected to increase the count value to a number that ensures the application can smoothly process the video stream. Typically this will be 3 or 4 (if min_queued_buffers == 2): min_queued_buffers are used by the DMA engine, one buffer is queued up in vb2, ready to be used by the DMA engine as soon as it returns a filled buffer to userspace, and one buffer is processed by userspace.
> >>>>
> >>>> This is to support applications that call VIDIOC_REQBUFS with count = 1 and leave it to the driver to increment it to a workable value.
> >>>
> >>> Do we know what those applications are? I'm not disputing the fact that this may need to be supported to avoid breaking old userspace, but I also think this feature should be phased out for new drivers, especially drivers that require a device-specific userspace and therefore won't work out of the box with old applications.
> >>
> >> xawtv is one: it will call REQBUFS with count = 2 (so this would fail for any driver that sets min_queued_buffers to 2), and with count = 1 if it wants to capture just a single frame.
> >>
> >> 'git grep min_queued_buffers|grep -v videobuf|wc' gives me 83 places where it is set. Some of those are likely wrong (min_queued_buffers has been abused as a replacement for min_reqbufs_allocation), but still that's quite a lot.
> >>
> >> Mostly these are older drivers for hardware without an IOMMU, and typically for SDTV capture. So memory is not a consideration for those drivers, since an SDTV buffer is quite small.
> >>
> >>>> 3) Stateful codecs in particular have additional requirements beyond the DMA engine limits due to the fact that they have to keep track of reference buffers and other codec limitations. As such more buffers are needed, and that number might also vary based on the specific codec used. The V4L2_CID_MIN_BUFFERS_FOR_CAPTURE/OUTPUT controls are used to report that. Support for this is required by the stateful codec API.
> >>>>
> >>>> The documentation of these controls suggests that these are generic controls, but as of today they are only used by stateful codec drivers.
> >>>>
> >>>> 4) Some corner cases (mainly/only SDR, I think) where you need more than the usual 3 or 4 buffers since the buffers arrive at a high frequency.
> >>>
> >>> High frame rates are an important feature, but they're also a can of worms. V4L2 is lacking the ability to batch multiple frames, and we will have to address that. Hopefully it can be decoupled from this RFC.
> >>
> >> It's a separate issue indeed. I just mentioned it because I know SDR drivers use this. They are rarely used, though.
> >>
> >>>> Rather than have drivers try to correct the count value (typically incorrectly), the vb2_queue min_reqbufs_allocation field was added to set the minimum number of buffers that VIDIOC_REQBUFS should allocate if count is less than that.
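As a side note for anyone following along, both limits are plain fields on the vb2 queue. A minimal sketch for a hypothetical capture driver (all foo_* names invented):

static int foo_init_queue(struct foo_dev *foo)
{
        struct vb2_queue *q = &foo->queue;

        q->type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
        q->io_modes = VB2_MMAP | VB2_DMABUF;
        q->drv_priv = foo;
        q->buf_struct_size = sizeof(struct foo_buffer);
        q->ops = &foo_vb2_ops;
        q->mem_ops = &vb2_dma_contig_memops;
        q->timestamp_flags = V4L2_BUF_FLAG_TIMESTAMP_MONOTONIC;
        q->lock = &foo->lock;

        /* The DMA engine can't start before two buffers are queued. */
        q->min_queued_buffers = 2;

        /* VIDIOC_REQBUFS bumps a smaller count up to this value. */
        q->min_reqbufs_allocation = 4;

        return vb2_queue_init(q);
}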
> >>> Even if I dislike this feature, I agree it's better implemented through min_reqbufs_allocation than by manual calculations in drivers.
> >>>
> >>>> VIDIOC_CREATE_BUFS is not affected by that: if you use CREATE_BUFS you take full control of how many buffers you want to create. It might create fewer buffers if you run out of memory, but never more than requested.
> >>>>
> >>>> But what is missing is that if you use CREATE_BUFS you need to know the value of min_queued_buffers + 1, and that is not exposed.
> >>>>
> >>>> I would propose to add a min_num_buffers field to struct v4l2_create_buffers and add a V4L2_BUF_CAP_SUPPORTS_MIN_NUM_BUFFERS flag to signal the presence of that field. And vb2 can set it to min_queued_buffers + 1.
> >>>
> >>> This would require allocating a buffer first to get the value. Wouldn't a read-only control be better?
> >>
> >> No. You can call CREATE_BUFS with count = 0: in that case it does nothing, except filling in all those capabilities. It was designed with that in mind, so you have an ioctl that can return all that information.
> >>
> >>> Furthermore, I would rather provide the min_queued_buffers value instead of min_queued_buffers + 1. The V4L2 API should provide userspace with information it needs to make informed decisions, but not make those decisions on behalf of userspace. It's up to applications to add 1 or more buffers depending on their use case.
> >>
> >> I would definitely want more opinions on this. What's the point of returning min_queued_buffers and then creating that many buffers and still not being able to stream?
> >
> > There are use cases for memory-constrained systems where buffers are only queued sporadically. It might come from the requirement of allocating as few buffers as possible [*], or because processing frames takes a longer time and keeping the buffer queue populated for sustained frame-rate operation would require a lot of buffers to be reserved.
>
> You misunderstood me, sorry for that.
>
> My question was about whether CREATE_BUFS would report 'min_queued_buffers' (so >= 0) or min_queued_buffers + 1 (so >= 1). In the latter case you can pass that value on to REQBUFS. In the first case you would have to add 1 to it yourself if you want to use it with REQBUFS. Personally I think that is very confusing.

I'd go for min_queued_buffers, as the "+ 1" is a policy decision that applications should handle. In many cases, applications will want at least a + 2 if the consumer of the buffers also holds on to one buffer (as is common when displaying captured images). Giving applications min_queued_buffers + 1 and explaining that they may need to add more in some cases seems more confusing to me than giving them min_queued_buffers and explaining that this is the number of buffers the driver will hold on to.
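As a sketch of how an application would use this, assuming the proposed min_num_buffers field and V4L2_BUF_CAP_SUPPORTS_MIN_NUM_BUFFERS flag (neither exists in the current uAPI), and assuming the field reports the raw min_queued_buffers value as I'm arguing for:

static int alloc_capture_buffers(int fd)
{
        struct v4l2_create_buffers create = {
                .count = 0,             /* Query only, allocates nothing. */
                .memory = V4L2_MEMORY_MMAP,
        };
        unsigned int count = 4;         /* Fallback for older kernels. */

        /* CREATE_BUFS needs a valid format, start from the current one. */
        create.format.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
        if (ioctl(fd, VIDIOC_G_FMT, &create.format) < 0)
                return -errno;

        if (ioctl(fd, VIDIOC_CREATE_BUFS, &create) < 0)
                return -errno;

        if (create.capabilities & V4L2_BUF_CAP_SUPPORTS_MIN_NUM_BUFFERS)
                /*
                 * The driver holds on to min_num_buffers: add one so a
                 * buffer can always be dequeued, and one for the consumer
                 * (e.g. a display) to hold.
                 */
                count = create.min_num_buffers + 2;

        create.count = count;
        if (ioctl(fd, VIDIOC_CREATE_BUFS, &create) < 0)
                return -errno;

        return create.count;
}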
> > In general, we can't predict the use cases in which a driver will be used, so informing user-space about the actual requirements without trying to hint at what it should do seems better to me.
> >
> > [*] I understand that allocating a full scratch buffer in the driver kind of goes in the opposite direction of "not wasting memory", but if the DMA engine does not support discarding frames in HW, a single buffer in kernel space avoids a larger allocation in user space.
>
> >> Can you think of a scenario (e.g. in libcamera or elsewhere) where that makes sense?
> >>
> >> Also, will the average V4L2 user have the knowledge to understand that? You have that knowledge, but I think for anyone else it would be really confusing.
> >>
> >>> I think we also need to discuss policies regarding scratch buffer allocation in the context of this RFC. When the hardware supports small scratch buffers, I would like to make it mandatory for drivers to do so and support min_queued_buffers = 0.
> >>
> >> I would first like to know the use-case (as I mentioned above).
> >>
> >> For the type of drivers I mostly work with (video receivers), it would just be a lot of work for no gain. But perhaps for camera pipelines it does make sense?
> >>
> >>> When scratch buffers are expensive, do we want to still support them in the kernel, perhaps in a way controlled by userspace? A userspace that can guarantee it will always provide min_queued_buffers + 1 buffers could indicate so and avoid scratch buffer allocation, while a userspace that can't provide that guarantee would get scratch buffers from the kernel.
> >>
> >> That is really the difference between using VIDIOC_REQBUFS and VIDIOC_CREATE_BUFS. I.e., userspace can already choose this.
> >>
> >> Just to clarify the reason for this RFC: the current situation is messy. There is a lot of history, and a lot of older drivers do not always do the right thing.
> >>
> >> With this RFC I would like to get a consensus on how it should work. After that I want to implement any missing bits and improve the documentation, and finally go through the drivers and at least try to make them behave consistently.
> >
> > my2c: if CREATE_BUFS(0) allows retrieving min_queued_buffers so that userspace can make informed decisions about how many buffers to allocate to at least get streaming going, I would be happier with such an API than with a control.
> >
> > When it comes to scratch buffer usage, I'm not sure we can enforce it as a requirement (or even try to provide some helpers in the core for drivers), but I definitely see use cases for applications queueing buffers sporadically and for drivers being ready to discard frames without stalling or delaying the start of the capture pipeline operation.
>
> >> Also I want to improve v4l2-compliance to test more corner cases, especially if you use CREATE_BUFS instead of REQBUFS (I already have a patch for that ready).
> >>
> >> The work Benjamin did on increasing the max number of supported buffers and the REMOVE_BUFS ioctl uncovered a lot of that messy history, and it is clear we need to try and clarify how it should work.
> >>
> >>>> The second proposal is to explicitly document that the V4L2_CID_MIN_BUFFERS_FOR_CAPTURE/OUTPUT controls are for stateful codec support only, at least for now.
> >>
> >> I just discovered that v4l2-compliance and v4l2-ctl do not honor these controls for stateful codecs. That's something that needs to be fixed.
> >>
> >> There is also one other item that I would like to discuss: the vb2 queue_setup callback is currently used for both REQBUFS and CREATE_BUFS, and it remains confusing for drivers how to use it exactly. I am inclined to redesign that part, most likely splitting it in two: either one callback for REQBUFS and one for CREATE_BUFS, or alternatively, one callback when allocating buffers for the first time (so REQBUFS, and CREATE_BUFS when it is called for the first time, i.e. when no buffers are allocated yet), and one callback when adding additional buffers. I would have to think about this, and probably experiment a bit.
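For context on why this confuses driver authors, the single callback currently serves both ioctls, keyed on whether *num_planes is already set when it is called. A typical implementation (foo_* names invented) looks roughly like this:

static int foo_queue_setup(struct vb2_queue *vq,
                           unsigned int *num_buffers,
                           unsigned int *num_planes,
                           unsigned int sizes[],
                           struct device *alloc_devs[])
{
        struct foo_dev *foo = vb2_get_drv_priv(vq);
        unsigned int size = foo->pix.sizeimage;

        if (*num_planes) {
                /* CREATE_BUFS: only validate the user-provided sizes. */
                if (*num_planes != 1 || sizes[0] < size)
                        return -EINVAL;
                return 0;
        }

        /* REQBUFS: report the plane count and sizes to vb2. */
        *num_planes = 1;
        sizes[0] = size;

        return 0;
}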
> >>>> If this is in place, then min_reqbufs_allocation should be set to a sane number of buffers (i.e. typically 3 or 4), and if you want precise control, use VIDIOC_CREATE_BUFS.

--
Regards,

Laurent Pinchart