Hi Hans, On Tue, Oct 29, 2024 at 11:04:43AM +0100, Hans Verkuil wrote: > On 10/29/24 10:04, Hans Verkuil wrote: > > On 29/10/2024 09:17, Hans Verkuil wrote: > >> On 28/10/2024 16:52, Laurent Pinchart wrote: > >>> On Mon, Oct 28, 2024 at 12:10:22PM +0100, Hans Verkuil wrote: > >>>> Hi all, > >>>> > >>>> This mail thread uncovered some corner cases around how many buffers should be allocated > >>>> if VIDIOC_REQBUFS with count = 1 is called: > >>>> > >>>> https://lore.kernel.org/linux-media/20241003-rp1-cfe-v6-0-d6762edd98a8@xxxxxxxxxxxxxxxx/T/#mc2210597d92b5a0f09fabdac2f7307128aaa9bd8 > >>> > >>> I'll repeat below some comments I've made in that thread, as they're > >>> better discussed in the context of this RFC. > >>> > >>>> When it comes to the minimum number of buffers there are a number of limitations: > >>>> > >>>> 1) The DMA engine needs at least N buffers to be queued before it can start. Typically > >>>> this is 0, 1 or 2, and a driver sets this via the vb2_queue min_queued_buffers field. > >>>> So if min_queued_buffers = 1, then the DMA engine needs one buffer at all times to > >>>> DMA to. Allocating just one buffer would mean the DMA engine can never return that > >>>> buffer to userspace (it would just keep recycling the same buffer over and over), so > >>>> the minimum must be min_queued_buffers + 1. > >>> > >>> I think you're mixing hardware and driver constraints here. Drivers can > >>> use scratch buffers to relax the hardware requirements, and allow > >>> userspace operation with less buffers than strictly required by the > >>> hardware. > >>> > >>> The cost of allocating such scratch buffers vary depending on the > >>> device. When an IOMMU is available, or when the device has a line stride > >>> that can be set to 0 and supports race-free programming of the stride > >>> and buffer addresses, the scratch buffer can be as small as a single > >>> page or a single line. In other cases, a full-frame scratch buffer is > >>> required, which is costly, and the decision on whether or not to > >>> allocate such a scratch buffer should probably be taken with userspace > >>> being involved. > >> > >> I honestly don't see why you would want to spend a lot of time on adding > >> scratch buffer support just to save a bit of memory. Is the use-case of > >> capturing just a single buffer so common? To me it seems that it only > >> makes sense to spend effort on this if you only need to capture a single > >> buffer and never need to stream more buffers. > >> > >> Can you describe the use-case of capturing just a single buffer? Is that > >> just for testing libcamera? Or is it something that happens all the time > >> during normal libcamera operation? > >> > >> Supporting scratch buffers is a lot of effort for something that is not > >> needed for normal streaming. > >> > >>> > >>> min_queued_buffers describes how the device operates from a userspace > >>> point of view, so I don't think it should be considered or documented as > >>> being a hardware requirement, but a driver requirement. > >> > >> It's a hardware and/or driver requirement. It is absolutely not a userspace > >> requirement. Normal userspace applications that use VIDIOC_REQBUFS and just > >> stream video will never notice this. > >> > >>> > >>>> 2) Historically VIDIOC_REQBUFS is expected to increase the count value to a number that > >>>> ensures the application can smoothly process the video stream. Typically this will > >>>> be 3 or 4 (if min_queued_buffers == 2): min_queued_buffers are used by the DMA engine, > >>>> one buffer is queued up in vb2, ready to be used by the DMA engine as soon as it > >>>> returns a filled buffer to userspace, and one buffer is processed by userspace. > >>>> > >>>> This is to support applications that call VIDIOC_REQBUFS with count = 1 and leave it > >>>> to the driver to increment it to a workable value. > >>> > >>> Do we know what those applications are ? I'm not disputing the fact that > >>> this may need to be supported to avoid breaking old userspace, but I > >>> also think this feature should be phased out for new drivers, especially > >>> drivers that require a device-specific userspace and therefore won't > >>> work out of the box with old applications. > >> > >> xawtv is one: it will call REQBUFS with count = 2 (so this would fail for > >> any driver that sets min_queued_buffers to 2), and with count = 1 if it wants > >> to capture just a single frame. > >> > >> 'git grep min_queued_buffers|grep -v videobuf|wc' gives me 83 places where it is > >> set. Some of those are likely wrong (min_queued_buffers has been abused as a > >> replacement for min_reqbufs_allocation), but still that's quite a lot. > >> > >> Mostly these are older drivers for hardware without an IOMMU and typically for > >> SDTV capture. So memory is not a consideration for those drivers since a > >> SDTV buffer is quite small. > >> > >>> > >>>> 3) Stateful codecs in particular have additional requirements beyond the DMA engine > >>>> limits due to the fact that they have to keep track of reference buffers and other > >>>> codec limitations. As such more buffers are needed, and that number might also vary > >>>> based on the specific codec used. The V4L2_CID_MIN_BUFFERS_FOR_CAPTURE/OUTPUT > >>>> controls are used to report that. Support for this is required by the stateful codec > >>>> API. > >>>> > >>>> The documentation of these controls suggest that these are generic controls, but > >>>> as of today they are only used by stateful codec drivers. > >>>> > >>>> 4) Some corner cases (mainly/only SDR, I think) where you need more than the usual > >>>> 3 or 4 buffers since the buffers arrive at a high frequency. > >>> > >>> High frame rates is an important feature, but it's also a can of worms. > >>> V4L2 is lacking the ability to batch multiple frames, we will have to > >>> address that. Hopefully it could be decoupled from this RFC. > >> > >> It's a separate issue indeed. I just mentioned it because I know SDR drivers > >> use this. They are rarely used, though. > >> > >>>> Rather than have drivers try to correct the count value (typically incorrectly), the > >>>> vb2_queue min_reqbufs_allocation field was added to set the minimum number of > >>>> buffers that VIDIOC_REQBUFS should allocate if count is less than that. > >>> > >>> Even if I dislike this feature, I agree it's better implemented through > >>> min_reqbufs_allocation than by manual calculations in drivers. > >>> > >>>> VIDIOC_CREATE_BUFS is not affected by that: if you use CREATE_BUFS you take full control > >>>> of how many buffers you want to create. It might create fewer buffers if you run out of > >>>> memory, but never more than requested. > >>>> > >>>> But what is missing is that if you use CREATE_BUFS you need to know the value of > >>>> min_queued_buffers + 1, and that is not exposed. > >>>> > >>>> I would propose to add a min_num_buffers field to struct v4l2_create_buffers > >>>> and add a V4L2_BUF_CAP_SUPPORTS_MIN_NUM_BUFFERS flag to signal the presence of > >>>> that field. And vb2 can set it to min_queued_buffers + 1. > >>> > >>> This would require allocating a buffer first to get the value. Wouldn't > >>> a read-only control be better ? > >> > >> No. You can call CREATE_BUFS with count = 0: in that case it does nothing, > >> except filling in all those capabilities. It was designed with that in mind > >> so you have an ioctl that can return all that information. > >> > >>> Furthermore, I would rather provide the min_queued_buffers value instead > >>> of min_queued_buffers + 1. The V4L2 API should provide userspace with > >>> information it needs to make informed decisions, but not make those > >>> decisions in behalf of userspace. It's up to applications to add 1 or > >>> more buffers depending on their use case. > >> > >> I would definitely want more opinions on this. What's the point of returning > >> min_queued_buffers and then creating that many buffers and still not be able > >> to stream? > >> > >> Can you think of a scenario (e.g. in libcamera or elsewhere) where that makes > >> sense? > >> > >> Also, will the average V4L2 user have the knowledge to understand that? You > >> have that knowledge, but I think for anyone else it would be really confusing. > >> > >>> I think we also need to discuss policies regarding scratch buffer > >>> allocation in the context of this RFC. When the hardware supports small > >>> scratch buffers, I would like to make it mandatory for drivers to do so > >>> and support min_queued_buffers = 0. > >> > >> I would first like to know the use-case (as I mentioned above). > >> > >> For the type of drivers I mostly work with (video receivers), it would just > >> be a lot of work for no gain. But perhaps for camera pipelines it does make > >> sense? > >> > >>> When scratch buffers are expensive, do we want to still support them in > >>> the kernel, perhaps in a way controlled by userspace ? A userspace that > >>> can guarantee it will always provide min_queued_buffers + 1 buffers > >>> could indicate so and avoid scratch buffer allocation, while a userspace > >>> that can't provide that guarantee would get scratch buffers from the > >>> kernel. > >> > >> That is really the difference between using VIDIOC_REQBUFS and VIDIOC_CREATE_BUFS. > >> I.e., userspace can already choose this. > >> > >> Just to clarify the reason for this RFC: the current situation is messy. There > >> is a lot of history and a lot of older drivers do not always do the right thing. > >> > >> With this RFC I would like to get a consensus of how it should work. After that > >> I want to implement any missing bits and improve the documentation, and finally > >> go through the drivers and at least try to make them behave consistently. > >> > >> Also I want to improve v4l2-compliance to test more corner cases, especially > >> if you use CREATE_BUFS instead of REQBUFS (I already have a patch for that > >> ready). > >> > >> The work Benjamin did on increasing the max number of supported buffers and the > >> REMOVE_BUFS ioctl uncovered a lot of that messy history, and it is clear we need > >> to try and clarify how it should work. > >> > >>>> The second proposal is to explicitly document that the V4L2_CID_MIN_BUFFERS_FOR_CAPTURE/OUTPUT > >>>> are for stateful codec support only, at least for now. > >> > >> I just discovered that v4l2-compliance and v4l2-ctl do not honor these controls > >> for stateful codecs. That's something that needs to be fixed. > >> > >> There is also one other item that I would like to discuss: the vb2 queue_setup > >> callback is currently used for both REQBUFS and CREATE_BUFS, and it remains > >> confusing for drivers how to use it exactly. I am inclined to redesign that > >> part, most likely splitting it in two: either one callback for REQBUFS and one > >> for CREATE_BUFS, or alternatively, one callback when allocating buffers for > >> the first time (so REQBUFS and when CREATE_BUFS is called for the first time, > >> i.e. when no buffers are allocated yet), and one callback when adding additional > >> buffers. I would have to think about this, and probably experiment a bit. > > What vb2 needs is information about the buffers based on the current pixel format: > i.e. the number of planes and the size of each plane. Currently it relies on > queue_setup to figure that out and either set or verify the sizes. > > It might be much easier if there is a callback for drivers to provide that information, > and let vb2 handle the actual REQBUFS/CREATE_BUFS handling. Every driver basically > does the same tests (or is supposed to, at least). Moving that to vb2 should simplify > drivers. I like the sound of this. > > Actually, this really has to be addressed since this is broken: you can call > > CREATE_BUFS as a replacement for REQBUFS, but it will act like REQBUFS and > > the requested sizes are not honored. > > > > I added tests for this to v4l2-compliance (locally only), and it fails on > > everything. At least it's consistent :-D > > It should not be news to anyone that I hate the CREATE_BUFS ioctl API. I posted > > an RFC for a VIDIOC_ADD_BUFS replacement earlier this year: > > > > https://lore.kernel.org/linux-media/243a66ad-6dff-4a43-ab03-e01d1038fe8a@xxxxxxxxx/ > > > > I wonder if we should restrict CREATE_BUFS to only be used after calling > > REQBUFS, and to a proper job for ADD_BUFS. Because given the vb2 design flaw > > I am not sure if it can be worked around. Or if we even want that. > > > > What a mess. > > I had a brainwave and I think this can be solved in the vb2 framework. This > will require that a new V4L2_BUF_CAP_ flag is introduced to indicate that > CREATE_BUFS does the right thing when called without calling REQBUFS first. > > I don't dare restrict CREATE_BUFS as I suggested above: this ioctl has been > around for over a decade and I have no idea how it is used 'in the field'. We don't have to restrict it, we could consider it deprecated, avoid touching it, and focus on a new VIDIOC_ADD_BUFS. Nothing would break if we keep the current VIDIOC_CREATE_BUFS behaviour, even if it's badly designed and/or implemented. That being said, I don't oppose a flag, as long as there's a way for userspace to determine if the flag is supported. > I think I will work on the issues described in this RFC tomorrow and see if > I can post an RFC patch series so we can continue the discussion based on > actual code. Thank you for working on this. > > Looking at the kernel history, CREATE_BUFS was added back in 2011 and the first > > very simple v4l2-compliance tests were added in 2012. > > > > Moral: whenever a new uAPI is added, make sure it you make really good compliance > > tests as well. More than that, we should have real users, not just compliance or test tools. > >>>> If this is in place, then min_reqbufs_allocation should be set to a sane number of > >>>> buffers (i.e. typically 3 or 4), and if you want precise control, use VIDIOC_CREATE_BUFS. -- Regards, Laurent Pinchart