Hi Hans, On 10/31/2024 1:55 PM, Hans Verkuil wrote: > On 29/10/2024 10:04, Hans Verkuil wrote: >> On 29/10/2024 09:17, Hans Verkuil wrote: >>> On 28/10/2024 16:52, Laurent Pinchart wrote: >>>> Hi Hans, >>>> >>>> On Mon, Oct 28, 2024 at 12:10:22PM +0100, Hans Verkuil wrote: >>>>> Hi all, >>>>> >>>>> This mail thread uncovered some corner cases around how many buffers should be allocated >>>>> if VIDIOC_REQBUFS with count = 1 is called: >>>>> >>>>> https://lore.kernel.org/linux-media/20241003-rp1-cfe-v6-0-d6762edd98a8@xxxxxxxxxxxxxxxx/T/#mc2210597d92b5a0f09fabdac2f7307128aaa9bd8 >>>> >>>> I'll repeat below some comments I've made in that thread, as they're >>>> better discussed in the context of this RFC. >>>> >>>>> When it comes to the minimum number of buffers there are a number of limitations: >>>>> >>>>> 1) The DMA engine needs at least N buffers to be queued before it can start. Typically >>>>> this is 0, 1 or 2, and a driver sets this via the vb2_queue min_queued_buffers field. >>>>> So if min_queued_buffers = 1, then the DMA engine needs one buffer at all times to >>>>> DMA to. Allocating just one buffer would mean the DMA engine can never return that >>>>> buffer to userspace (it would just keep recycling the same buffer over and over), so >>>>> the minimum must be min_queued_buffers + 1. >>>> >>>> I think you're mixing hardware and driver constraints here. Drivers can >>>> use scratch buffers to relax the hardware requirements, and allow >>>> userspace operation with less buffers than strictly required by the >>>> hardware. >>>> >>>> The cost of allocating such scratch buffers vary depending on the >>>> device. When an IOMMU is available, or when the device has a line stride >>>> that can be set to 0 and supports race-free programming of the stride >>>> and buffer addresses, the scratch buffer can be as small as a single >>>> page or a single line. In other cases, a full-frame scratch buffer is >>>> required, which is costly, and the decision on whether or not to >>>> allocate such a scratch buffer should probably be taken with userspace >>>> being involved. >>> >>> I honestly don't see why you would want to spend a lot of time on adding >>> scratch buffer support just to save a bit of memory. Is the use-case of >>> capturing just a single buffer so common? To me it seems that it only >>> makes sense to spend effort on this if you only need to capture a single >>> buffer and never need to stream more buffers. >>> >>> Can you describe the use-case of capturing just a single buffer? Is that >>> just for testing libcamera? Or is it something that happens all the time >>> during normal libcamera operation? >>> >>> Supporting scratch buffers is a lot of effort for something that is not >>> needed for normal streaming. >>> >>>> >>>> min_queued_buffers describes how the device operates from a userspace >>>> point of view, so I don't think it should be considered or documented as >>>> being a hardware requirement, but a driver requirement. >>> >>> It's a hardware and/or driver requirement. It is absolutely not a userspace >>> requirement. Normal userspace applications that use VIDIOC_REQBUFS and just >>> stream video will never notice this. >>> >>>> >>>>> 2) Historically VIDIOC_REQBUFS is expected to increase the count value to a number that >>>>> ensures the application can smoothly process the video stream. Typically this will >>>>> be 3 or 4 (if min_queued_buffers == 2): min_queued_buffers are used by the DMA engine, >>>>> one buffer is queued up in vb2, ready to be used by the DMA engine as soon as it >>>>> returns a filled buffer to userspace, and one buffer is processed by userspace. >>>>> >>>>> This is to support applications that call VIDIOC_REQBUFS with count = 1 and leave it >>>>> to the driver to increment it to a workable value. >>>> >>>> Do we know what those applications are ? I'm not disputing the fact that >>>> this may need to be supported to avoid breaking old userspace, but I >>>> also think this feature should be phased out for new drivers, especially >>>> drivers that require a device-specific userspace and therefore won't >>>> work out of the box with old applications. >>> >>> xawtv is one: it will call REQBUFS with count = 2 (so this would fail for >>> any driver that sets min_queued_buffers to 2), and with count = 1 if it wants >>> to capture just a single frame. >>> >>> 'git grep min_queued_buffers|grep -v videobuf|wc' gives me 83 places where it is >>> set. Some of those are likely wrong (min_queued_buffers has been abused as a >>> replacement for min_reqbufs_allocation), but still that's quite a lot. >>> >>> Mostly these are older drivers for hardware without an IOMMU and typically for >>> SDTV capture. So memory is not a consideration for those drivers since a >>> SDTV buffer is quite small. >>> >>>> >>>>> 3) Stateful codecs in particular have additional requirements beyond the DMA engine >>>>> limits due to the fact that they have to keep track of reference buffers and other >>>>> codec limitations. As such more buffers are needed, and that number might also vary >>>>> based on the specific codec used. The V4L2_CID_MIN_BUFFERS_FOR_CAPTURE/OUTPUT >>>>> controls are used to report that. Support for this is required by the stateful codec >>>>> API. >>>>> >>>>> The documentation of these controls suggest that these are generic controls, but >>>>> as of today they are only used by stateful codec drivers. >>>>> >>>>> 4) Some corner cases (mainly/only SDR, I think) where you need more than the usual >>>>> 3 or 4 buffers since the buffers arrive at a high frequency. >>>> >>>> High frame rates is an important feature, but it's also a can of worms. >>>> V4L2 is lacking the ability to batch multiple frames, we will have to >>>> address that. Hopefully it could be decoupled from this RFC. >>> >>> It's a separate issue indeed. I just mentioned it because I know SDR drivers >>> use this. They are rarely used, though. >>> >>>> >>>>> Rather than have drivers try to correct the count value (typically incorrectly), the >>>>> vb2_queue min_reqbufs_allocation field was added to set the minimum number of >>>>> buffers that VIDIOC_REQBUFS should allocate if count is less than that. >>>> >>>> Even if I dislike this feature, I agree it's better implemented through >>>> min_reqbufs_allocation than by manual calculations in drivers. >>>> >>>>> VIDIOC_CREATE_BUFS is not affected by that: if you use CREATE_BUFS you take full control >>>>> of how many buffers you want to create. It might create fewer buffers if you run out of >>>>> memory, but never more than requested. >>>>> >>>>> But what is missing is that if you use CREATE_BUFS you need to know the value of >>>>> min_queued_buffers + 1, and that is not exposed. >>>>> >>>>> I would propose to add a min_num_buffers field to struct v4l2_create_buffers >>>>> and add a V4L2_BUF_CAP_SUPPORTS_MIN_NUM_BUFFERS flag to signal the presence of >>>>> that field. And vb2 can set it to min_queued_buffers + 1. >>>> >>>> This would require allocating a buffer first to get the value. Wouldn't >>>> a read-only control be better ? >>> >>> No. You can call CREATE_BUFS with count = 0: in that case it does nothing, >>> except filling in all those capabilities. It was designed with that in mind >>> so you have an ioctl that can return all that information. >>> >>>> >>>> Furthermore, I would rather provide the min_queued_buffers value instead >>>> of min_queued_buffers + 1. The V4L2 API should provide userspace with >>>> information it needs to make informed decisions, but not make those >>>> decisions in behalf of userspace. It's up to applications to add 1 or >>>> more buffers depending on their use case. >>> >>> I would definitely want more opinions on this. What's the point of returning >>> min_queued_buffers and then creating that many buffers and still not be able >>> to stream? >>> >>> Can you think of a scenario (e.g. in libcamera or elsewhere) where that makes >>> sense? >>> >>> Also, will the average V4L2 user have the knowledge to understand that? You >>> have that knowledge, but I think for anyone else it would be really confusing. >>> >>>> >>>> I think we also need to discuss policies regarding scratch buffer >>>> allocation in the context of this RFC. When the hardware supports small >>>> scratch buffers, I would like to make it mandatory for drivers to do so >>>> and support min_queued_buffers = 0. >>> >>> I would first like to know the use-case (as I mentioned above). >>> >>> For the type of drivers I mostly work with (video receivers), it would just >>> be a lot of work for no gain. But perhaps for camera pipelines it does make >>> sense? >>> >>>> When scratch buffers are expensive, do we want to still support them in >>>> the kernel, perhaps in a way controlled by userspace ? A userspace that >>>> can guarantee it will always provide min_queued_buffers + 1 buffers >>>> could indicate so and avoid scratch buffer allocation, while a userspace >>>> that can't provide that guarantee would get scratch buffers from the >>>> kernel. >>> >>> That is really the difference between using VIDIOC_REQBUFS and VIDIOC_CREATE_BUFS. >>> I.e., userspace can already choose this. >>> >>> Just to clarify the reason for this RFC: the current situation is messy. There >>> is a lot of history and a lot of older drivers do not always do the right thing. >>> >>> With this RFC I would like to get a consensus of how it should work. After that >>> I want to implement any missing bits and improve the documentation, and finally >>> go through the drivers and at least try to make them behave consistently. >>> >>> Also I want to improve v4l2-compliance to test more corner cases, especially >>> if you use CREATE_BUFS instead of REQBUFS (I already have a patch for that >>> ready). >>> >>> The work Benjamin did on increasing the max number of supported buffers and the >>> REMOVE_BUFS ioctl uncovered a lot of that messy history, and it is clear we need >>> to try and clarify how it should work. >>> >>>>> The second proposal is to explicitly document that the V4L2_CID_MIN_BUFFERS_FOR_CAPTURE/OUTPUT >>>>> are for stateful codec support only, at least for now. >>> >>> I just discovered that v4l2-compliance and v4l2-ctl do not honor these controls >>> for stateful codecs. That's something that needs to be fixed. >>> >>> There is also one other item that I would like to discuss: the vb2 queue_setup >>> callback is currently used for both REQBUFS and CREATE_BUFS, and it remains >>> confusing for drivers how to use it exactly. I am inclined to redesign that >>> part, most likely splitting it in two: either one callback for REQBUFS and one >>> for CREATE_BUFS, or alternatively, one callback when allocating buffers for >>> the first time (so REQBUFS and when CREATE_BUFS is called for the first time, >>> i.e. when no buffers are allocated yet), and one callback when adding additional >>> buffers. I would have to think about this, and probably experiment a bit. >> >> Actually, this really has to be addressed since this is broken: you can call >> CREATE_BUFS as a replacement for REQBUFS, but it will act like REQBUFS and >> the requested sizes are not honored. > > After digging deeper into this I realized that it actually works correctly > as long as the driver doesn't do anything weird in queue_setup. > > E.g. I am not so sure the venus driver will do this right, see > drivers/media/platform/qcom/venus/vdec.c > > If *num_planes is non-zero (i.e. called from CREATE_BUFS), then it will use > e.g. inst->input_buf_size, and that is only set if queue_setup is called with > *num_planes set to zero (i.e. called from REQBUFS). > inst->input_buf_size gets updated in S_FMT[1] so it will be available even if REQBUF is not called. And in queue_setup (for CREATE_BUF) driver is validating the size of buffers being allocated. Anything additional is expected from driver here? Also, in my understanding, the original concern of not having a way to notify client about the minimum buffer requirement, in case of CREATE_BUFS still exist. [1]https://elixir.bootlin.com/linux/v6.12-rc5/source/drivers/media/platform/qcom/venus/vdec.c#L365 Thanks, Dikshita > So I suspect that calling CREATE_BUFS instead of REQBUFS will cause problems. > > I'm CC-ing Stanimir and Dikshita (for the upcoming iris driver, which has the > same issue). > > It is rare that drivers do something like that in queue_setup, so most drivers > should be fine. If a driver needs to do something like that, then they should > first call vb2_get_num_buffers() and check if this is 0: that's when you can > do the extra code that only should be done when the first buffer is allocated. > > Regards, > > Hans > >> >> I added tests for this to v4l2-compliance (locally only), and it fails on >> everything. >> >> It should not be news to anyone that I hate the CREATE_BUFS ioctl API. I posted >> an RFC for a VIDIOC_ADD_BUFS replacement earlier this year: >> >> https://lore.kernel.org/linux-media/243a66ad-6dff-4a43-ab03-e01d1038fe8a@xxxxxxxxx/ >> >> I wonder if we should restrict CREATE_BUFS to only be used after calling >> REQBUFS, and to a proper job for ADD_BUFS. Because given the vb2 design flaw >> I am not sure if it can be worked around. Or if we even want that. >> >> What a mess. >> >> Looking at the kernel history, CREATE_BUFS was added back in 2011 and the first >> very simple v4l2-compliance tests were added in 2012. >> >> Moral: whenever a new uAPI is added, make sure it you make really good compliance >> tests as well. >> >> Regards, >> >> Hans >> >>> >>> Regards, >>> >>> Hans >>> >>>>> >>>>> If this is in place, then min_reqbufs_allocation should be set to a sane number of >>>>> buffers (i.e. typically 3 or 4), and if you want precise control, use VIDIOC_CREATE_BUFS. >>>> >>> >>> >> >> > >