Re: [RFC]: media: mc: storing query status in variable length buffer likes blob in DRM

Hsia-Jun Li <Randy.Li@xxxxxxxxxxxxx> · Fri, 28 Jul 2023 16:05:20 +0800

On 7/28/23 14:23, Tomasz Figa wrote:
CAUTION: Email originated externally, do not click links or open attachments unless you recognize the sender and know the content is safe.

On Thu, Jul 27, 2023 at 7:10 PM Hsia-Jun Li <Randy.Li@xxxxxxxxxxxxx> wrote:

On 7/27/23 17:07, Tomasz Figa wrote:
CAUTION: Email originated externally, do not click links or open attachments unless you recognize the sender and know the content is safe.

On Tue, Jul 4, 2023 at 5:41 PM Hsia-Jun Li <Randy.Li@xxxxxxxxxxxxx> wrote:

Hello All

This RFC will address the problem that some ioctl() commands would be
called many times until that command got all the all its entries. Those
ioctl() command are:

- VIDIOC_ENUM_FMT

- VIDIOC_QUERYCTRL

- VIDIOC_ENUMSTD and VIDIOC_SUBDEV_ENUMSTD

Generally speaking, all enumeration type commands would lead to
frequently context switch between userspace  and kernel space. A few
enumeration commands listed below may not meet this problem in some
cases, as they could present their entries in a step wise way.

- VIDIOC_ENUM_FRAMESIZES

- VIDIOC_ENUM_FRAMEINTERVALS

Right, it's something that often comes in conversations with user
space developers on how it's difficult to query information from V4L2.
That said, we've mostly found it as an annoyance rather than a
practical problem. Do you have some specific case where the current
approach of enumeration causes a problem for the application (e.g.
performance)?

I have to admit the performance is not a problem for enumeration of
pixel formats. We only do that in the setup step.

The original problem before ext_pixel_format is enumeration of the
colorspace, 12 variant of colorimetries for a pixel format.

After we began to disccus the v4l2 ext APIs, I want to remove the
barrier that we can't update the v4l2 uAPI easily under the size limitation.

A simple solution that we could bring and improve from DRM is the blob
object(struct drm_property_blob).

We could extend the existing ioctl() in this way:

1. VIDIOC_ENUM_EXT_FMT would turn a blob id and the memory size
requirement that usespace should prepare

for storing.

2. Appication call VIDIOC_GETPROPBLOB with blob id and a userspace
pointer which should be enough for storing.

3. V4L2 framework fill the that userptr with context likes this:

struct v4l2_blob_prop {

__u32 version;

__u32 payload_size;

__u32 payload[];

};

4. The parsing of payload would depend on its version which
v4l2_blob_prop.version says, and each entry in the payload could be
variable length, likes this:

struct v4l2_ext_pix_mod_desc {

__u64 modifier;

__u64 allocate_hints; /* heap flags shard by DMA-heap */

__u32 num_of_planes;

__u32 plane_sizes[8];

__u32 colorimetry_flags;

};

struct v4l2_ext_pix_desc {

__u32 fourcc;

__u32 num_of_modifies;

struct v4l2_ext_pix_mod_desc[];

};

Since I'm not familiar with the DRM blob approach it might be an
obvious thing, but what happens when the application was written when
DRM doesn't have such design about API version.
let's say the version was 3, but the kernel it's running on now was
upgraded to version 4?

Nope, the API version is not selected by the userspace but kernel.
It is something likes the protocol version in wayland.

We would careful make a version of API but we don't not take the
responsibility for the back-compatible for the application.

I hope Linus is not seeing this. Linux UAPI backwards compatibility is
a hard guarantee and it's stated multiple times in the kernel
documentation, for example:

https://urldefense.proofpoint.com/v2/url?u=https-3A__www.kernel.org_doc_html_latest_process_4.Coding.html-23regressions&d=DwIFaQ&c=7dfBJ8cXbWjhc0BhImu8wVIoUFmBzj1s88r8EGyM0UY&r=P4xb2_7biqBxD4LGGPrSV6j-jf3C3xlR7PXU-mLTeZE&m=j-KHqunPsFSon38RJcUfONx4RnFtWuIrHQz2FBzE0DonOxTnr2Z3rbIwjD6TCY0B&s=8LzYkGbuPdoo6WHcK34USQs_FQxzNdCXEUUAn4GtUTE&e=

I would be really surprised if DRM was an exception. Let me add
What I said is "DRM doesn't have such design about API version"
At least, there is EGL, they offer many definations for a variable's 
range. We don't have such good thing(People can't count thing like 
OpenMAX IL ,nor vulkan or VA-API).

Why we could support multiple API version of structure in the userspace.
What I thought is that, maintainers would review careful for a version 
of structure. While if we really need to expand the structure, we don't 
need a new ioctl.
Also there is a door for the vendor driver if they want to have their 
own in their product.
+Daniel Vetter as my source of truth regarding the DRM UAPI.

In this design, we could avoid the problem that we could hardly extend
our uAPI for V4L2, all the structure must be a fixed size.

Here are some options design that people want for this RFC:

1. Do we need to call the ioctl() command itself(likes
VIDIOC_ENUM_EXT_FMT) which let the driver to flush its internal property
cache or calling VIDIOC_GETPROPBLOB is enough?

Wouldn't the former make it thread-unsafe?

Why we need thread safe here? V4L2 don't offer such thing.

That's not really true. What is your statement based on?

I want to know when you poll a device that let you dqbuf a buffer. While 
only one buffer is in the done queue. Which thread would get the buffer 
while two threads are triggered to do the dqbuf?
All V4L2 drivers need to allow calling their ioctls from different
threads - in the simplest case by serializing them. Any call to a V4L2
ioctl must return a consistent result. In your case, one thread would
call VIDIOC_ENUM_EXT_FMT, another could then call VIDIOC_ENUM_EXT_CTRL
and then if the first one calls VIDIOC_GETPROPBLOB it would get
results for something completely different than originally requested.

Arguably, the current design of enumerations is not thread-safe
either, because one thread could read supported formats from 0 to n,
then another thread could change some other state of the device and
then the first thread would get potentially different results for
formats n+1 to N. That is a design problem of the API, though. Nobody
was expecting that the list of formats would ever change due to some
potentially unrelated state change when the API was being designed.
(Although that's addressed today by the SRC_FMT change event, making
the user space re-read the destination formats when it notices it.)
You could regard getting blob id and filling the buffer as a critical 
section.
If you did some to trigger the hardware which would update the driver 
state, nobody promise the control you were accessing would be the latest 
value.

I'd imagine something like this:

struct v4l2_property_array {
          __u32 version; // Could be equal to the length of the struct

It should not be. version means the structure of the payload.
          void __user *buf;
          __u32 len;

          __u32 num_formats;
          __u32 num_frame_sizes;
          __u32 num_modifiers;
          __u32 num_controls;
          __u32 num_menus;

         // More fields could be added here for more types of
information to query, with next versions, which could be ignored by
old userspace.
You know what this buffer for(the property that you want) before this
step. That is not necessary to have a union reporting the items in the
payload.
A struct(or an item) in the payload(many items) itself could have a
variable array. How many items in the payload may not help on parsing
for the userspace.

They don't have to have variable arrays if designed as such. Instead of having

struct v4l2_ext_pix_mod_desc modifiers[];
__u32 num_modifiers;

I prefer
{
__u32 fourcc;
...
__u32 num_modifiers;
__u64 modifiers[];
}
That is what DRM does. While that could be the layout of the payload. 
Not the property buffer's head.
in struct v4l2_ext_pix_desc, we can have the struct representing a
supported modifier refer to a pixel format by its fourcc. Sure, the
drawback would be repeating the same information for every <modifier,
fourcc> pair in the rest of the struct, but at least we don't need to
have variable arrays. Another approach would be to have an additional
association array of only <fourcc, modifier> pairs, like in databases.

The format in DRM would be like:
num_mod_pairs
<modifier, mask>[]
num_pixel_fmts
format[]
The mask for a modifier would tell which format support this modifier.
But I don't like that.
Because usaually RGB pixel formats would have only a few modifiers which 
is supported by the GPU.
YUV could have more cache hints or allocation requirements.
And video codec hardware wouldn't support so many pixel formats like DRM 
does.
My main goal is to make it possible to keep backwards compatibility of
returned data, as per the kernel UAPI development guidelines, with a
Usually those command structure for pixel description or buffer 
operation would not change once we settle.
Their layout is not that complex or know to something, maintainer could 
update them easily.
We still have the legacy of V4L1, such update won't cause the fragement 
in drivers. We could even put a new version of API in stagging for a 
while then apply it globally.
reasonable amount of effort and without complicating user space
programming too much. That said, we may still need to have some kind
of size variability in the structs, because otherwise we wouldn't be
able to extend them in the future with more fields, so my proposal
wouldn't work either. Need to think about it for a moment.

};

If out_buf is NULL, the kernel fills len and num_* fields and returns.
Then the userspace allocates a big enough buffer and calls the ioctl
again with buf != NULL, which fills in the buffer pointed by buf as
below.

The maximum requirement size would be confirmed at the first step, a
ioctl() likes VIDIOC_ENUM_EXT_FMT.
{
          (struct v4l2_ext_format[]) {
                 [0] = { ... },
                 [num_formats - 1] = { ... }
          }
          (struct v4l2_frame_size[]) {
                  [0] = { ... },
                  [num_frame_sizes - 1] = { ... }
          }
          // [...]
          (struct v4l2_ctrl_menu[]) {
                  [0] = { ... },
                  [num_menus - 1] = { ... }
          }
}

I think we could also let the user space fill in the version and num_
(presumably with 0) to control what information it wants the kernel to
return.

2. Should we make MC node support this feature only or standard video
node could? A thought from pinchartl is that every driver should have a
MC node even for the stateful driver.

The futuristic model that we envisioned back in time would be that
there is a media device that gives the user space ability to perform
operations on all the interfaces of the respective media controller
(or even all interfaces in the system - but that would require some
security considerations). In such a model, the userspace could prepare
an array (or blob if you prefer) of operations, where each operation
refers to an interface entity ID on which to execute the operation. It
would allow us to submit complete requests (or even many of them) in a
single ioctl, reducing the overhead significantly.

That said, it would be a complete overhaul of the API, so it might be
a step too big? Not sure. We certainly could benefit from V4L3. ;)

I thought of that. But I try to convince people those are not a big
change but essential.

Overhauling the entire querying side of the API is already quite a big
change. It shouldn't really matter if the change is small and big,
assuming a good enough justification is there (and the change meets
the kernel development requirements).

The implementation of RFC could be a foundation for ext pixel and ext
buffer APIs. I would like to know your feedback before we settle the
problem with the ext pixel format.

I feel like this kind of batch operation would be more important for
buffer management, because it could allow dequeuing multiple buffers
in one ioctl, which is actually something that is starting to show up
as a real performance bottleneck for complex devices, such as ISPsI think I make it for MC node from the idea of Pinchart.
(We have a MediaTek ISP that has a really big number of DMA engines,
which accounts to about 200 DQBUFs per frame. Let me add +Yunke Cao
I need to know how it happens.

I can't share the exact details, but basically it's an ISP, so it has
multiple different processing blocks that have their own inputs and
outputs. They can then be relatively freely arranged into a processing
pipeline.

who's been looking into what performance effects it has and how we
could improve it.)

We need to hurry up on settling the design of m2m or v4l2 framework.

Hurry is the biggest foe of API design. ;)
The last thing we want is an API that we'll have to replace again in a
few quarters.
I agree that we need to have people mobilized and invest more time
into modernizing the API, though.

We meet the fourcc namespace problem at least 5 years ago, but we still
didn't fix it.

What's the problem with the fourcc namespace?

Best regards,
Tomasz

--
Hsia-Jun(Randy) Li