Re: [PATCH v2 2/2] media: imx: vdic: Introduce mem2mem VDI deinterlacer driver

Nicolas Dufresne <nicolas@xxxxxxxxxxxx> · Tue, 15 Oct 2024 13:46:13 -0400

Le mardi 24 septembre 2024 à 17:28 +0200, Marek Vasut a écrit :
> On 9/6/24 11:01 AM, Philipp Zabel wrote:
> 
> Hi,
> 
> > > diff --git a/drivers/staging/media/imx/imx-media-dev.c b/drivers/staging/media/imx/imx-media-dev.c
> > > index be54dca11465d..a841fdb4c2394 100644
> > > --- a/drivers/staging/media/imx/imx-media-dev.c
> > > +++ b/drivers/staging/media/imx/imx-media-dev.c
> > > @@ -57,7 +57,52 @@ static int imx6_media_probe_complete(struct v4l2_async_notifier *notifier)
> > >   		goto unlock;
> > >   	}
> > >   
> > > +	imxmd->m2m_vdic[0] = imx_media_mem2mem_vdic_init(imxmd, 0);
> > > +	if (IS_ERR(imxmd->m2m_vdic[0])) {
> > > +		ret = PTR_ERR(imxmd->m2m_vdic[0]);
> > > +		imxmd->m2m_vdic[0] = NULL;
> > > +		goto unlock;
> > > +	}
> > > +
> > > +	/* MX6S/DL has one IPUv3, init second VDI only on MX6Q/QP */
> > > +	if (imxmd->ipu[1]) {
> > > +		imxmd->m2m_vdic[1] = imx_media_mem2mem_vdic_init(imxmd, 1);
> > > +		if (IS_ERR(imxmd->m2m_vdic[1])) {
> > > +			ret = PTR_ERR(imxmd->m2m_vdic[1]);
> > > +			imxmd->m2m_vdic[1] = NULL;
> > > +			goto uninit_vdi0;
> > > +		}
> > > +	}
> > 
> > Instead of presenting two devices to userspace, it would be better to
> > have a single video device that can distribute work to both IPUs.
> 
> Why do you think so ?
> 
> I think it is better to keep the kernel code as simple as possible, i.e. 
> provide the device node for each m2m device to userspace and handle the 
> m2m device hardware interaction in the kernel driver, but let userspace 
> take care of policy like job scheduling, access permissions assignment 
> to each device (e.g. if different user accounts should have access to 
> different VDICs), or other such topics.

We have run through this topic already for multi-core stateless CODECs. It is
preferable to schedule interchangeable cores inside the Linux kernel.
> 
> > To be fair, we never implemented that for the CSC/scaler mem2mem device
> > either.
> 
> I don't think that is actually a good idea. Instead, it would be better 
> to have two scaler nodes in userspace.

It is impossible for userspace to properly dispatch the work and ensure maximal
performance across multiple process. A long as there is no state that can reside
on the chip of course.

Nicolas

> 
> [...]
> 
> > > +++ b/drivers/staging/media/imx/imx-media-mem2mem-vdic.c
> > > @@ -0,0 +1,997 @@
> > > +// SPDX-License-Identifier: GPL-2.0-or-later
> > > +/*
> > > + * i.MX VDIC mem2mem de-interlace driver
> > > + *
> > > + * Copyright (c) 2024 Marek Vasut <marex@xxxxxxx>
> > > + *
> > > + * Based on previous VDIC mem2mem work by Steve Longerbeam that is:
> > > + * Copyright (c) 2018 Mentor Graphics Inc.
> > > + */
> > > +
> > > +#include <linux/delay.h>
> > > +#include <linux/fs.h>
> > > +#include <linux/module.h>
> > > +#include <linux/sched.h>
> > > +#include <linux/slab.h>
> > > +#include <linux/version.h>
> > > +
> > > +#include <media/media-device.h>
> > > +#include <media/v4l2-ctrls.h>
> > > +#include <media/v4l2-device.h>
> > > +#include <media/v4l2-event.h>
> > > +#include <media/v4l2-ioctl.h>
> > > +#include <media/v4l2-mem2mem.h>
> > > +#include <media/videobuf2-dma-contig.h>
> > > +
> > > +#include "imx-media.h"
> > > +
> > > +#define fh_to_ctx(__fh)	container_of(__fh, struct ipu_mem2mem_vdic_ctx, fh)
> > > +
> > > +#define to_mem2mem_priv(v) container_of(v, struct ipu_mem2mem_vdic_priv, vdev)
> > 
> > These could be inline functions for added type safety.
> 
> Fixed in v3
> 
> [...]
> 
> > > +static void ipu_mem2mem_vdic_device_run(void *_ctx)
> > > +{
> > > +	struct ipu_mem2mem_vdic_ctx *ctx = _ctx;
> > > +	struct ipu_mem2mem_vdic_priv *priv = ctx->priv;
> > > +	struct vb2_v4l2_buffer *curr_buf, *dst_buf;
> > > +	dma_addr_t prev_phys, curr_phys, out_phys;
> > > +	struct v4l2_pix_format *infmt;
> > > +	u32 phys_offset = 0;
> > > +	unsigned long flags;
> > > +
> > > +	infmt = ipu_mem2mem_vdic_get_format(priv, V4L2_BUF_TYPE_VIDEO_OUTPUT);
> > > +	if (V4L2_FIELD_IS_SEQUENTIAL(infmt->field))
> > > +		phys_offset = infmt->sizeimage / 2;
> > > +	else if (V4L2_FIELD_IS_INTERLACED(infmt->field))
> > > +		phys_offset = infmt->bytesperline;
> > > +	else
> > > +		dev_err(priv->dev, "Invalid field %d\n", infmt->field);
> > > +
> > > +	dst_buf = v4l2_m2m_next_dst_buf(ctx->fh.m2m_ctx);
> > > +	out_phys = vb2_dma_contig_plane_dma_addr(&dst_buf->vb2_buf, 0);
> > > +
> > > +	curr_buf = v4l2_m2m_next_src_buf(ctx->fh.m2m_ctx);
> > > +	if (!curr_buf) {
> > > +		dev_err(priv->dev, "Not enough buffers\n");
> > > +		return;
> > > +	}
> > > +
> > > +	spin_lock_irqsave(&priv->irqlock, flags);
> > > +
> > > +	if (ctx->curr_buf) {
> > > +		ctx->prev_buf = ctx->curr_buf;
> > > +		ctx->curr_buf = curr_buf;
> > > +	} else {
> > > +		ctx->prev_buf = curr_buf;
> > > +		ctx->curr_buf = curr_buf;
> > > +		dev_warn(priv->dev, "Single-buffer mode, fix your userspace\n");
> > > +	}
> > > +
> > > +	prev_phys = vb2_dma_contig_plane_dma_addr(&ctx->prev_buf->vb2_buf, 0);
> > > +	curr_phys = vb2_dma_contig_plane_dma_addr(&ctx->curr_buf->vb2_buf, 0);
> > > +
> > > +	priv->curr_ctx = ctx;
> > > +	spin_unlock_irqrestore(&priv->irqlock, flags);
> > > +
> > > +	ipu_cpmem_set_buffer(priv->vdi_out_ch,  0, out_phys);
> > > +	ipu_cpmem_set_buffer(priv->vdi_in_ch_p, 0, prev_phys + phys_offset);
> > > +	ipu_cpmem_set_buffer(priv->vdi_in_ch,   0, curr_phys);
> > > +	ipu_cpmem_set_buffer(priv->vdi_in_ch_n, 0, curr_phys + phys_offset);
> > 
> > This always outputs at a frame rate of half the field rate, and only
> > top fields are ever used as current field, and bottom fields as
> > previous/next fields, right?
> 
> Yes, currently the driver extracts 1 frame from two consecutive incoming 
> fields (previous Bottom, and current Top and Bottom):
> 
> (frame 1 and 3 below is omitted)
> 
>      1  2  3  4
> ...|T |T |T |T |...
> ...| B| B| B| B|...
>       | ||  | ||
>       '-''  '-''
>        ||    ||
>        ||    \/
>        \/  Frame#4
>      Frame#2
> 
> As far as I understand it, this is how the current VDI implementation 
> behaves too, right ?
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/staging/media/imx/imx-media-vdic.c#n207
> 
> > I think it would be good to add a mode that doesn't drop the
> > 
> > 	ipu_cpmem_set_buffer(priv->vdi_in_ch_p, 0, prev_phys);
> > 	ipu_cpmem_set_buffer(priv->vdi_in_ch,   0, prev_phys + phys_offset);
> > 	ipu_cpmem_set_buffer(priv->vdi_in_ch_n, 0, curr_phys);
> > 
> > output frames, right from the start.
> 
> This would make the VDI act as a frame-rate doubler, which would spend a 
> lot more memory bandwidth, which is limited on MX6, so I would also like 
> to have a frame-drop mode (i.e. current behavior).
> 
> Can we make that behavior configurable ? Since this is a mem2mem device, 
> we do not really have any notion of input and output frame-rate, so I 
> suspect this would need some VIDIOC_* ioctl ?
> 
> > If we don't start with that supported, I fear userspace will make
> > assumptions and be surprised when a full rate mode is added later.
> 
> I'm afraid that since the current VDI already does retain input frame 
> rate instead of doubling it, the userspace already makes an assumption, 
> so that ship has sailed.
> 
> But I think we can make the frame doubling configurable ?
> 
> > > +	/* No double buffering, always pick buffer 0 */
> > > +	ipu_idmac_select_buffer(priv->vdi_out_ch, 0);
> > > +	ipu_idmac_select_buffer(priv->vdi_in_ch_p, 0);
> > > +	ipu_idmac_select_buffer(priv->vdi_in_ch, 0);
> > > +	ipu_idmac_select_buffer(priv->vdi_in_ch_n, 0);
> > > +
> > > +	/* Enable the channels */
> > > +	ipu_idmac_enable_channel(priv->vdi_out_ch);
> > > +	ipu_idmac_enable_channel(priv->vdi_in_ch_p);
> > > +	ipu_idmac_enable_channel(priv->vdi_in_ch);
> > > +	ipu_idmac_enable_channel(priv->vdi_in_ch_n);
> > > +}
> 
> [...]
> 
> > > +static int ipu_mem2mem_vdic_setup_hardware(struct ipu_mem2mem_vdic_priv *priv)
> > > +{
> > > +	struct v4l2_pix_format *infmt, *outfmt;
> > > +	struct ipu_ic_csc csc;
> > > +	bool in422, outyuv;
> > > +	int ret;
> > > +
> > > +	infmt = ipu_mem2mem_vdic_get_format(priv, V4L2_BUF_TYPE_VIDEO_OUTPUT);
> > > +	outfmt = ipu_mem2mem_vdic_get_format(priv, V4L2_BUF_TYPE_VIDEO_CAPTURE);
> > > +	in422 = ipu_mem2mem_vdic_format_is_yuv422(infmt->pixelformat);
> > > +	outyuv = ipu_mem2mem_vdic_format_is_yuv(outfmt->pixelformat);
> > > +
> > > +	ipu_vdi_setup(priv->vdi, in422, infmt->width, infmt->height);
> > > +	ipu_vdi_set_field_order(priv->vdi, V4L2_STD_UNKNOWN, infmt->field);
> > > +	ipu_vdi_set_motion(priv->vdi, HIGH_MOTION);
> > 
> > This maps to VDI_C_MOT_SEL_FULL aka VDI_MOT_SEL=2, which is documented
> > as "full motion, only vertical filter is used". Doesn't this completely
> > ignore the previous/next fields and only use the output of the di_vfilt
> > four tap vertical filter block to fill in missing lines from the
> > surrounding pixels (above and below) of the current field?
> 
> Is there a suitable knob for this or shall I introduce a device specific 
> one, like the vdic_ctrl_motion_menu for the current VDIC direct driver ?
> 
> If we introduce such a knob, then it is all the more reason to provide 
> one device node per one VDIC hardware instance, since each can be 
> configured for different motion settings.
> 
> > I think this should at least be configurable, and probably default to
> > MED_MOTION.
> 
> I think to be compatible with the current VDI behavior and to reduce 
> memory bandwidth usage, let's default to the HIGH/full mode. That one 
> produces reasonably good results without spending too much memory 
> bandwidth which is constrained already on the MX6, and if the user needs 
> better image quality, they can configure another mode using the V4L2 
> control.
> 
> [...]
>