Re: [PATCH] media: chips-media: wave5: Call v4l2_m2m_job_finish() in job_abort()

Nicolas Dufresne <nicolas.dufresne@xxxxxxxxxxxxx> · Fri, 01 Mar 2024 09:01:35 -0500

Hi,

Le mercredi 28 février 2024 à 17:12 +0100, Mattijs Korpershoek a écrit :
> When aborting a stream, it's possible that v4l2_m2m_cancel_job()
> remains stuck:
> 
> [ 3498.490038][    T1] sysrq: Show Blocked State
> [ 3498.511754][    T1] task:V4L2DecodeCompo state:D stack:12480 pid:2387  ppid:1      flags:0x04000809
> [ 3498.521153][    T1] Call trace:
> [ 3498.524333][    T1]  __switch_to+0x174/0x338
> [ 3498.528611][    T1]  __schedule+0x5ec/0x9cc
> [ 3498.532795][    T1]  schedule+0x84/0xf0
> [ 3498.536630][    T1]  v4l2_m2m_cancel_job+0x118/0x194
> [ 3498.541595][    T1]  v4l2_m2m_streamoff+0x34/0x13c
> [ 3498.546384][    T1]  v4l2_m2m_ioctl_streamoff+0x20/0x30
> [ 3498.551607][    T1]  v4l_streamoff+0x44/0x54
> [ 3498.555877][    T1]  __video_do_ioctl+0x388/0x4dc
> [ 3498.560580][    T1]  video_usercopy+0x618/0xa0c
> [ 3498.565109][    T1]  video_ioctl2+0x20/0x30
> [ 3498.569292][    T1]  v4l2_ioctl+0x74/0x8c
> [ 3498.573300][    T1]  __arm64_sys_ioctl+0xb0/0xec
> [ 3498.577918][    T1]  invoke_syscall+0x60/0x124
> [ 3498.582368][    T1]  el0_svc_common+0x90/0xfc
> [ 3498.586724][    T1]  do_el0_svc+0x34/0xb8
> [ 3498.590733][    T1]  el0_svc+0x2c/0xa4
> [ 3498.594480][    T1]  el0t_64_sync_handler+0x68/0xb4
> [ 3498.599354][    T1]  el0t_64_sync+0x1a4/0x1a8
> [ 3498.603832][    T1] sysrq: Kill All Tasks
> 
> According to job_abort() documentation from v4l2_m2m_ops:
> 
>   After the driver performs the necessary steps, it has to call
>   v4l2_m2m_job_finish() or v4l2_m2m_buf_done_and_job_finish() as
>   if the transaction ended normally.
> 
> This is not done in wave5_vpu_dec_job_abort(). Neither switch_state() nor
> wave5_vpu_dec_set_eos_on_firmware() does this.

The doc said "the driver", not job_abort() specifically ...

> 
> Add the missing call to fix the v4l2_m2m_cancel_job() hangs.
> 
> Fixes: 9707a6254a8a ("media: chips-media: wave5: Add the v4l2 layer")
> Signed-off-by: Mattijs Korpershoek <mkorpershoek@xxxxxxxxxxxx>
> ---
> This has been tested on the AM62Px SK EVM using Android 14.
> See:
>     https://www.ti.com/tool/PROCESSOR-SDK-AM62P
> 
> Note: while this is has not been tested on an upstream linux tree, I
> believe the fix is still valid for mainline and I would like to get
> some feedback on it.
> 
> Thank you in advance for your time.
> ---
>  drivers/media/platform/chips-media/wave5/wave5-vpu-dec.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/media/platform/chips-media/wave5/wave5-vpu-dec.c b/drivers/media/platform/chips-media/wave5/wave5-vpu-dec.c
> index ef227af72348..b03e3633a1bc 100644
> --- a/drivers/media/platform/chips-media/wave5/wave5-vpu-dec.c
> +++ b/drivers/media/platform/chips-media/wave5/wave5-vpu-dec.c
> @@ -1715,6 +1715,7 @@ static void wave5_vpu_dec_device_run(void *priv)
>  static void wave5_vpu_dec_job_abort(void *priv)
>  {
>  	struct vpu_instance *inst = priv;
> +	struct v4l2_m2m_ctx *m2m_ctx = inst->v4l2_fh.m2m_ctx;
>  	int ret;
>  
>  	ret = switch_state(inst, VPU_INST_STATE_STOP);
> @@ -1725,6 +1726,8 @@ static void wave5_vpu_dec_job_abort(void *priv)
>  	if (ret)
>  		dev_warn(inst->dev->dev,
>  			 "Setting EOS for the bitstream, fail: %d\n", ret);
> +
> +	v4l2_m2m_job_finish(inst->v4l2_m2m_dev, m2m_ctx);

Wave5 firmware have no function to cancel pending jobs. By finishing the job
without caring about the firmware state, wave5_vpu_dec_finish_decode() will be
called concurrently to another job. This change will effectively breaks seeking,
and will cause warning and possibly memory corruption.

In principle, setting the EOS flag should be enough to ensure that the drain
sequence is initiated, and that should allow finish_decoder() to be called,
which is the only clean way to get finish_job() to be called.

This driver implementation uses PIC_END operating mode of the IP, that ensure
that each submitted job is assumed to be complete, which means each RUN will
lead to a matching IRQ. We had a similar stall during development which was
fixed with a firmware update, are you sure you have the very latest firmware ?
Any chance you can use v4l2-tracer to share what your software have been doing ?

What you can though though, to fortify this a little, is to introduce a watchdog
here. You can trigger it from abort, and on timeout, you will have to fully
reset and reboot the chip (which is not very fast, you don't want to do this at
all time). When the reset have completed, you will have to carefully reset the
driver state before you can safely continue. You'll need to add thread safe
protection against spurious finish_decode() call, in case the watchdog timeout
raced with the firmware finally coming back to life.

regards,
Nicolas

p.s. you should perhaps trace the firmware job counters to try and understand
more about your specific hang.

>  }
>  
>  static int wave5_vpu_dec_job_ready(void *priv)
> 
> ---
> base-commit: 8c64f4cdf4e6cc5682c52523713af8c39c94e6d5
> change-id: 20240228-wave5-fix-abort-f72d25881cbd
> 
> Best regards,