Re: [PATCH] media: chips-media: wave5: Call v4l2_m2m_job_finish() in job_abort()

Mattijs Korpershoek <mkorpershoek@xxxxxxxxxxxx> · Mon, 04 Mar 2024 17:22:48 +0100

Hi Nicolas,

Thank you for your prompt reply.

On ven., mars 01, 2024 at 09:01, Nicolas Dufresne <nicolas.dufresne@xxxxxxxxxxxxx> wrote:

> Hi,
>
> Le mercredi 28 février 2024 à 17:12 +0100, Mattijs Korpershoek a écrit :
>> When aborting a stream, it's possible that v4l2_m2m_cancel_job()
>> remains stuck:
>> 
>> [ 3498.490038][    T1] sysrq: Show Blocked State
>> [ 3498.511754][    T1] task:V4L2DecodeCompo state:D stack:12480 pid:2387  ppid:1      flags:0x04000809
>> [ 3498.521153][    T1] Call trace:
>> [ 3498.524333][    T1]  __switch_to+0x174/0x338
>> [ 3498.528611][    T1]  __schedule+0x5ec/0x9cc
>> [ 3498.532795][    T1]  schedule+0x84/0xf0
>> [ 3498.536630][    T1]  v4l2_m2m_cancel_job+0x118/0x194
>> [ 3498.541595][    T1]  v4l2_m2m_streamoff+0x34/0x13c
>> [ 3498.546384][    T1]  v4l2_m2m_ioctl_streamoff+0x20/0x30
>> [ 3498.551607][    T1]  v4l_streamoff+0x44/0x54
>> [ 3498.555877][    T1]  __video_do_ioctl+0x388/0x4dc
>> [ 3498.560580][    T1]  video_usercopy+0x618/0xa0c
>> [ 3498.565109][    T1]  video_ioctl2+0x20/0x30
>> [ 3498.569292][    T1]  v4l2_ioctl+0x74/0x8c
>> [ 3498.573300][    T1]  __arm64_sys_ioctl+0xb0/0xec
>> [ 3498.577918][    T1]  invoke_syscall+0x60/0x124
>> [ 3498.582368][    T1]  el0_svc_common+0x90/0xfc
>> [ 3498.586724][    T1]  do_el0_svc+0x34/0xb8
>> [ 3498.590733][    T1]  el0_svc+0x2c/0xa4
>> [ 3498.594480][    T1]  el0t_64_sync_handler+0x68/0xb4
>> [ 3498.599354][    T1]  el0t_64_sync+0x1a4/0x1a8
>> [ 3498.603832][    T1] sysrq: Kill All Tasks
>> 
>> According to job_abort() documentation from v4l2_m2m_ops:
>> 
>>   After the driver performs the necessary steps, it has to call
>>   v4l2_m2m_job_finish() or v4l2_m2m_buf_done_and_job_finish() as
>>   if the transaction ended normally.
>> 
>> This is not done in wave5_vpu_dec_job_abort(). Neither switch_state() nor
>> wave5_vpu_dec_set_eos_on_firmware() does this.
>
> The doc said "the driver", not job_abort() specifically ...

Indeed. Seems I wanted to convince myself that this was job_abort()
specific. Sorry about that.

>
>> 
>> Add the missing call to fix the v4l2_m2m_cancel_job() hangs.
>> 
>> Fixes: 9707a6254a8a ("media: chips-media: wave5: Add the v4l2 layer")
>> Signed-off-by: Mattijs Korpershoek <mkorpershoek@xxxxxxxxxxxx>
>> ---
>> This has been tested on the AM62Px SK EVM using Android 14.
>> See:
>>     https://www.ti.com/tool/PROCESSOR-SDK-AM62P
>> 
>> Note: while this is has not been tested on an upstream linux tree, I
>> believe the fix is still valid for mainline and I would like to get
>> some feedback on it.
>> 
>> Thank you in advance for your time.
>> ---
>>  drivers/media/platform/chips-media/wave5/wave5-vpu-dec.c | 3 +++
>>  1 file changed, 3 insertions(+)
>> 
>> diff --git a/drivers/media/platform/chips-media/wave5/wave5-vpu-dec.c b/drivers/media/platform/chips-media/wave5/wave5-vpu-dec.c
>> index ef227af72348..b03e3633a1bc 100644
>> --- a/drivers/media/platform/chips-media/wave5/wave5-vpu-dec.c
>> +++ b/drivers/media/platform/chips-media/wave5/wave5-vpu-dec.c
>> @@ -1715,6 +1715,7 @@ static void wave5_vpu_dec_device_run(void *priv)
>>  static void wave5_vpu_dec_job_abort(void *priv)
>>  {
>>  	struct vpu_instance *inst = priv;
>> +	struct v4l2_m2m_ctx *m2m_ctx = inst->v4l2_fh.m2m_ctx;
>>  	int ret;
>>  
>>  	ret = switch_state(inst, VPU_INST_STATE_STOP);
>> @@ -1725,6 +1726,8 @@ static void wave5_vpu_dec_job_abort(void *priv)
>>  	if (ret)
>>  		dev_warn(inst->dev->dev,
>>  			 "Setting EOS for the bitstream, fail: %d\n", ret);
>> +
>> +	v4l2_m2m_job_finish(inst->v4l2_m2m_dev, m2m_ctx);
>
> Wave5 firmware have no function to cancel pending jobs. By finishing the job
> without caring about the firmware state, wave5_vpu_dec_finish_decode() will be
> called concurrently to another job. This change will effectively breaks seeking,
> and will cause warning and possibly memory corruption.

Ah. That's not what I intended. This patch would completely break the
driver.
Thank you for explaining that to me.

>
> In principle, setting the EOS flag should be enough to ensure that the drain
> sequence is initiated, and that should allow finish_decoder() to be called,
> which is the only clean way to get finish_job() to be called.

>
> This driver implementation uses PIC_END operating mode of the IP, that ensure
> that each submitted job is assumed to be complete, which means each RUN will
> lead to a matching IRQ. We had a similar stall during development which was
> fixed with a firmware update, are you sure you have the very latest firmware ?

Interesting.

I double checked the firmware from linux-firmware:

$ ~/work/upstream/linux-firmware/ main md5sum cnm/wave521c_k3_codec_fw.bin
02c5faa5405559bd59a7361a32c2695a  cnm/wave521c_k3_codec_fw.bin

$ ~/ adb shell md5sum /vendor/firmware/cnm/wave521c_k3_codec_fw.bin
02c5faa5405559bd59a7361a32c2695a  /vendor/firmware/cnm/wave521c_k3_codec_fw.bin

Which should be: "FW version : 1.0.3"

> Any chance you can use v4l2-tracer to share what your software have been doing ?

I did not know v4l2-tracer.
I tried to run it on android but it seems it segfaults when overriding
mmap().
I will continue to try to get some traces.

>
> What you can though though, to fortify this a little, is to introduce a watchdog
> here. You can trigger it from abort, and on timeout, you will have to fully
> reset and reboot the chip (which is not very fast, you don't want to do this at
> all time). When the reset have completed, you will have to carefully reset the
> driver state before you can safely continue. You'll need to add thread safe
> protection against spurious finish_decode() call, in case the watchdog timeout
> raced with the firmware finally coming back to life.

Ok, I can look into that as well.

Given that i'm not super familiar with multimedia, this has helped me a
lot.

Thanks!

>
> regards,
> Nicolas
>
> p.s. you should perhaps trace the firmware job counters to try and understand
> more about your specific hang.

I will look into it.

>
>>  }
>>  
>>  static int wave5_vpu_dec_job_ready(void *priv)
>> 
>> ---
>> base-commit: 8c64f4cdf4e6cc5682c52523713af8c39c94e6d5
>> change-id: 20240228-wave5-fix-abort-f72d25881cbd
>> 
>> Best regards,