Hi Nicolas, Thank you for your prompt reply. On ven., mars 01, 2024 at 09:01, Nicolas Dufresne <nicolas.dufresne@xxxxxxxxxxxxx> wrote: > Hi, > > Le mercredi 28 février 2024 à 17:12 +0100, Mattijs Korpershoek a écrit : >> When aborting a stream, it's possible that v4l2_m2m_cancel_job() >> remains stuck: >> >> [ 3498.490038][ T1] sysrq: Show Blocked State >> [ 3498.511754][ T1] task:V4L2DecodeCompo state:D stack:12480 pid:2387 ppid:1 flags:0x04000809 >> [ 3498.521153][ T1] Call trace: >> [ 3498.524333][ T1] __switch_to+0x174/0x338 >> [ 3498.528611][ T1] __schedule+0x5ec/0x9cc >> [ 3498.532795][ T1] schedule+0x84/0xf0 >> [ 3498.536630][ T1] v4l2_m2m_cancel_job+0x118/0x194 >> [ 3498.541595][ T1] v4l2_m2m_streamoff+0x34/0x13c >> [ 3498.546384][ T1] v4l2_m2m_ioctl_streamoff+0x20/0x30 >> [ 3498.551607][ T1] v4l_streamoff+0x44/0x54 >> [ 3498.555877][ T1] __video_do_ioctl+0x388/0x4dc >> [ 3498.560580][ T1] video_usercopy+0x618/0xa0c >> [ 3498.565109][ T1] video_ioctl2+0x20/0x30 >> [ 3498.569292][ T1] v4l2_ioctl+0x74/0x8c >> [ 3498.573300][ T1] __arm64_sys_ioctl+0xb0/0xec >> [ 3498.577918][ T1] invoke_syscall+0x60/0x124 >> [ 3498.582368][ T1] el0_svc_common+0x90/0xfc >> [ 3498.586724][ T1] do_el0_svc+0x34/0xb8 >> [ 3498.590733][ T1] el0_svc+0x2c/0xa4 >> [ 3498.594480][ T1] el0t_64_sync_handler+0x68/0xb4 >> [ 3498.599354][ T1] el0t_64_sync+0x1a4/0x1a8 >> [ 3498.603832][ T1] sysrq: Kill All Tasks >> >> According to job_abort() documentation from v4l2_m2m_ops: >> >> After the driver performs the necessary steps, it has to call >> v4l2_m2m_job_finish() or v4l2_m2m_buf_done_and_job_finish() as >> if the transaction ended normally. >> >> This is not done in wave5_vpu_dec_job_abort(). Neither switch_state() nor >> wave5_vpu_dec_set_eos_on_firmware() does this. > > The doc said "the driver", not job_abort() specifically ... Indeed. Seems I wanted to convince myself that this was job_abort() specific. Sorry about that. > >> >> Add the missing call to fix the v4l2_m2m_cancel_job() hangs. >> >> Fixes: 9707a6254a8a ("media: chips-media: wave5: Add the v4l2 layer") >> Signed-off-by: Mattijs Korpershoek <mkorpershoek@xxxxxxxxxxxx> >> --- >> This has been tested on the AM62Px SK EVM using Android 14. >> See: >> https://www.ti.com/tool/PROCESSOR-SDK-AM62P >> >> Note: while this is has not been tested on an upstream linux tree, I >> believe the fix is still valid for mainline and I would like to get >> some feedback on it. >> >> Thank you in advance for your time. >> --- >> drivers/media/platform/chips-media/wave5/wave5-vpu-dec.c | 3 +++ >> 1 file changed, 3 insertions(+) >> >> diff --git a/drivers/media/platform/chips-media/wave5/wave5-vpu-dec.c b/drivers/media/platform/chips-media/wave5/wave5-vpu-dec.c >> index ef227af72348..b03e3633a1bc 100644 >> --- a/drivers/media/platform/chips-media/wave5/wave5-vpu-dec.c >> +++ b/drivers/media/platform/chips-media/wave5/wave5-vpu-dec.c >> @@ -1715,6 +1715,7 @@ static void wave5_vpu_dec_device_run(void *priv) >> static void wave5_vpu_dec_job_abort(void *priv) >> { >> struct vpu_instance *inst = priv; >> + struct v4l2_m2m_ctx *m2m_ctx = inst->v4l2_fh.m2m_ctx; >> int ret; >> >> ret = switch_state(inst, VPU_INST_STATE_STOP); >> @@ -1725,6 +1726,8 @@ static void wave5_vpu_dec_job_abort(void *priv) >> if (ret) >> dev_warn(inst->dev->dev, >> "Setting EOS for the bitstream, fail: %d\n", ret); >> + >> + v4l2_m2m_job_finish(inst->v4l2_m2m_dev, m2m_ctx); > > Wave5 firmware have no function to cancel pending jobs. By finishing the job > without caring about the firmware state, wave5_vpu_dec_finish_decode() will be > called concurrently to another job. This change will effectively breaks seeking, > and will cause warning and possibly memory corruption. Ah. That's not what I intended. This patch would completely break the driver. Thank you for explaining that to me. > > In principle, setting the EOS flag should be enough to ensure that the drain > sequence is initiated, and that should allow finish_decoder() to be called, > which is the only clean way to get finish_job() to be called. > > This driver implementation uses PIC_END operating mode of the IP, that ensure > that each submitted job is assumed to be complete, which means each RUN will > lead to a matching IRQ. We had a similar stall during development which was > fixed with a firmware update, are you sure you have the very latest firmware ? Interesting. I double checked the firmware from linux-firmware: $ ~/work/upstream/linux-firmware/ main md5sum cnm/wave521c_k3_codec_fw.bin 02c5faa5405559bd59a7361a32c2695a cnm/wave521c_k3_codec_fw.bin $ ~/ adb shell md5sum /vendor/firmware/cnm/wave521c_k3_codec_fw.bin 02c5faa5405559bd59a7361a32c2695a /vendor/firmware/cnm/wave521c_k3_codec_fw.bin Which should be: "FW version : 1.0.3" > Any chance you can use v4l2-tracer to share what your software have been doing ? I did not know v4l2-tracer. I tried to run it on android but it seems it segfaults when overriding mmap(). I will continue to try to get some traces. > > What you can though though, to fortify this a little, is to introduce a watchdog > here. You can trigger it from abort, and on timeout, you will have to fully > reset and reboot the chip (which is not very fast, you don't want to do this at > all time). When the reset have completed, you will have to carefully reset the > driver state before you can safely continue. You'll need to add thread safe > protection against spurious finish_decode() call, in case the watchdog timeout > raced with the firmware finally coming back to life. Ok, I can look into that as well. Given that i'm not super familiar with multimedia, this has helped me a lot. Thanks! > > regards, > Nicolas > > p.s. you should perhaps trace the firmware job counters to try and understand > more about your specific hang. I will look into it. > >> } >> >> static int wave5_vpu_dec_job_ready(void *priv) >> >> --- >> base-commit: 8c64f4cdf4e6cc5682c52523713af8c39c94e6d5 >> change-id: 20240228-wave5-fix-abort-f72d25881cbd >> >> Best regards,