Re: [RFC PATCH V4 1/4] media: v4l2-mem2mem: add v4l2_m2m_suspend, v4l2_m2m_resume

Tomasz Figa <tfiga@xxxxxxxxxxxx> · Wed, 10 Jun 2020 21:26:28 +0200

On Wed, Jun 10, 2020 at 9:14 PM Ezequiel Garcia
<ezequiel@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> On Wed, 10 Jun 2020 at 16:03, Tomasz Figa <tfiga@xxxxxxxxxxxx> wrote:
> >
> > On Wed, Jun 10, 2020 at 03:52:39PM -0300, Ezequiel Garcia wrote:
> > > Hi everyone,
> > >
> > > Thanks for the patch.
> > >
> > > On Wed, 10 Jun 2020 at 07:33, Tomasz Figa <tfiga@xxxxxxxxxxxx> wrote:
> > > >
> > > > On Wed, Jun 10, 2020 at 12:29 PM Hans Verkuil <hverkuil@xxxxxxxxx> wrote:
> > > > >
> > > > > On 21/05/2020 19:11, Tomasz Figa wrote:
> > > > > > Hi Jerry,
> > > > > >
> > > > > > On Wed, Dec 04, 2019 at 08:47:29PM +0800, Jerry-ch Chen wrote:
> > > > > >> From: Pi-Hsun Shih <pihsun@xxxxxxxxxxxx>
> > > > > >>
> > > > > >> Add two functions that can be used to stop new jobs from being queued /
> > > > > >> continue running queued job. This can be used while a driver using m2m
> > > > > >> helper is going to suspend / wake up from resume, and can ensure that
> > > > > >> there's no job running in suspend process.
> > [snip]
> > > > >
> > > > > I assume this will be part of a future patch series that calls these new functions?
> > > >
> > > > The mtk-jpeg encoder series depends on this patch as well, so I guess
> > > > it would go together with whichever is ready first.
> > > >
> > > > I would also envision someone changing the other existing drivers to
> > > > use the helpers, as I'm pretty much sure some of them don't handle
> > > > suspend/resume correctly.
> > > >
> > >
> > > This indeed looks very good. If I understood the issue properly,
> > > the change would be useful for both stateless (e.g. hantro, et al)
> > > and stateful (e.g. coda) codecs.
> > >
> > > Hantro uses pm_runtime_force_suspend, and I believe that
> > > could is enough for proper suspend/resume operation.
> >
> > Unfortunately, no. :(
> >
> > If the decoder is already decoding a frame, that would forcefully power
> > off the hardware and possibly even cause a system lockup if we are
> > unlucky to gate a clock in the middle of a bus transaction.
> >
>
> pm_runtime_force_suspend calls pm_runtime_disable, which
> says:
>
> """
>  Increment power.disable_depth for the device and if it was zero previously,
>  cancel all pending runtime PM requests for the device and wait for all
>  operations in progress to complete.
> """
>
> Doesn't this mean it waits for the current job (if there is one) and
> prevents any new jobs to be issued?
>

I'd love if the PM runtime subsystem handled job management of all the
driver subsystems automatically, but at the moment it's not aware of
any jobs. :) The description says as much as it says - it stops any
internal jobs of the PM subsystem - i.e. asynchronous suspend/resume
requests. It doesn't have any awareness of V4L2 M2M jobs.

> > I just inspected the code now and actually found one more bug in its
> > power management handling. device_run() calls clk_bulk_enable() before
> > pm_runtime_get_sync(), but only the latter is guaranteed to actually
> > power on the relevant power domains, so we end up clocking unpowered
> > hardware.
> >
>
> How about we just move clk_enable/disable to runtime PM?
>
> Since we use autosuspend delay, it theoretically has
> some impact, which is why I was refraining from doing so.
>
> I can't decide if this impact would be marginal or significant.
>

I'd also refrain from doing this. Clock gating corresponds to the
bigger part of the power savings from runtime power management, since
it stops the dynamic power consumption and only leaves the static
leakage. That said, the Hantro IP blocks have some internal clock
gating as well, so it might not be as pronounced, depending on the
custom vendor integration logic surrounding the Hantro hardware.

Actually even if autosuspend is not used, the runtime PM subsystem has
some internal back-off mechanism based on measured power on and power
off latencies. The driver should call pm_runtime_get_sync() first and
then enable any necessary clocks. I can see that currently inside the
resume callback we have some hardware accesses. If those really need
to be there, they should be surrounded with appropriate clock enable
and clock disable calls.

> > >
> > > I'm not seeing any code in CODA to handle this, so not sure
> > > how it's handling suspend/resume.
> > >
> > > Maybe we can have CODA as the first user, given it's a well-maintained
> > > driver and should be fairly easy to test.
> >
> > I remember checking a number of drivers using the m2m helpers randomly
> > and none of them implemented suspend/resume correctly. I suppose that
> > was not discovered because normally the userspace itself would stop the
> > operation before the system is suspended, although it's not an API
> > guarantee.
> >
>
> Indeed. Do you have any recomendations for how we could
> test this case to make sure we are handling it correctly?

I'd say that a simple offscreen command line gstreamer/ffmpeg decode
with suspend/resume loop in another session should be able to trigger
some issues.

Best regards,
Tomasz