Re: [PATCH v4] remoteproc: core: do pm relax when in RPROC_OFFLINE

Mathieu Poirier <mathieu.poirier@xxxxxxxxxx> · Fri, 21 Oct 2022 13:34:54 -0600

On Wed, 19 Oct 2022 at 23:52, Aiqun(Maria) Yu <quic_aiquny@xxxxxxxxxxx> wrote:
>
> On 10/14/2022 2:03 AM, Mathieu Poirier wrote:
> > On Thu, Oct 13, 2022 at 11:34:42AM -0600, Mathieu Poirier wrote:
> >> On Thu, Oct 13, 2022 at 09:40:09AM +0800, Aiqun(Maria) Yu wrote:
> >>> Hi Mathieu,
> >>>
> >>> On 10/13/2022 4:43 AM, Mathieu Poirier wrote:
> >>>> Please add what has changed from one version to another, either in a cover
> >>>> letter or after the "Signed-off-by".  There are many examples on how to do that
> >>>> on the mailing list.
> >>>>
> >>> Thanks for the information; I will note it and do so next time.
> >>>
> >>>> On Fri, Sep 16, 2022 at 03:12:31PM +0800, Maria Yu wrote:
> >>>>> The RPROC_OFFLINE state indicates that no recovery is in
> >>>>> progress and that there is no later chance to call pm_relax().
> >>>>> When recovering from a crash, rproc->lock is held while the
> >>>>> state goes RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
> >>>>> and only then is rproc->lock released.
> >>>>
> >>>> You are correct - because the lock is held rproc->state should be set to RPROC_RUNNING
> >>>> when rproc_trigger_recovery() returns.  If that is not the case then something
> >>>> went wrong.
> >>>>
> >>>> Function rproc_stop() sets rproc->state to RPROC_OFFLINE just before returning,
> >>>> so we know the remote processor was stopped.  Therefore if rproc->state is set
> >>>> to RPROC_OFFLINE something went wrong in either request_firmware() or
> >>>> rproc_start().  Either way the remote processor is offline and the system is
> >>>> probably in an unknown/unstable state.  As such I don't see how calling pm_relax() can help
> >>>> things along.
> >>>>
> >>> RPROC_OFFLINE is also possible when rproc_shutdown() has been triggered and
> >>> has finished successfully.
> >>> Even in the case of contention between multiple crashes in
> >>> rproc_crash_handler_work(), where the last rproc_trigger_recovery() bailed
> >>> out with only rproc->state == RPROC_OFFLINE, it is still worth calling the
> >>> paired pm_relax().
> >>> The subsystem may still be recoverable by the customer's next call to
> >>> rproc_start(), and this keeps every error path clean with respect to pm
> >>> resources.
> >>>
> >>>> I suggest spending time understanding what leads to the failure when recovering
> >>>> from a crash and address that problem(s).
> >>>>
> >>> In the current case, the customer reports that the issue happened when
> >>> rproc_shutdown() was triggered at about the same time, so it is not caused
> >>> by an error exit from rproc_trigger_recovery().
> >>
> >> That is a very important element to consider and should have been mentioned from
> >> the beginning.  What I see happening is the following:
> >>
> >> rproc_report_crash()
> >>          pm_stay_awake()
> >>          queue_work() // current thread is suspended
> >>
> >> rproc_shutdown()
> >>          rproc_stop()
> >>                  rproc->state = RPROC_OFFLINE;
> >>
> >> rproc_crash_handler_work()
> >>          if (rproc->state == RPROC_OFFLINE)
> >>                  return // pm_relax() is not called
> >>
> >> The right way to fix this is to add a pm_relax() in rproc_shutdown() and
> >> rproc_detach(), along with a very descriptive comment as to why it is needed.
> >
> > Thinking about this further there are more ramifications to consider.  Please
> > confirm the above scenario is what you are facing.  I will advise on how to move
> > forward if that is the case.
> >
> I am not sure whether the situation is clear, so I am resending this email.
>
> The above scenario is what the customer is facing: a crash happened while,
> at the same time, a shutdown was triggered.

Unfortunately this is not enough detail to address a problem as
complex as this one.

> And the device cannot go to the suspend state after that.
> The subsystem can still be started normally afterwards.

If the code flow I pasted above reflects the problem at hand, the
current patch will not be sufficient to address the issue.  If Arnaud
confirms my suspicions we will have to think about a better solution.

>
> >>
> >>
> >>>> Thanks,
> >>>> Mathieu
> >>>>
> >>>>
> >>>>> When the state is RPROC_OFFLINE, a separate rproc_stop() request
> >>>>> has completed, and the crash handler no longer needs to hold the
> >>>>> wakeup source for recovery.
> >>>>>
> >>>>> Signed-off-by: Maria Yu <quic_aiquny@xxxxxxxxxxx>
> >>>>> ---
> >>>>>    drivers/remoteproc/remoteproc_core.c | 11 +++++++++++
> >>>>>    1 file changed, 11 insertions(+)
> >>>>>
> >>>>> diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
> >>>>> index e5279ed9a8d7..6bc7b8b7d01e 100644
> >>>>> --- a/drivers/remoteproc/remoteproc_core.c
> >>>>> +++ b/drivers/remoteproc/remoteproc_core.c
> >>>>> @@ -1956,6 +1956,17 @@ static void rproc_crash_handler_work(struct work_struct *work)
> >>>>>           if (rproc->state == RPROC_CRASHED || rproc->state == RPROC_OFFLINE) {
> >>>>>                   /* handle only the first crash detected */
> >>>>>                   mutex_unlock(&rproc->lock);
> >>>>> +         /*
> >>>>> +          * The RPROC_OFFLINE state indicates that no recovery is in
> >>>>> +          * progress and that there is no later chance to call pm_relax().
> >>>>> +          * When recovering from a crash, rproc->lock is held while the
> >>>>> +          * state goes RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
> >>>>> +          * and only then is rproc->lock released.
> >>>>> +          * RPROC_OFFLINE is only an intermediate state in the recovery
> >>>>> +          * process.
> >>>>> +          */
> >>>>> +         if (rproc->state == RPROC_OFFLINE)
> >>>>> +                 pm_relax(rproc->dev.parent);
> >>>>>                   return;
> >>>>>           }
> >>>>> --
> >>>>> 2.7.4
> >>>>>
> >>>
> >>>
> >>> --
> >>> Thx and BRs,
> >>> Aiqun(Maria) Yu
>
>
> --
> Thx and BRs,
> Aiqun(Maria) Yu