Re: [PATCH] remoteproc: qcom: Fix NULL pointer in glink_subdev_stop()

Bjorn Andersson <quic_bjorande@xxxxxxxxxxx> · Tue, 1 Oct 2024 14:15:52 -0700

On Tue, Oct 01, 2024 at 12:06:17PM +0530, Mukesh Ojha wrote:
> On Fri, Sep 27, 2024 at 02:59:09PM -0700, Bjorn Andersson wrote:
> > On Sat, Sep 28, 2024 at 01:07:43AM +0530, Mukesh Ojha wrote:
> > > On Wed, Sep 25, 2024 at 08:41:55PM -0700, Bjorn Andersson wrote:
> > > > On Wed, Sep 25, 2024 at 04:03:51PM +0530, Mukesh Ojha wrote:
> > > > > Multiple call to glink_subdev_stop() for the same remoteproc can happen
> > > > > if rproc_stop() fails from Process-A that leaves the rproc state to
> > > > > RPROC_CRASHED state later a call to recovery_store from user space in
> > > > > Process B triggers rproc_trigger_recovery() of the same remoteproc to
> > > > > recover it results in NULL pointer dereference issue in
> > > > > qcom_glink_smem_unregister().
> > > > > 
> > > > > Fix it by having a NULL check in glink_subdev_stop().
> > > > > 
> > > > > 	Process-A                			Process-B
> > > > > 
> > > > >   fatal error interrupt happens
> > > > > 
> > > > >   rproc_crash_handler_work()
> > > > >     mutex_lock_interruptible(&rproc->lock);
> > > > >     ...
> > > > > 
> > > > >        rproc->state = RPROC_CRASHED;
> > > > >     ...
> > > > >     mutex_unlock(&rproc->lock);
> > > > > 
> > > > >     rproc_trigger_recovery()
> > > > >      mutex_lock_interruptible(&rproc->lock);
> > > > > 
> > > > >       adsp_stop()
> > > > >       qcom_q6v5_pas 20c00000.remoteproc: failed to shutdown: -22
> > > > >       remoteproc remoteproc3: can't stop rproc: -22
> > > > 
> > > > I presume that at this point this remoteproc is in some undefined state
> > > > and the only way to recover is for the user to reboot the machine?
> > > 
> > > Here, 50+ (5s) retry of scm shutdown is failing during decryption of
> > > remote processor memory region, and i don't think, it is anyway to do
> > > with remote processor state here, as a best effort more number of
> > > retries can be tried instead of 50 or wait for some other recovery
> > > command like recovery_store() to let it do the retry again from
> > > beginning.
> > > 
> > 
> > But are you saying that retrying a bit later would allow us to get out
> > of this problem? If we just didn't hit the NULL pointer(s)?
> 
> I am not sure whether adding more number of retries will solve the issue
> and initially thinking from perspective that, it is better to retry than
> to leave the remoteproc in some unknown state however, I do get that
> letting it retry could give unnecessary patching every code e.g., ssr
> notifier callbacks, which is not expecting to be called twice as a
> side-effect of remoteproc unknown state.

That's not what I'm asking you. When the remote processor fails to stop,
can you recover the system by just trying a bit later, or is the
remoteproc dead until reboot?

> > 
> > How long are we talking about here? Is the timeout too short?
> 
> 5sec is very significant amount of time in wait for remote processor to
> get recovered, we should not change this.

Okay, I'm just trying to understand the problem you're trying to solve.

Regards,
Bjorn

> > 
> > > > 
> > > > 
> > > > The check for glink->edge avoids one pitfall following this, but I'd
> > > > prefer to see a solution that avoids issues in this scenario in the
> > > > remoteproc core - rather than working around side effects of this in
> > > > different places.
> > > 
> > > Handling in a remoteproc core means we may need another state something
> > > like "RPROC_UNKNOWN" which can be kept after one attempt of recovery
> > > failure and checking the same during another try return immediately with
> > > some log message.
> > > 
> > 
> > Yes, if we are failing to shut down the remoteproc and there's no way
> > for us to reliably recover from that (e.g. we are not able to reclaim
> > the memory), it seems reasonable that we have to mark it using a new
> > state.
> > 
> > If that is the case, I'd call it RPROC_DEFUNCT (or something like that
> > instead), because while in some "unknown" state, from a remoteproc
> > framework's point of view, it's in a well known (broken) state.
> 
> Ack.
> 
> -Mukesh
> > 
> > Regards,
> > Bjorn