On Tue, Oct 15, 2024 at 9:57 PM Mukesh Ojha <quic_mojha@xxxxxxxxxxx> wrote:
>
> Multiple call to glink_subdev_stop() for the same remoteproc can happen
> if rproc_stop() fails from Process-A that leaves the rproc state to
> RPROC_CRASHED state later a call to recovery_store from user space in
> Process B triggers rproc_trigger_recovery() of the same remoteproc to
> recover it results in NULL pointer dereference issue in
> qcom_glink_smem_unregister().
>
> There is other side to this issue if we want to fix this via adding a
> NULL check on glink->edge which does not guarantees that the remoteproc
> will recover in second call from Process B as it has failed in the first
> Process A during SMC shutdown call and may again fail at the same call
> and rproc can not recover for such case.

What is the guarantee that the second stop will also fail? I feel this
should be handled in user space: if rproc calls are failing, there is a
bigger issue, and user space should decide what to do if it keeps
happening.

Also, why not set this RPROC_DEFUNCT state in the other callbacks, since
all callbacks from the core to the rproc driver can fail?

>
> Add a new rproc state RPROC_DEFUNCT i.e., non recoverable state of

Even if this state is present, ultimately it will be up to user space to
decide what to do, right?

> remoteproc and the only way to recover from it via system restart.
>
> Process-A                               Process-B
>
>   fatal error interrupt happens
>
>   rproc_crash_handler_work()
>     mutex_lock_interruptible(&rproc->lock);
>     ...
>
>        rproc->state = RPROC_CRASHED;
>     ...
>     mutex_unlock(&rproc->lock);
>
>     rproc_trigger_recovery()
>      mutex_lock_interruptible(&rproc->lock);
>
>       adsp_stop()
>       qcom_q6v5_pas 20c00000.remoteproc: failed to shutdown: -22
>       remoteproc remoteproc3: can't stop rproc: -22
>      mutex_unlock(&rproc->lock);
>
>                                         echo enabled > /sys/class/remoteproc/remoteprocX/recovery
>                                         recovery_store()
>                                          rproc_trigger_recovery()
>                                           mutex_lock_interruptible(&rproc->lock);
>                                            rproc_stop()
>                                             glink_subdev_stop()
>                                               qcom_glink_smem_unregister() ==|
>                                                                              |
>                                                                              V
>                                               Unable to handle kernel NULL pointer dereference
>                                               at virtual address 0000000000000358
>
> Signed-off-by: Mukesh Ojha <quic_mojha@xxxxxxxxxxx>
> ---
> Changes in v3:
>  - Fix kernel test reported error.
>
> Changes in v2:
>  - Removed NULL pointer check instead added a new state to signify
>    non-recoverable state of remoteproc.
>
>  drivers/remoteproc/remoteproc_core.c  | 3 ++-
>  drivers/remoteproc/remoteproc_sysfs.c | 1 +
>  include/linux/remoteproc.h            | 5 ++++-
>  3 files changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
> index f276956f2c5c..c4e14503b971 100644
> --- a/drivers/remoteproc/remoteproc_core.c
> +++ b/drivers/remoteproc/remoteproc_core.c
> @@ -1727,6 +1727,7 @@ static int rproc_stop(struct rproc *rproc, bool crashed)
>  	/* power off the remote processor */
>  	ret = rproc->ops->stop(rproc);
>  	if (ret) {
> +		rproc->state = RPROC_DEFUNCT;
>  		dev_err(dev, "can't stop rproc: %d\n", ret);
>  		return ret;
>  	}
> @@ -1839,7 +1840,7 @@ int rproc_trigger_recovery(struct rproc *rproc)
>  		return ret;
>
>  	/* State could have changed before we got the mutex */
> -	if (rproc->state != RPROC_CRASHED)
> +	if (rproc->state == RPROC_DEFUNCT || rproc->state != RPROC_CRASHED)
>  		goto unlock_mutex;
>
>  	dev_err(dev, "recovering %s\n", rproc->name);
> diff --git a/drivers/remoteproc/remoteproc_sysfs.c b/drivers/remoteproc/remoteproc_sysfs.c
> index 138e752c5e4e..5f722b4576b2 100644
> --- a/drivers/remoteproc/remoteproc_sysfs.c
> +++ b/drivers/remoteproc/remoteproc_sysfs.c
> @@ -171,6 +171,7 @@ static const char * const rproc_state_string[] = {
>  	[RPROC_DELETED]		= "deleted",
>  	[RPROC_ATTACHED]	= "attached",
>  	[RPROC_DETACHED]	= "detached",
> +	[RPROC_DEFUNCT]		= "defunct",
>  	[RPROC_LAST]		= "invalid",
>  };
>
> diff --git a/include/linux/remoteproc.h b/include/linux/remoteproc.h
> index b4795698d8c2..3e4ba06c6a9a 100644
> --- a/include/linux/remoteproc.h
> +++ b/include/linux/remoteproc.h
> @@ -417,6 +417,8 @@ struct rproc_ops {
>   *			has attached to it
>   * @RPROC_DETACHED:	device has been booted by another entity and waiting
>   *			for the core to attach to it
> + * @RPROC_DEFUNCT:	device neither crashed nor responding to any of the
> + *			requests and can only recover on system restart.
>   * @RPROC_LAST:	just keep this one at the end
>   *
>   * Please note that the values of these states are used as indices
> @@ -433,7 +435,8 @@ enum rproc_state {
>  	RPROC_DELETED	= 4,
>  	RPROC_ATTACHED	= 5,
>  	RPROC_DETACHED	= 6,
> -	RPROC_LAST	= 7,
> +	RPROC_DEFUNCT	= 7,
> +	RPROC_LAST	= 8,
>  };
>
>  /**
> --
> 2.34.1
>
>
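
To make the sequence in the quoted trace easier to discuss, here is a
minimal user-space sketch of it. Everything in it is hypothetical: the
model_* names are stand-ins and not kernel API, and it assumes that
subdevices are torn down before the driver's stop callback runs, as the
trace in the commit message suggests. It only illustrates how the state
check behaves once a failed stop moves the state away from "crashed".

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for the kernel types; not the real API. */
enum model_state {
	MODEL_RUNNING,
	MODEL_CRASHED,
	MODEL_OFFLINE,
	MODEL_DEFUNCT,
};

struct model_rproc {
	enum model_state state;
	bool edge_registered;	/* stands in for glink->edge being non-NULL */
	int stop_ret;		/* value the modelled ->stop() callback returns */
	int bad_unregisters;	/* teardown attempts on an already-gone edge */
};

/* Assumed ordering: subdevices are stopped before ->stop() is called. */
static int model_stop(struct model_rproc *r)
{
	if (!r->edge_registered)
		r->bad_unregisters++;	/* the kernel would oops at this point */
	r->edge_registered = false;

	if (r->stop_ret) {
		r->state = MODEL_DEFUNCT;	/* the transition the patch adds */
		return r->stop_ret;
	}

	r->state = MODEL_OFFLINE;
	return 0;
}

static int model_trigger_recovery(struct model_rproc *r)
{
	/* DEFUNCT differs from CRASHED, so this one test also rejects DEFUNCT. */
	if (r->state != MODEL_CRASHED)
		return 0;

	return model_stop(r);
}
```

In this model, a first model_trigger_recovery() call with stop_ret set
to -22 leaves the state at MODEL_DEFUNCT, and a second call returns
early without touching the edge again. If the MODEL_DEFUNCT assignment
is dropped, the state stays MODEL_CRASHED and the second call re-enters
model_stop(), which corresponds to the double teardown in the trace.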