Re: [PATCH v3] remoteproc: Add a new remoteproc state RPROC_DEFUNCT

anish kumar <yesanishhere@xxxxxxxxx> · Wed, 16 Oct 2024 09:04:27 -0700

On Tue, Oct 15, 2024 at 9:57 PM Mukesh Ojha <quic_mojha@xxxxxxxxxxx> wrote:
>
> Multiple calls to glink_subdev_stop() for the same remoteproc can happen
> if rproc_stop() fails in Process-A, which leaves the rproc in the
> RPROC_CRASHED state. A later call to recovery_store() from user space in
> Process-B then triggers rproc_trigger_recovery() for the same remoteproc
> to recover it, which results in a NULL pointer dereference in
> qcom_glink_smem_unregister().
>
> There is another side to this issue: fixing it by adding a NULL check on
> glink->edge does not guarantee that the remoteproc will recover on the
> second call from Process-B. The first call in Process-A failed during the
> SMC shutdown call, so the second may fail at the same call, and the rproc
> cannot recover in that case.

What guarantees that the second stop will also fail? I feel this should
be handled in user space: if rproc calls are failing, there is a bigger
issue, and user space should decide what to do if it keeps happening.
Also, why not set RPROC_DEFUNCT for the other callbacks as well, since
any callback from the core into the rproc driver can fail?
>
> Add a new rproc state, RPROC_DEFUNCT, i.e. a non-recoverable state of

Even if this state is present, ultimately it will be up to user space to
decide what to do, right?
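
To make the user-space point concrete, here is a minimal sketch of the kind
of policy I have in mind. It assumes the "defunct" string from this patch
and the existing "crashed" string in the sysfs state file; the enum, the
function name and the retry limit are hypothetical and not part of any
existing API.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical policy outcomes; not part of any kernel or user-space API. */
enum rproc_action {
	ACTION_NONE,            /* nothing to do */
	ACTION_RETRY_RECOVERY,  /* try recovery again */
	ACTION_RESTART_SYSTEM,  /* give up and restart the system */
};

/*
 * Map the text read from the remoteproc sysfs "state" file to an action.
 * "crashed" is an existing state string; "defunct" is the one proposed
 * in this patch. failed_attempts and max_attempts are assumed to be
 * tracked by the caller.
 */
static enum rproc_action rproc_pick_action(const char *state,
					   unsigned int failed_attempts,
					   unsigned int max_attempts)
{
	size_t len;

	assert(state != NULL);

	/* sysfs output ends with a newline; compare only the word itself */
	len = strcspn(state, "\n");

	if (len == strlen("defunct") && strncmp(state, "defunct", len) == 0)
		return ACTION_RESTART_SYSTEM;

	if (len == strlen("crashed") && strncmp(state, "crashed", len) == 0)
		return failed_attempts >= max_attempts ?
			ACTION_RESTART_SYSTEM : ACTION_RETRY_RECOVERY;

	return ACTION_NONE;
}
```

With something like this, user space still makes the final call whether or
not the kernel exposes the new state; the state only saves it from counting
failed attempts itself.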

> remoteproc; the only way to recover from it is via system restart.
>
>         Process-A                                       Process-B
>
>   fatal error interrupt happens
>
>   rproc_crash_handler_work()
>     mutex_lock_interruptible(&rproc->lock);
>     ...
>
>        rproc->state = RPROC_CRASHED;
>     ...
>     mutex_unlock(&rproc->lock);
>
>     rproc_trigger_recovery()
>      mutex_lock_interruptible(&rproc->lock);
>
>       adsp_stop()
>       qcom_q6v5_pas 20c00000.remoteproc: failed to shutdown: -22
>       remoteproc remoteproc3: can't stop rproc: -22
>      mutex_unlock(&rproc->lock);
>
>                                                 echo enabled > /sys/class/remoteproc/remoteprocX/recovery
>                                                 recovery_store()
>                                                  rproc_trigger_recovery()
>                                                   mutex_lock_interruptible(&rproc->lock);
>                                                    rproc_stop()
>                                                     glink_subdev_stop()
>                                                       qcom_glink_smem_unregister() ==|
>                                                                                      |
>                                                                                      V
>                                                       Unable to handle kernel NULL pointer dereference
>                                                                 at virtual address 0000000000000358
>
> Signed-off-by: Mukesh Ojha <quic_mojha@xxxxxxxxxxx>
> ---
> Changes in v3:
>  - Fix kernel test reported error.
>
> Changes in v2:
>  - Removed the NULL pointer check; instead added a new state to signify
>    the non-recoverable state of the remoteproc.
>
>  drivers/remoteproc/remoteproc_core.c  | 3 ++-
>  drivers/remoteproc/remoteproc_sysfs.c | 1 +
>  include/linux/remoteproc.h            | 5 ++++-
>  3 files changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
> index f276956f2c5c..c4e14503b971 100644
> --- a/drivers/remoteproc/remoteproc_core.c
> +++ b/drivers/remoteproc/remoteproc_core.c
> @@ -1727,6 +1727,7 @@ static int rproc_stop(struct rproc *rproc, bool crashed)
>         /* power off the remote processor */
>         ret = rproc->ops->stop(rproc);
>         if (ret) {
> +               rproc->state = RPROC_DEFUNCT;
>                 dev_err(dev, "can't stop rproc: %d\n", ret);
>                 return ret;
>         }
> @@ -1839,7 +1840,7 @@ int rproc_trigger_recovery(struct rproc *rproc)
>                 return ret;
>
>         /* State could have changed before we got the mutex */
> -       if (rproc->state != RPROC_CRASHED)
> +       if (rproc->state == RPROC_DEFUNCT || rproc->state != RPROC_CRASHED)
>                 goto unlock_mutex;
>
>         dev_err(dev, "recovering %s\n", rproc->name);
> diff --git a/drivers/remoteproc/remoteproc_sysfs.c b/drivers/remoteproc/remoteproc_sysfs.c
> index 138e752c5e4e..5f722b4576b2 100644
> --- a/drivers/remoteproc/remoteproc_sysfs.c
> +++ b/drivers/remoteproc/remoteproc_sysfs.c
> @@ -171,6 +171,7 @@ static const char * const rproc_state_string[] = {
>         [RPROC_DELETED]         = "deleted",
>         [RPROC_ATTACHED]        = "attached",
>         [RPROC_DETACHED]        = "detached",
> +       [RPROC_DEFUNCT]         = "defunct",
>         [RPROC_LAST]            = "invalid",
>  };
>
> diff --git a/include/linux/remoteproc.h b/include/linux/remoteproc.h
> index b4795698d8c2..3e4ba06c6a9a 100644
> --- a/include/linux/remoteproc.h
> +++ b/include/linux/remoteproc.h
> @@ -417,6 +417,8 @@ struct rproc_ops {
>   *                     has attached to it
>   * @RPROC_DETACHED:    device has been booted by another entity and waiting
>   *                     for the core to attach to it
> + * @RPROC_DEFUNCT:     device neither crashed nor responding to any of the
> + *                     requests and can only recover on system restart.
>   * @RPROC_LAST:                just keep this one at the end
>   *
>   * Please note that the values of these states are used as indices
> @@ -433,7 +435,8 @@ enum rproc_state {
>         RPROC_DELETED   = 4,
>         RPROC_ATTACHED  = 5,
>         RPROC_DETACHED  = 6,
> -       RPROC_LAST      = 7,
> +       RPROC_DEFUNCT   = 7,
> +       RPROC_LAST      = 8,
>  };
>
>  /**
> --
> 2.34.1
>
>
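
For reference, the behaviour the patch is aiming for can be modelled in a
few lines of user-space C. This is only a sketch under simplifying
assumptions: the model_* names are hypothetical, there is no locking, the
subdevice stop step is left out, and booting after a successful stop is
assumed to always succeed.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Simplified stand-ins for the rproc states discussed in this thread. */
enum model_state {
	MODEL_OFFLINE,
	MODEL_RUNNING,
	MODEL_CRASHED,
	MODEL_DEFUNCT,
};

struct model_rproc {
	enum model_state state;
	int stop_result;  /* what the driver's stop callback would return */
	int stop_calls;   /* how many times stop was attempted */
};

/* Mirrors the patched rproc_stop(): a failed stop marks the rproc defunct. */
static int model_stop(struct model_rproc *r)
{
	assert(r != NULL);

	r->stop_calls++;
	if (r->stop_result) {
		r->state = MODEL_DEFUNCT;
		return r->stop_result;
	}

	r->state = MODEL_OFFLINE;
	return 0;
}

/* Mirrors the state check in rproc_trigger_recovery(). */
static int model_trigger_recovery(struct model_rproc *r)
{
	int ret;

	/* Anything other than CRASHED (including DEFUNCT) is skipped. */
	if (r->state != MODEL_CRASHED)
		return 0;

	ret = model_stop(r);
	if (ret)
		return ret;

	r->state = MODEL_RUNNING;  /* assumption: boot always succeeds */
	return 0;
}
```

In this model a first recovery attempt with a failing stop returns the error
and leaves the state defunct, and a second attempt returns without calling
stop again, which is the path that reaches glink_subdev_stop() a second time
in the trace above.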