On Wed, Sep 25, 2024 at 08:41:55PM -0700, Bjorn Andersson wrote: > On Wed, Sep 25, 2024 at 04:03:51PM +0530, Mukesh Ojha wrote: > > Multiple call to glink_subdev_stop() for the same remoteproc can happen > > if rproc_stop() fails from Process-A that leaves the rproc state to > > RPROC_CRASHED state later a call to recovery_store from user space in > > Process B triggers rproc_trigger_recovery() of the same remoteproc to > > recover it results in NULL pointer dereference issue in > > qcom_glink_smem_unregister(). > > > > Fix it by having a NULL check in glink_subdev_stop(). > > > > Process-A Process-B > > > > fatal error interrupt happens > > > > rproc_crash_handler_work() > > mutex_lock_interruptible(&rproc->lock); > > ... > > > > rproc->state = RPROC_CRASHED; > > ... > > mutex_unlock(&rproc->lock); > > > > rproc_trigger_recovery() > > mutex_lock_interruptible(&rproc->lock); > > > > adsp_stop() > > qcom_q6v5_pas 20c00000.remoteproc: failed to shutdown: -22 > > remoteproc remoteproc3: can't stop rproc: -22 > > I presume that at this point this remoteproc is in some undefined state > and the only way to recover is for the user to reboot the machine? Here, 50+ (5s) retry of scm shutdown is failing during decryption of remote processor memory region, and i don't think, it is anyway to do with remote processor state here, as a best effort more number of retries can be tried instead of 50 or wait for some other recovery command like recovery_store() to let it do the retry again from beginning. > > > The check for glink->edge avoids one pitfall following this, but I'd > prefer to see a solution that avoids issues in this scenario in the > remoteproc core - rather than working around side effects of this in > different places. Handling in a remoteproc core means we may need another state something like "RPROC_UNKNOWN" which can be kept after one attempt of recovery failure and checking the same during another try return immediately with some log message. -Mukesh