On Tue, Oct 01, 2024 at 12:06:17PM +0530, Mukesh Ojha wrote: > On Fri, Sep 27, 2024 at 02:59:09PM -0700, Bjorn Andersson wrote: > > On Sat, Sep 28, 2024 at 01:07:43AM +0530, Mukesh Ojha wrote: > > > On Wed, Sep 25, 2024 at 08:41:55PM -0700, Bjorn Andersson wrote: > > > > On Wed, Sep 25, 2024 at 04:03:51PM +0530, Mukesh Ojha wrote: > > > > > Multiple call to glink_subdev_stop() for the same remoteproc can happen > > > > > if rproc_stop() fails from Process-A that leaves the rproc state to > > > > > RPROC_CRASHED state later a call to recovery_store from user space in > > > > > Process B triggers rproc_trigger_recovery() of the same remoteproc to > > > > > recover it results in NULL pointer dereference issue in > > > > > qcom_glink_smem_unregister(). > > > > > > > > > > Fix it by having a NULL check in glink_subdev_stop(). > > > > > > > > > > Process-A Process-B > > > > > > > > > > fatal error interrupt happens > > > > > > > > > > rproc_crash_handler_work() > > > > > mutex_lock_interruptible(&rproc->lock); > > > > > ... > > > > > > > > > > rproc->state = RPROC_CRASHED; > > > > > ... > > > > > mutex_unlock(&rproc->lock); > > > > > > > > > > rproc_trigger_recovery() > > > > > mutex_lock_interruptible(&rproc->lock); > > > > > > > > > > adsp_stop() > > > > > qcom_q6v5_pas 20c00000.remoteproc: failed to shutdown: -22 > > > > > remoteproc remoteproc3: can't stop rproc: -22 > > > > > > > > I presume that at this point this remoteproc is in some undefined state > > > > and the only way to recover is for the user to reboot the machine? > > > > > > Here, 50+ (5s) retry of scm shutdown is failing during decryption of > > > remote processor memory region, and i don't think, it is anyway to do > > > with remote processor state here, as a best effort more number of > > > retries can be tried instead of 50 or wait for some other recovery > > > command like recovery_store() to let it do the retry again from > > > beginning. > > > > > > > But are you saying that retrying a bit later would allow us to get out > > of this problem? If we just didn't hit the NULL pointer(s)? > > I am not sure whether adding more number of retries will solve the issue > and initially thinking from perspective that, it is better to retry than > to leave the remoteproc in some unknown state however, I do get that > letting it retry could give unnecessary patching every code e.g., ssr > notifier callbacks, which is not expecting to be called twice as a > side-effect of remoteproc unknown state. That's not what I'm asking you. When the remote processor fails to stop, can you recover the system by just trying a bit later, or is the remoteproc dead until reboot? > > > > How long are we talking about here? Is the timeout too short? > > 5sec is very significant amount of time in wait for remote processor to > get recovered, we should not change this. Okay, I'm just trying to understand the problem you're trying to solve. Regards, Bjorn > > > > > > > > > > > > > > The check for glink->edge avoids one pitfall following this, but I'd > > > > prefer to see a solution that avoids issues in this scenario in the > > > > remoteproc core - rather than working around side effects of this in > > > > different places. > > > > > > Handling in a remoteproc core means we may need another state something > > > like "RPROC_UNKNOWN" which can be kept after one attempt of recovery > > > failure and checking the same during another try return immediately with > > > some log message. > > > > > > > Yes, if we are failing to shut down the remoteproc and there's no way > > for us to reliably recover from that (e.g. we are not able to reclaim > > the memory), it seems reasonable that we have to mark it using a new > > state. > > > > If that is the case, I'd call it RPROC_DEFUNCT (or something like that > > instead), because while in some "unknown" state, from a remoteproc > > framework's point of view, it's in a well known (broken) state. > > Ack. > > -Mukesh > > > > Regards, > > Bjorn