On Wed 04 Nov 10:16 CST 2020, Bjorn Andersson wrote:

> The reliance on the remoteproc's state for determining when to send
> sysmon notifications to a remote processor is racy with regard to
> concurrent remoteproc operations.
>
> Further more the advertisement of the state of other remote processor to
> a newly started remote processor might not only send the wrong state,
> but might result in a stream of state changes that are out of order.
>
> Address this by introducing state tracking within the sysmon instances
> themselves and extend the locking to ensure that the notifications are
> consistent with this state.
>
> The use of a big lock for all instances will cause contention for
> concurrent remote processor state transitions, but the correctness of
> the remote processors' view of their peers is more important.
>
> Fixes: 1f36ab3f6e3b ("remoteproc: sysmon: Inform current rproc about all active rprocs")
> Fixes: 1877f54f75ad ("remoteproc: sysmon: Add notifications for events")
> Fixes: 1fb82ee806d1 ("remoteproc: qcom: Introduce sysmon")
> Cc: stable@xxxxxxxxxxxxxxx
> Signed-off-by: Bjorn Andersson <bjorn.andersson@xxxxxxxxxx>
> ---
>  drivers/remoteproc/qcom_sysmon.c | 20 ++++++++++++++++----
>  1 file changed, 16 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/remoteproc/qcom_sysmon.c b/drivers/remoteproc/qcom_sysmon.c
> index 9eb2f6bccea6..1e507b66354a 100644
> --- a/drivers/remoteproc/qcom_sysmon.c
> +++ b/drivers/remoteproc/qcom_sysmon.c
> @@ -22,6 +22,8 @@ struct qcom_sysmon {
>  	struct rproc_subdev subdev;
>  	struct rproc *rproc;
>  
> +	int state;
> +
>  	struct list_head node;
>  
>  	const char *name;
> @@ -448,7 +450,10 @@ static int sysmon_prepare(struct rproc_subdev *subdev)
>  		.ssr_event = SSCTL_SSR_EVENT_BEFORE_POWERUP
>  	};
>  
> +	mutex_lock(&sysmon_lock);

This doesn't work, because taking the big lock prevents a concurrently
failing remote processor from reaching smd or glink to indicate that it
is dead and that the first remote's notifications should be aborted/fail
fast.

The result is in most cases that we're stuck here waiting for a timeout,
but there are extreme corner cases where the notification might be
waiting for the dead remote to drain the communication fifo.

Will send a new version that doesn't rely on the big lock, but still
keeps state information consistent.

Regards,
Bjorn
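
The stall described above can be illustrated with a minimal user-space
sketch. It is only a simplified model, not code from qcom_sysmon.c or from
any follow-up patch: big_lock, struct remote and all function names are
hypothetical, and pthread_mutex_trylock() stands in for the blocking
mutex_lock() so that the blocked path shows up as EBUSY in a single thread.
The lockless flag is just one possible way to let a death report through,
shown for contrast.

```c
#include <errno.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a lock shared by all sysmon instances. */
static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical model of a peer remote processor. */
struct remote {
	bool dead_locked;        /* may only be written while holding big_lock */
	atomic_bool dead_atomic; /* may be written without holding big_lock */
};

/*
 * Death report that has to take the big lock first. trylock is used so the
 * failure is visible as EBUSY; a blocking lock would simply hang here.
 */
static int report_death_locked(struct remote *r)
{
	int ret = pthread_mutex_trylock(&big_lock);

	if (ret)
		return ret;
	r->dead_locked = true;
	pthread_mutex_unlock(&big_lock);
	return 0;
}

/* Death report that does not depend on the big lock. */
static void report_death_lockless(struct remote *r)
{
	atomic_store(&r->dead_atomic, true);
}

/*
 * Models a notification sent to `peer` while big_lock is held, with the
 * peer crashing in the middle of the send. Returns -ECANCELED if the death
 * became visible (fail fast) or -ETIMEDOUT if the sender would be left
 * waiting for its timeout.
 */
static int notify_under_big_lock(struct remote *peer, bool lockless_report)
{
	int ret;

	pthread_mutex_lock(&big_lock);

	/* The peer fails while the notification is in flight. */
	if (lockless_report)
		report_death_lockless(peer);
	else
		(void)report_death_locked(peer); /* cannot get the lock */

	if (peer->dead_locked || atomic_load(&peer->dead_atomic))
		ret = -ECANCELED;
	else
		ret = -ETIMEDOUT;

	pthread_mutex_unlock(&big_lock);
	return ret;
}
```

With the locked report the sender never learns that the peer died and ends
up in the timeout case; with the lockless report it can abort right away.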