Quoting Evan Green (2019-01-08 10:30:04) > On Tue, Jan 8, 2019 at 9:49 AM Lina Iyer <ilina@xxxxxxxxxxxxxx> wrote: > > > > diff --git a/drivers/soc/qcom/rpmh.c b/drivers/soc/qcom/rpmh.c > > index c7beb6841289..0303a2971d4a 100644 > > --- a/drivers/soc/qcom/rpmh.c > > +++ b/drivers/soc/qcom/rpmh.c > > @@ -80,6 +80,7 @@ void rpmh_tx_done(const struct tcs_request *msg, int r) > > struct rpmh_request *rpm_msg = container_of(msg, struct rpmh_request, > > msg); > > struct completion *compl = rpm_msg->completion; > > + bool free = rpm_msg->needs_free; > > > > rpm_msg->err = r; > > > > @@ -94,7 +95,7 @@ void rpmh_tx_done(const struct tcs_request *msg, int r) > > complete(compl); > > > > exit: > > - if (rpm_msg->needs_free) > > + if (free) > > kfree(rpm_msg); > > } > > > > Hi Lina, > I think that's a worthy fix, too, and is needed to solve the issue you describe. Looks like we need both fixes so I can combine them together. > > But I think Stephen's fix is still needed. In the rpmh_write_batch > scenario, we queue N things into rpmh, but set the same completion for > all of them. If only the first one completes but not the others, then > the loop in rpmh_write_batch will call wait_for_completion_timeout N > times on the same completion, and then goes on to free all N requests, > even though only the first one is actually done and out of the system > (well, almost out of the system, with the bug you noticed above). This code looks an awful lot like kref_put() with a release function that's a kfree() of the rpm message. Would that simplify the code somewhat if we made a refcounter (that probably only counted up to 2) and then properly refcounted the messages? We would have to allocate a bunch of messages for the batch writing API, but I'm not sure that's a big deal either. > > We considered having just one completion on the last transfer, but > then if there's an error part way through you have no way of waiting > on the transfers that did get submitted. So I think N completions are > needed. For that, I'd like to make the batch API know more about the TCS it's filling so it can know how to unwind the state if something fails. From what I can tell this function is implemented at the wrong abstraction level. It should be a lower level function so it can manage the queue and push multiple messages through.