Re: [PATCH] soc: qcom: rpmh: Avoid accessing freed memory from batch API

Stephen Boyd <swboyd@xxxxxxxxxxxx> · Tue, 08 Jan 2019 10:52:29 -0800

Quoting Evan Green (2019-01-08 10:30:04)
> On Tue, Jan 8, 2019 at 9:49 AM Lina Iyer <ilina@xxxxxxxxxxxxxx> wrote:
> >
> > diff --git a/drivers/soc/qcom/rpmh.c b/drivers/soc/qcom/rpmh.c
> > index c7beb6841289..0303a2971d4a 100644
> > --- a/drivers/soc/qcom/rpmh.c
> > +++ b/drivers/soc/qcom/rpmh.c
> > @@ -80,6 +80,7 @@ void rpmh_tx_done(const struct tcs_request *msg, int r)
> >         struct rpmh_request *rpm_msg = container_of(msg, struct rpmh_request,
> >                                                     msg);
> >         struct completion *compl = rpm_msg->completion;
> > +       bool free = rpm_msg->needs_free;
> >
> >         rpm_msg->err = r;
> >
> > @@ -94,7 +95,7 @@ void rpmh_tx_done(const struct tcs_request *msg, int r)
> >         complete(compl);
> >
> >  exit:
> > -       if (rpm_msg->needs_free)
> > +       if (free)
> >                 kfree(rpm_msg);
> >  }
> >
> 
> Hi Lina,
> I think that's a worthy fix, too, and is needed to solve the issue you describe.

Looks like we need both fixes so I can combine them together.

> 
> But I think Stephen's fix is still needed. In the rpmh_write_batch
> scenario, we queue N things into rpmh, but set the same completion for
> all of them. If only the first one completes but not the others, then
> the loop in rpmh_write_batch will call wait_for_completion_timeout N
> times on the same completion, and then goes on to free all N requests,
> even though only the first one is actually done and out of the system
> (well, almost out of the system, with the bug you noticed above).

This code looks an awful lot like kref_put() with a release function
that's a kfree() of the rpm message. Would that simplify the code
somewhat if we made a refcounter (that probably only counted up to 2)
and then properly refcounted the messages? We would have to allocate a
bunch of messages for the batch writing API, but I'm not sure that's a
big deal either.

> 
> We considered having just one completion on the last transfer, but
> then if there's an error part way through you have no way of waiting
> on the transfers that did get submitted. So I think N completions are
> needed.

For that, I'd like to make the batch API know more about the TCS it's
filling so it can know how to unwind the state if something fails. From
what I can tell this function is implemented at the wrong abstraction
level. It should be a lower level function so it can manage the queue
and push multiple messages through.