Re: race condition issue at remote proc startup

Arnaud POULIQUEN <arnaud.pouliquen@xxxxxxxxxxx> · Tue, 4 May 2021 18:43:47 +0200

Hello Yann

On 5/4/21 11:45 AM, Yann Sionneau wrote:
> Hello,
> 
> We (at Kalray) have some difficulties during initialization of a remoteproc
> device, and there seem to have no clean way (at least not one we know of) out of
> this problem.
> 
> We need vring defined in the resource table to be completely initialized before
> the remoteproc device is started. By completely initialized I mean that the
> vring device address defined in resource table shall be changed from 0xff..ff to
> a proper address. Currently the remote device is started before the
> initialization has completed, which creates a race condition between Linux and
> the remoteproc device. (We have a particular architecture in which the processor
> running Linux is the same as the embedded processor, this is why this problem
> happens in our case but probably not when the processor running Linux is much
> faster than the embedded processor).

Is the remote side waiting for the vdev status[1] update before accessing the
vrings?

[1] https://elixir.bootlin.com/linux/latest/source/include/linux/remoteproc.h#L307

> 
> Our best attempt up to now is to configure the virtio ring sooner i.e during
> subdevice preparation instead of subdevice start.
> i.e. in rproc_handle_vdev change code from
>     rvdev->subdev.start = rproc_vdev_do_start;
> to
>     /* da field in vring must be initialized before powering up
>      * the remoterproc, or else race condition may occur.
>      * Indeed the remoteproc may read it before it has been initialized.
>      */
>     rvdev->subdev.prepare = rproc_vdev_do_start;
> 
> This works but it has undesired side effects. In particular some notifications
> are sent (the remote proc kick function is being called), but since the remote
> CPU has not been started yet we are not able to handle them, thus we simply
> ignore them if the state of the remote proc is not RUNNING.
> At least this seems to solve our problem, but this is a particularly unpleasant
> way of solving the problem, in particular it might impact the existing
> remoteproc devices. Do you have any suggestion on some cleaner to way to solve
> this problem?
> 
> FYI, here is our arch specific remote proc implementation:
> https://github.com/kalray/linux_coolidge/blob/coolidge/drivers/remoteproc/kvx_remoteproc.c
> 
> 
> PS: there seem to be a similar problem when the remote device is being stopped.
> The vring buffer are destroyed and only after is the remote proc device stopped.
> There is once again a race condition as the remote proc device might try to
> access the vring after their destruction by the host. Proposed change is as follow:
> In rproc_handle_vdev change code from
>     rvdev->subdev.stop = rproc_vdev_do_stop;
> to
>     rvdev->subdev.unprepare = rproc_vdev_do_stop;

Should also be handled with the vdev status.

> 
> Note this change has much less impact on existing remote proc and is symmetric
> to the previous change thus it might make it sound more logical
> 
> PS2: I guess that this issue never showed up before because most other use cases
> are using fixed addresses in the resource tables and not dynamically allocated
> ones at runtime.

We use dynamic vring address allocation without any issue on STM323MP1 platform,
with the coprocessor started before the main processor running Linux.

Regards,
Arnaud

> 
> Regards,
>