race condition issue at remote proc startup

Yann Sionneau <ysionneau@xxxxxxxxx> · Tue, 4 May 2021 11:45:31 +0200

Hello,

We (at Kalray) have some difficulties during initialization of a
remoteproc device, and there seems to be no clean way (at least not one
we know of) out of this problem.

We need the vrings defined in the resource table to be completely
initialized before the remoteproc device is started. By completely
initialized I mean that the vring device address (da) defined in the
resource table must have been changed from 0xff..ff to a proper address.
Currently the remote device is started before this initialization has
completed, which creates a race condition between Linux and the
remoteproc device. (We have a particular architecture in which the
processor running Linux is the same as the embedded processor; this is
why the problem happens in our case but probably not when the processor
running Linux is much faster than the embedded processor.)
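
To make the readiness condition concrete, here is a minimal sketch of
the check a vring entry has to pass before the remote side may use it.
The struct is a simplified, hypothetical stand-in for a vring entry (the
real resource table layout has more fields), and VRING_DA_UNSET assumes
a 32-bit da field holding the 0xff..ff placeholder.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed placeholder: all ones in a 32-bit da field ("0xff..ff"). */
#define VRING_DA_UNSET UINT32_MAX

/* Hypothetical, simplified vring entry; not the real table layout. */
struct toy_vring_entry {
    volatile uint32_t da;   /* device address, written by the host */
    uint32_t num;           /* number of buffers in the ring */
};

/* True once the host has replaced the placeholder with an address. */
static bool toy_vring_is_ready(const struct toy_vring_entry *v)
{
    return v->da != VRING_DA_UNSET;
}
```

The race described above is the remote reading da while this check
would still return false.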

Our best attempt so far is to configure the virtio rings sooner, i.e.
during subdevice preparation instead of subdevice start. That is, in
rproc_handle_vdev, change the code from
    rvdev->subdev.start = rproc_vdev_do_start;
to
    /* The da field in the vring must be initialized before powering up
     * the remoteproc, or else a race condition may occur:
     * the remoteproc may read it before it has been initialized.
     */
    rvdev->subdev.prepare = rproc_vdev_do_start;
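
The effect of moving the callback can be shown with a small user-space
model. This is only a sketch, not kernel code: it assumes that prepare
callbacks run before the remote is powered up and start callbacks run
after, and it models the worst case where the remote reads da as soon as
it has power. All boot_* names and the example address are hypothetical.

```c
#include <stdint.h>

#define BOOT_DA_UNSET UINT32_MAX  /* assumed "0xff..ff" placeholder */

struct boot_rproc;

/* Hypothetical subdevice with the two callbacks discussed above. */
struct boot_subdev {
    int (*prepare)(struct boot_rproc *r);  /* runs before power-up */
    int (*start)(struct boot_rproc *r);    /* runs after power-up */
};

struct boot_rproc {
    uint32_t vring_da;         /* what the host has published */
    uint32_t da_seen_at_boot;  /* what the remote reads first */
    struct boot_subdev subdev;
};

/* Stand-in for rproc_vdev_do_start: publishes the vring address. */
static int boot_vdev_do_start(struct boot_rproc *r)
{
    r->vring_da = 0x80001000u;  /* arbitrary example address */
    return 0;
}

/* Stand-in for powering up the remote, which reads da immediately. */
static void boot_power_up(struct boot_rproc *r)
{
    r->da_seen_at_boot = r->vring_da;
}

/* Boot sequence: prepare subdevice, power up, then start subdevice. */
static int boot_sequence(struct boot_rproc *r)
{
    int ret;

    r->vring_da = BOOT_DA_UNSET;
    if (r->subdev.prepare) {
        ret = r->subdev.prepare(r);
        if (ret)
            return ret;
    }
    boot_power_up(r);
    if (r->subdev.start)
        return r->subdev.start(r);
    return 0;
}
```

With boot_vdev_do_start registered as .start, the modelled remote sees
the placeholder; registered as .prepare, it sees the real address.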

This works but it has undesired side effects. In particular, some
notifications are sent (the remoteproc kick function is called) before
the remote CPU has been started. Since the remote cannot handle them
yet, we simply ignore them if the state of the remoteproc is not
RUNNING.
This seems to solve our problem, but it is a particularly unpleasant
way of doing so, and it might impact existing remoteproc devices. Do
you have any suggestion for a cleaner way to solve this problem?
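
For reference, the workaround of ignoring early notifications amounts to
a guard like the one below. This is a self-contained sketch with
hypothetical kick_* types, not our actual driver code; the kernel's own
state enum has more values than the two modelled here.

```c
#include <stdbool.h>

/* Hypothetical states; only the two needed for the guard. */
enum kick_state { KICK_OFFLINE, KICK_RUNNING };

struct kick_rproc {
    enum kick_state state;
    unsigned int kicks_sent;     /* notifications delivered */
    unsigned int kicks_dropped;  /* notifications ignored */
};

/* Kick handler: only notify a remote that is actually running. */
static bool kick_remote(struct kick_rproc *r, int vqid)
{
    (void)vqid;  /* the queue id does not matter for the guard */
    if (r->state != KICK_RUNNING) {
        r->kicks_dropped++;
        return false;
    }
    r->kicks_sent++;
    return true;
}
```

The dropped kicks are silently lost, which is part of why this feels
like a workaround rather than a fix.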

FYI, here is our arch specific remote proc implementation: 
https://github.com/kalray/linux_coolidge/blob/coolidge/drivers/remoteproc/kvx_remoteproc.c

PS: there seems to be a similar problem when the remote device is being
stopped. The vring buffers are destroyed, and only afterwards is the
remoteproc device stopped. There is once again a race condition, as the
remoteproc device might try to access the vrings after their destruction
by the host. The proposed change is as follows:
In rproc_handle_vdev, change the code from
    rvdev->subdev.stop = rproc_vdev_do_stop;
to
    rvdev->subdev.unprepare = rproc_vdev_do_stop;

Note that this change has much less impact on existing remoteproc
devices and is symmetric to the previous change, which might make it
look more logical.
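
The shutdown ordering can be modelled the same way. Again this is only
a user-space sketch with hypothetical down_* names; it assumes that stop
callbacks run before the remote is powered off and unprepare callbacks
run after.

```c
#include <stdbool.h>

struct down_rproc;

struct down_subdev {
    void (*stop)(struct down_rproc *r);       /* before power-off */
    void (*unprepare)(struct down_rproc *r);  /* after power-off */
};

struct down_rproc {
    bool vring_alive;
    bool remote_running;
    bool stale_access_possible;  /* vrings freed under a live remote */
    struct down_subdev subdev;
};

/* Stand-in for rproc_vdev_do_stop: tears the vrings down. */
static void down_vdev_do_stop(struct down_rproc *r)
{
    if (r->remote_running)
        r->stale_access_possible = true;
    r->vring_alive = false;
}

/* Shutdown sequence: stop subdevice, power off, then unprepare. */
static void down_sequence(struct down_rproc *r)
{
    if (r->subdev.stop)
        r->subdev.stop(r);
    r->remote_running = false;  /* power off the remote */
    if (r->subdev.unprepare)
        r->subdev.unprepare(r);
}
```

Registered as .stop, the teardown happens while the modelled remote is
still running; registered as .unprepare, it happens after power-off.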

PS2: I guess that this issue never showed up before because most other 
use cases are using fixed addresses in the resource tables and not 
dynamically allocated ones at runtime.

Regards,

--
Yann