Hello,
We (at Kalray) are having difficulties with the initialization of a
remoteproc device, and there seems to be no clean way (at least none
we know of) around this problem.
We need the vrings defined in the resource table to be completely
initialized before the remoteproc device is started. By completely
initialized I mean that the vring device address (the da field) defined
in the resource table must have been changed from 0xff..ff to a proper
address. Currently the remote device is started before this
initialization has completed, which creates a race condition between
Linux and the remoteproc device. (We have a particular architecture in
which the processor running Linux is the same as the embedded processor;
this is why the problem shows up in our case, but probably not when the
processor running Linux is much faster than the embedded processor.)
Our best attempt so far is to configure the virtio ring sooner, i.e.
during subdevice preparation instead of subdevice start. Concretely, in
rproc_handle_vdev, change the code from

    rvdev->subdev.start = rproc_vdev_do_start;

to

    /* The da field in the vring must be initialized before powering up
     * the remoteproc, or else a race condition may occur: the
     * remoteproc may read it before it has been initialized.
     */
    rvdev->subdev.prepare = rproc_vdev_do_start;

This works, but it has undesired side effects. In particular, some
notifications are sent (the remoteproc kick function is called) before
the remote CPU has been started. We are not able to handle them at that
point, so we simply ignore them whenever the state of the remoteproc
is not RUNNING.
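
For illustration, the "ignore kicks unless RUNNING" workaround can be
sketched as a small user-space model. This is not the code from our
driver: the model_* names, the two-state enum and the counters are
assumptions made up for this example, and a real kick handler would
ring a mailbox or doorbell instead of incrementing a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* User-space stand-ins; only the OFFLINE/RUNNING distinction matters here. */
enum model_state { MODEL_OFFLINE, MODEL_RUNNING };

struct model_rproc {
    enum model_state state;
    unsigned int kicks_delivered; /* notifications forwarded to the remote */
    unsigned int kicks_dropped;   /* notifications ignored (remote not up) */
};

/*
 * Hypothetical kick handler: returns true if the notification was
 * forwarded, false if it was dropped because the remote is not running.
 */
static bool model_kick(struct model_rproc *rproc, int vqid)
{
    assert(rproc != NULL);
    (void)vqid; /* a real handler would pick the notification target from it */

    if (rproc->state != MODEL_RUNNING) {
        rproc->kicks_dropped++;
        return false;
    }

    rproc->kicks_delivered++;
    return true;
}
```

A dropped kick is simply lost in this model, which is part of why the
workaround feels fragile: whatever the kick was meant to signal has to
be noticed by the remote on its own once it is up.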
This does seem to solve our problem, but it is a particularly
unpleasant way of doing so; in particular, it might impact existing
remoteproc devices. Do you have any suggestion for a cleaner way to
solve this problem?
FYI, here is our arch specific remote proc implementation:
https://github.com/kalray/linux_coolidge/blob/coolidge/drivers/remoteproc/kvx_remoteproc.c
PS: there seems to be a similar problem when the remote device is being
stopped. The vring buffers are destroyed, and only afterwards is the
remoteproc device stopped. There is once again a race condition, as the
remoteproc device might try to access the vrings after their
destruction by the host. The proposed change is as follows:
in rproc_handle_vdev, change the code from

    rvdev->subdev.stop = rproc_vdev_do_stop;

to

    rvdev->subdev.unprepare = rproc_vdev_do_stop;

Note that this change has much less impact on existing remoteproc
devices, and it is symmetric to the previous change, which arguably
makes the pair of changes more logical.
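
To make the ordering argument concrete, here is a minimal user-space
model of both races. It assumes the phase order described in this
message (subdevice prepare, remote power-up, subdevice start on boot;
subdevice stop, remote power-down, subdevice unprepare on shutdown) and
a worst-case remote that reads the da field the instant it is powered
on and touches the vring right up to power-down. All model_* names and
the example address are hypothetical and are not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_DA_UNSET UINT32_MAX /* stands in for the 0xff..ff placeholder */
#define MODEL_DA_VALID 0x1000u    /* arbitrary example device address */

struct model_rproc {
    uint32_t vring_da;        /* da field as seen in the resource table */
    bool vring_alive;         /* vring buffers currently allocated */
    bool early_vring_setup;   /* true: prepare/unprepare, false: start/stop */
    uint32_t da_seen_at_boot; /* value the remote read when powered on */
    bool touched_dead_vring;  /* remote accessed the vring after teardown */
};

/* Stand-in for rproc_vdev_do_start: the vring gets a proper address. */
static void model_vdev_do_start(struct model_rproc *r)
{
    r->vring_da = MODEL_DA_VALID;
    r->vring_alive = true;
}

/* Stand-in for rproc_vdev_do_stop: the vring is torn down. */
static void model_vdev_do_stop(struct model_rproc *r)
{
    r->vring_alive = false;
    r->vring_da = MODEL_DA_UNSET;
}

static void model_boot(struct model_rproc *r)
{
    r->vring_da = MODEL_DA_UNSET;
    r->vring_alive = false;
    r->touched_dead_vring = false;

    if (r->early_vring_setup)
        model_vdev_do_start(r);       /* subdevice prepare phase */

    r->da_seen_at_boot = r->vring_da; /* remote powers on and reads da */

    if (!r->early_vring_setup)
        model_vdev_do_start(r);       /* subdevice start phase */
}

static void model_shutdown(struct model_rproc *r)
{
    if (!r->early_vring_setup)
        model_vdev_do_stop(r);        /* subdevice stop phase */

    if (!r->vring_alive)
        r->touched_dead_vring = true; /* remote still runs, last access */

    if (r->early_vring_setup)
        model_vdev_do_stop(r);        /* subdevice unprepare phase */
}
```

With early_vring_setup set to false, the model's remote reads
MODEL_DA_UNSET at boot and touches a dead vring at shutdown; with it set
to true, neither happens. A real remote only hits these windows some of
the time, so the model shows the worst-case interleaving rather than
what happens on every boot.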
PS2: I guess that this issue never showed up before because most other
use cases are using fixed addresses in the resource tables and not
dynamically allocated ones at runtime.
Regards,
--
Yann