Hello,
Thank you Arnaud for your very quick and interesting answer.
Thanks to your pointers Julien identified the *real* issue.
It was an issue in our OpenAMP code which was allocating DAs even in the
"slave" code path, instead of just doing it in the "master" code path.
Thus OpenAMP was actually overwritting the DAs that Linux wrote.
Issue fixed, and without any Linux patch ;)
Thanks a lot!
Regards,
Yann
On 04/05/2021 19:26, Julien Hascoet wrote:
Hello, thanks for the quick answer,
Is the remote side waiting for the vdev status[1] update before accessing the vrings?
[1] https://elixir.bootlin.com/linux/latest/source/include/linux/remoteproc.h#L307
We are using the openamp project that does it for us: https://github.com/OpenAMP/open-amp/blob/master/lib/remoteproc/remoteproc.c#L925
I also checked the defines and they seem aligned on both sides:
linux: include/uapi/linux/virtio_config.h:#define VIRTIO_CONFIG_S_DRIVER_OK 4
openamp: lib/include/openamp/virtio.h:#define VIRTIO_CONFIG_STATUS_DRIVER_OK 0x04
Thanks !
----- Original Message -----
From: "Arnaud POULIQUEN" <arnaud.pouliquen@xxxxxxxxxxx>
To: "Yann Sionneau" <ysionneau@xxxxxxxxx>, linux-remoteproc@xxxxxxxxxxxxxxx
Cc: "Pierre-Yves Kerbrat" <pkerbrat@xxxxxxxxxxxxx>, "Vincent Chardon" <vchardon@xxxxxxxxxxxxx>, "jhascoet" <jhascoet@xxxxxxxxxxxxx>
Sent: Tuesday, May 4, 2021 6:43:47 PM
Subject: Re: race condition issue at remote proc startup
Hello Yann
On 5/4/21 11:45 AM, Yann Sionneau wrote:
Hello,
We (at Kalray) have some difficulties during initialization of a remoteproc
device, and there seem to have no clean way (at least not one we know of) out of
this problem.
We need vring defined in the resource table to be completely initialized before
the remoteproc device is started. By completely initialized I mean that the
vring device address defined in resource table shall be changed from 0xff..ff to
a proper address. Currently the remote device is started before the
initialization has completed, which creates a race condition between Linux and
the remoteproc device. (We have a particular architecture in which the processor
running Linux is the same as the embedded processor, this is why this problem
happens in our case but probably not when the processor running Linux is much
faster than the embedded processor).
Is the remote side waiting for the vdev status[1] update before accessing the
vrings?
[1] https://elixir.bootlin.com/linux/latest/source/include/linux/remoteproc.h#L307
Our best attempt up to now is to configure the virtio ring sooner i.e during
subdevice preparation instead of subdevice start.
i.e. in rproc_handle_vdev change code from
rvdev->subdev.start = rproc_vdev_do_start;
to
/* da field in vring must be initialized before powering up
* the remoterproc, or else race condition may occur.
* Indeed the remoteproc may read it before it has been initialized.
*/
rvdev->subdev.prepare = rproc_vdev_do_start;
This works but it has undesired side effects. In particular some notifications
are sent (the remote proc kick function is being called), but since the remote
CPU has not been started yet we are not able to handle them, thus we simply
ignore them if the state of the remote proc is not RUNNING.
At least this seems to solve our problem, but this is a particularly unpleasant
way of solving the problem, in particular it might impact the existing
remoteproc devices. Do you have any suggestion on some cleaner to way to solve
this problem?
FYI, here is our arch specific remote proc implementation:
https://github.com/kalray/linux_coolidge/blob/coolidge/drivers/remoteproc/kvx_remoteproc.c
PS: there seem to be a similar problem when the remote device is being stopped.
The vring buffer are destroyed and only after is the remote proc device stopped.
There is once again a race condition as the remote proc device might try to
access the vring after their destruction by the host. Proposed change is as follow:
In rproc_handle_vdev change code from
rvdev->subdev.stop = rproc_vdev_do_stop;
to
rvdev->subdev.unprepare = rproc_vdev_do_stop;
Should also be handled with the vdev status.
Note this change has much less impact on existing remote proc and is symmetric
to the previous change thus it might make it sound more logical
PS2: I guess that this issue never showed up before because most other use cases
are using fixed addresses in the resource tables and not dynamically allocated
ones at runtime.
We use dynamic vring address allocation without any issue on STM323MP1 platform,
with the coprocessor started before the main processor running Linux.
Regards,
Arnaud
Regards,