Hello Yann,

On 5/4/21 11:45 AM, Yann Sionneau wrote:
> Hello,
>
> We (at Kalray) have some difficulties during initialization of a remoteproc
> device, and there seems to be no clean way (at least not one we know of) out of
> this problem.
>
> We need vring defined in the resource table to be completely initialized before
> the remoteproc device is started. By completely initialized I mean that the
> vring device address defined in resource table shall be changed from 0xff..ff to
> a proper address. Currently the remote device is started before the
> initialization has completed, which creates a race condition between Linux and
> the remoteproc device. (We have a particular architecture in which the processor
> running Linux is the same as the embedded processor, this is why this problem
> happens in our case but probably not when the processor running Linux is much
> faster than the embedded processor).

Is the remote side waiting for the vdev status[1] update before accessing the
vrings?

[1] https://elixir.bootlin.com/linux/latest/source/include/linux/remoteproc.h#L307

>
> Our best attempt up to now is to configure the virtio ring sooner i.e. during
> subdevice preparation instead of subdevice start.
> i.e. in rproc_handle_vdev change code from
>     rvdev->subdev.start = rproc_vdev_do_start;
> to
>     /* da field in vring must be initialized before powering up
>      * the remoteproc, or else race condition may occur.
>      * Indeed the remoteproc may read it before it has been initialized.
>      */
>     rvdev->subdev.prepare = rproc_vdev_do_start;
>
> This works but it has undesired side effects. In particular some notifications
> are sent (the remote proc kick function is being called), but since the remote
> CPU has not been started yet we are not able to handle them, thus we simply
> ignore them if the state of the remote proc is not RUNNING.
> At least this seems to solve our problem, but this is a particularly unpleasant
> way of solving the problem, in particular it might impact the existing
> remoteproc devices. Do you have any suggestion on some cleaner way to solve
> this problem?
>
> FYI, here is our arch specific remote proc implementation:
> https://github.com/kalray/linux_coolidge/blob/coolidge/drivers/remoteproc/kvx_remoteproc.c
>
>
> PS: there seems to be a similar problem when the remote device is being stopped.
> The vring buffers are destroyed and only after is the remote proc device stopped.
> There is once again a race condition as the remote proc device might try to
> access the vring after their destruction by the host. Proposed change is as follows:
> In rproc_handle_vdev change code from
>     rvdev->subdev.stop = rproc_vdev_do_stop;
> to
>     rvdev->subdev.unprepare = rproc_vdev_do_stop;

Should also be handled with the vdev status.

>
> Note this change has much less impact on existing remote proc and is symmetric
> to the previous change thus it might make it sound more logical
>
> PS2: I guess that this issue never showed up before because most other use cases
> are using fixed addresses in the resource tables and not dynamically allocated
> ones at runtime.

We use dynamic vring address allocation without any issue on the STM32MP1
platform, with the coprocessor started before the main processor running Linux.

Regards,
Arnaud

>
> Regards,
>
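For illustration, the vdev status check suggested above could look roughly like
this on the remote side. This is only a sketch under assumptions: the two struct
layouts mirror struct fw_rsc_vdev and struct fw_rsc_vdev_vring from
include/linux/remoteproc.h, VIRTIO_CONFIG_S_DRIVER_OK is the value from the
virtio specification, and vdev_ready() / vdev_wait_ready() are hypothetical
helper names, not existing kernel or firmware APIs. The extra check that no
vring da is still 0xff..ff is an additional safeguard assumed here, not
something the status protocol itself requires.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VIRTIO_CONFIG_S_DRIVER_OK 4u	/* status bit from the virtio spec */
#define FW_RSC_ADDR_ANY 0xFFFFFFFFu	/* "da not assigned yet" marker */

/* Layouts mirroring include/linux/remoteproc.h (firmware-side copy). */
struct fw_rsc_vdev_vring {
	uint32_t da;
	uint32_t align;
	uint32_t num;
	uint32_t notifyid;
	uint32_t pa;
} __attribute__((packed));

struct fw_rsc_vdev {
	uint32_t id;
	uint32_t notifyid;
	uint32_t dfeatures;
	uint32_t gfeatures;
	uint32_t config_len;
	uint8_t status;
	uint8_t num_of_vrings;
	uint8_t reserved[2];
	struct fw_rsc_vdev_vring vring[];
} __attribute__((packed));

static_assert(sizeof(struct fw_rsc_vdev_vring) == 20, "unexpected vring size");
static_assert(offsetof(struct fw_rsc_vdev, status) == 20, "unexpected layout");

/*
 * Hypothetical helper: return 1 if the host has set DRIVER_OK and every
 * vring carries a real device address, 0 otherwise.
 */
int vdev_ready(const volatile struct fw_rsc_vdev *vdev)
{
	if (!(vdev->status & VIRTIO_CONFIG_S_DRIVER_OK))
		return 0;

	for (uint8_t i = 0; i < vdev->num_of_vrings; i++) {
		if (vdev->vring[i].da == FW_RSC_ADDR_ANY)
			return 0;
	}
	return 1;
}

/*
 * Hypothetical helper: poll the resource table entry at most max_polls
 * times. Returns 0 once the vdev is ready, -1 if the budget runs out.
 */
int vdev_wait_ready(const volatile struct fw_rsc_vdev *vdev,
		    unsigned long max_polls)
{
	while (max_polls--) {
		if (vdev_ready(vdev))
			return 0;
	}
	return -1;
}
```

The firmware would call vdev_wait_ready() on its resource table entry before
touching any vring, and could re-check vdev_ready() before each access so that
a cleared status on the stop path makes it back off. Whether your platform
needs cache maintenance or a memory barrier around the reads is not covered
here.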