On 9/13/23 12:57, Agostiño Carballeira wrote: > Hello! > > I am the main developer of the CM4 firmware for this project. > First of all thanks for taking the time to analyse the trace. > Further analysis on my side has shown that the CM4 is not completely stalled in > this situation, but it is stuck on a busywait loop within the MAILBOX_Notify > function, awaiting a window to send a "buffer used" notification to the CA7. So > it seems that the mailbox is locked both ways and neither side is giving way to > unclog the traffic jam. The Cortex-M4 is probably blocked, waiting for Linux mailbox to acknowledge a mailbox notification[1]. Are you 100% sure that you have never exited this loop during the issue? Could you provide the value of the 'id' variable and the call stack? That would mean that the mailbox has not been acknowledged by Linux [2] or [3]. I don't understand how it could be possible... [1] https://github.com/STMicroelectronics/STM32CubeMP1/blob/master/Middlewares/Third_Party/OpenAMP/mw_if/platform_if/mbox_ipcc_template.c#L182 [2] https://elixir.bootlin.com/linux/latest/source/drivers/mailbox/stm32-ipcc.c#L105 [3] https://elixir.bootlin.com/linux/latest/source/drivers/mailbox/stm32-ipcc.c#L134 > Interestingly, when we replace rpmsg_send by rpmsg_trysend + busywait loop, this > mutual stall doesn't happen at all. What do you mean by busywait? Do you add a delay between 2 rpmsg_trysend() calls? If yes, you probably add delay that avoid the issue. That said rpmsg_trysend is recommended for baremetal to avoid to block the system. > Does that give you any clues? > > Thanks > Agos > > On Wed, Sep 13, 2023 at 10:47 AM Tim Blechmann <tim@xxxxxxxxxx > <mailto:tim@xxxxxxxxxx>> wrote: > > many thanks for your analysis, very interesting. > > > please find below an extract of your trace with my analysis: > > > > > > stm32mp1_bulk_p-390 [001] ..... 907.241226: rpmsg_send > > <-rpmsg_intercore_send_buffer.constprop.0 > > stm32mp1_bulk_p-390 [001] ..... 907.241228: virtio_rpmsg_send > <-rpmsg_send > > stm32mp1_bulk_p-390 [001] ..... 907.241237: virtqueue_enable_cb > > <-rpmsg_send_offchannel_raw > > stm32mp1_bulk_p-390 [001] ..... 907.241239: > virtqueue_enable_cb_prepare > > > > At this point seems that no more TX-buffer > > > > <-rpmsg_recv_single > > kworker/0:4-67 [000] ..... 907.242533: vring_interrupt > > <-rproc_vq_interrupt > > kworker/0:4-67 [000] ..... 907.242536: rpmsg_xmit_done > > > > Here you receive an interrupt indicating that TX buffer has been released by > > the remote. that's the expected behavior. > > > > > > <-rpmsg_send_offchannel_raw > > stm32mp1_bulk_p-390 [000] ..... 984.054941: rpmsg_send > > <-rpmsg_intercore_send_buffer.constprop.0 > > stm32mp1_bulk_p-390 [000] ..... 984.054943: virtio_rpmsg_send > <-rpmsg_send > > stm32mp1_bulk_p-390 [000] ..... 984.054956: virtqueue_enable_cb > > <-rpmsg_send_offchannel_raw > > stm32mp1_bulk_p-390 [000] ..... 984.054958: > virtqueue_enable_cb_prepare > > <-virtqueue_enable_cb > > stm32mp1_bulk_p-390 [000] ..... 999.398667: virtqueue_disable_cb > > <-rpmsg_send_offchannel_raw > > stm32mp1_bulk_p-390 [000] ..... 999.414840: rpmsg_send > > <-rpmsg_intercore_send_buffer.constprop.0 > > stm32mp1_bulk_p-390 [000] ..... 999.414843: virtio_rpmsg_send > <-rpmsg_send > > stm32mp1_bulk_p-390 [000] ..... 999.414855: virtqueue_enable_cb > > <-rpmsg_send_offchannel_raw > > stm32mp1_bulk_p-390 [000] ..... 999.414857: > virtqueue_enable_cb_prepare > > > > Here you have again no more TX buffer. From this point there is no more > activity > > neither in TX nor in RX until the timeout of 15 seconds. > > If you have a look to rproc_vq_interrupt the last one occurs at 907.242533 > > > > > > As you have no more virtqueue interrupts call for both directions, the > issue is > > probably either in the Cortex-M firmware, which seems to be stalled, or > due to a > > disable of the IRQs in Linux. > > afaict we can rule out a complete stall of the cortex-m firmware: if we > change the rpmsg_send to a rpmsg_trysend/msleep loop, the trysend will > succeed to get a buffer after a few iterations. > > > or due to a disable of the IRQs in Linux. > > do you have some recommendations how we could trace this? > > many thanks, > tim > > > <-virtqueue_enable_cb > > stm32mp1_bulk_p-390 [000] ..... 1014.758678: virtqueue_disable_cb > > <-rpmsg_send_offchannel_raw > > stm32mp1_bulk_p-390 [000] ..... 1014.774802: rpmsg_send > > <-rpmsg_intercore_send_buffer.constprop.0 > > stm32mp1_bulk_p-390 [000] ..... 1014.774804: virtio_rpmsg_send > <-rpmsg_send > > stm32mp1_bulk_p-390 [000] ..... 1014.774815: virtqueue_enable_cb > > > > > -- > > Agostiño Carballeira > > Senior Embedded Software Engineer > > agostino.carballeira@xxxxxxxxxxxxxxxxxxxxxx > <mailto:agostino.carballeira@xxxxxxxxxxxxxxxxxxxxxx> > > > Native Instruments <https://www.native-instruments.com>– now including iZotope, > Plugin Alliance, and Brainworx