Re: [PATCH 1/1] rpmsg: virtio_rpmsg_bus - prevent possible race condition

Tim Blechmann <tim@xxxxxxxxxx> · Wed, 13 Sep 2023 09:07:06 +0800

virtio emits a few times:
[  404.XXXXXX] No more buffers in queue
and then
[  404.588722] remoteproc remoteproc0: vq index 1 is interrupted
[  404.597249] virtqueue interrupt with no work for 6d53f13a

Difficult to understand the scheduling with your trace. Could you enable ftrace to
observe it (mailbox interrupts,rpmsg and virtio functions)

does this help (it's a trace that a coworker did a few days ago)?

```
root:.../kernel/tracing# cat trace | head -50
# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 0) + 10.250 us   |        rpmsg_sg_init();
 0) + 42.375 us   |      } /* rpmsg_send_offchannel_raw */
 0) + 51.125 us   |    } /* virtio_rpmsg_send */
 0) + 59.625 us   |  } /* rpmsg_send */
 ------------------------------------------
 0)  stm32mp-414   =>  kworker-365
 ------------------------------------------

 0)               |  rpmsg_recv_done() {
 0)               |    rpmsg_recv_single() {
 0)   5.542 us    |      rpmsg_sg_init();
 0) + 36.542 us   |    }
 0) + 48.250 us   |  }
 ------------------------------------------
 0)  kworker-365   =>  stm32mp-414
 ------------------------------------------

 0)   7.000 us    |  rpmsg_get_buffer_size();
 0)               |  rpmsg_send() {
 0)               |    virtio_rpmsg_send() {
 0)               |      rpmsg_send_offchannel_raw() {
 0)   5.625 us    |        rpmsg_sg_init();
 0) + 37.333 us   |      }
 0) + 45.916 us   |    }
 0) + 54.250 us   |  }
 ------------------------------------------
 0)  stm32mp-414   =>  kworker-365
 ------------------------------------------

 0)               |  rpmsg_recv_done() {
 0)               |    rpmsg_recv_single() {
 0)   6.041 us    |      rpmsg_sg_init();
 0) + 37.292 us   |    }
 0) + 49.542 us   |  }
 ------------------------------------------
 0)  kworker-365   =>  stm32mp-414
 ------------------------------------------

 0)   6.500 us    |  rpmsg_get_buffer_size();
 0)               |  rpmsg_send() {
 0)               |    virtio_rpmsg_send() {
 0)               |      rpmsg_send_offchannel_raw() {
 0)   5.417 us    |        rpmsg_sg_init();
 0) + 36.583 us   |      }
 0) + 45.167 us   |    }
 0) + 53.583 us   |  }
```

apparently this is where it gets stuck (not sure if it's enough context, 
though).

If yes what is not clear to me is that wait_event_interruptible_timeout() seems
to test the condition (so call get_a_tx_buf()) before entering in sleep[1]. A
free TX buffer should be found at this step.

[1]https://elixir.bootlin.com/linux/latest/source/include/linux/wait.h#L534

hmm, interesting. but that would imply a timing issue where calling get_a_tx_buf
twice somehow makes a difference as opposed to calling it only once.

would the "interrupt with no work" point to a different cause for the issue that
we are seeing?

Look like you have not released the buffer on remote side so no buffer in vring
used list.

Do you use the zero copy (rpmsg_hold_rx_buffer) on remote processor?
If yes, please check that you have following fix:
https://github.com/OpenAMP/open-amp/commit/2b6a38a9c6890869960ed41d6a1dc9930c079e0d

yes, we use rpmsg_hold_rx_buffer, however the fix from 
2b6a38a9c6890869960ed41d6a1dc9930c079e0d apparently did not change the 
behaviour.

many thanks,
tim
Attachment:
OpenPGP_signature

Description: OpenPGP digital signature