On 9/9/23 08:28, Tim Blechmann wrote:
>>>>>> when we cannot get a tx buffer (`get_a_tx_buf`), `rpmsg_upref_sleepers`
>>>>>> enables the tx-complete interrupt.
>>>>>> however, if the interrupt is executed after `get_a_tx_buf` and before
>>>>>> `rpmsg_upref_sleepers`, we may miss the tx-complete interrupt and sleep
>>>>>> for the full 15 seconds.
>>>>>
>>>>>
>>>>> Is there any reason why your co-processor is unable to release the TX
>>>>> RPMSG buffers for 15 seconds? If not, you should first determine the
>>>>> reason why it is stalled.
>>>>
>>>> Arnaud's concern is valid. If the remote processor can't consume a buffer
>>>> within 15 seconds, something is probably wrong.
>>>>
>>>> That said, I believe your assessment of the situation is correct. *If* the
>>>> TX callback is disabled and there is no buffer available, there is a window
>>>> of opportunity between calls to get_a_tx_buf() and rpmsg_upref_sleepers()
>>>> for an interrupt to arrive in function rpmsg_send_offchannel_raw().
>>>
>>> the remote processor certainly releases the tx buffer, and according to my
>>> tracing the `vring_interrupt` fires immediately before `rpmsg_send` enters
>>> `rpmsg_upref_sleepers`.
>>
>>
>> If I understood your point correctly, the issue occurs in the following race
>> condition:
>>
>> - all the Tx buffers are used
>> - in rpmsg_send_offchannel_raw(), we try to get a buffer using
>>   get_a_tx_buf(vrp), which returns NULL
>> - rpmsg_xmit_done is called as a Tx buffer is released by the remote
>>   processor and is now free
>> - in rpmsg_send_offchannel_raw(), rpmsg_upref_sleepers is called
>>
>> At this point nothing happens for 15 seconds, because rpmsg_xmit_done is
>> never called again to wake up the waitqueue that would call get_a_tx_buf().
>
> yes, this is the behaviour that i've been seeing:
>
> virtio emits a few times:
> [  404.XXXXXX] No more buffers in queue
> and then
> [  404.588722] remoteproc remoteproc0: vq index 1 is interrupted
> [  404.597249] virtqueue interrupt with no work for 6d53f13a

It is difficult to follow the scheduling from this trace. Could you enable
ftrace to observe it (mailbox interrupts, rpmsg and virtio functions)?

>
>> If yes, what is not clear to me is that wait_event_interruptible_timeout()
>> seems to test the condition (so calls get_a_tx_buf()) before going to
>> sleep [1]. A free TX buffer should be found at this step.
>>
>> [1] https://elixir.bootlin.com/linux/latest/source/include/linux/wait.h#L534
>
> hmm, interesting. but that would imply a timing issue where calling
> get_a_tx_buf twice somehow makes a difference, as opposed to calling it only
> once.
>
> would the "interrupt with no work" point to a different cause for the issue
> that we are seeing?

It looks like you have not released the buffer on the remote side, so there is
no buffer in the vring used list.
Do you use zero copy (rpmsg_hold_rx_buffer) on the remote processor?
If yes, please check that you have the following fix:
https://github.com/OpenAMP/open-amp/commit/2b6a38a9c6890869960ed41d6a1dc9930c079e0d
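
For reference, a minimal sketch of the zero-copy pattern on the remote side,
assuming the open-amp endpoint callback signature; the deferred-processing
helpers are hypothetical and only stand in for application code:

#include <openamp/rpmsg.h>

/* hypothetical application helpers, not part of open-amp */
extern void my_enqueue_for_processing(void *data, size_t len);

static int my_ept_cb(struct rpmsg_endpoint *ept, void *data, size_t len,
                     uint32_t src, void *priv)
{
        /* keep the rx buffer instead of copying it; it stays out of the
         * vring used list until it is explicitly released */
        rpmsg_hold_rx_buffer(ept, data);
        my_enqueue_for_processing(data, len);
        return RPMSG_SUCCESS;
}

static void my_processing_done(struct rpmsg_endpoint *ept, void *data)
{
        /* hand the buffer back so the host sees it as free again */
        rpmsg_release_rx_buffer(ept, data);
}

If the release never happens (or open-amp is missing the fix linked above),
held buffers never return to the vring, and as far as I understand the host
side then blocks in rpmsg_send_offchannel_raw() waiting for a TX buffer.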
>
>>
>> Regards,
>> Arnaud
>>
>>>
>>> after applying this patch we haven't been able to reproduce the 15s timeout
>>> anymore, whereas before we could easily reproduce it with certain workloads.
>>>
>>>> 3) This patch gets applied when rc1 comes out so that it has 6 or 7 weeks
>>>> to soak. No errors or locks are reported due to this patch during that
>>>> time.
>>>
>>> mentioning locks: i was a bit uncertain about a good way to implement the
>>> retry, since `rpmsg_upref_sleepers` and `get_a_tx_buf` both acquire the
>>> same mutex.
>>> i briefly considered adding `get_a_tx_buf` into `rpmsg_upref_sleepers` to
>>> avoid locking the same mutex multiple times, though it adds a bit of
>>> complexity to the implementation and harms readability a bit.
>>> are there any recommendations on this topic, or are (likely non-contended)
>>> locks not expensive enough to justify the added complexity?
>>>
>>> thanks,
>>> tim
>>>
>>>
>>>>
>>>>>
>>>>> Regards,
>>>>> Arnaud
>>>>>
>>>>>>
>>>>>> in this case, so we re-try once before we really start to sleep
>>>>>>
>>>>>> Signed-off-by: Tim Blechmann <tim@xxxxxxxxxx>
>>>>>> ---
>>>>>>  drivers/rpmsg/virtio_rpmsg_bus.c | 24 +++++++++++++++---------
>>>>>>  1 file changed, 15 insertions(+), 9 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/rpmsg/virtio_rpmsg_bus.c b/drivers/rpmsg/virtio_rpmsg_bus.c
>>>>>> index 905ac7910c98..2a9d42225e60 100644
>>>>>> --- a/drivers/rpmsg/virtio_rpmsg_bus.c
>>>>>> +++ b/drivers/rpmsg/virtio_rpmsg_bus.c
>>>>>> @@ -587,21 +587,27 @@ static int rpmsg_send_offchannel_raw(struct rpmsg_device *rpdev,
>>>>>>          /* no free buffer ? wait for one (but bail after 15 seconds) */
>>>>>>          while (!msg) {
>>>>>>                  /* enable "tx-complete" interrupts, if not already enabled */
>>>>>>                  rpmsg_upref_sleepers(vrp);
>>>>>> -                /*
>>>>>> -                 * sleep until a free buffer is available or 15 secs elapse.
>>>>>> -                 * the timeout period is not configurable because there's
>>>>>> -                 * little point in asking drivers to specify that.
>>>>>> -                 * if later this happens to be required, it'd be easy to add.
>>>>>> -                 */
>>>>>> -                err = wait_event_interruptible_timeout(vrp->sendq,
>>>>>> -                                        (msg = get_a_tx_buf(vrp)),
>>>>>> -                                        msecs_to_jiffies(15000));
>>>>>> +                /* make sure to retry to grab a tx buffer before we start waiting */
>>>>>> +                msg = get_a_tx_buf(vrp);
>>>>>> +                if (msg) {
>>>>>> +                        err = 0;
>>>>>> +                } else {
>>>>>> +                        /*
>>>>>> +                         * sleep until a free buffer is available or 15 secs elapse.
>>>>>> +                         * the timeout period is not configurable because there's
>>>>>> +                         * little point in asking drivers to specify that.
>>>>>> +                         * if later this happens to be required, it'd be easy to add.
>>>>>> +                         */
>>>>>> +                        err = wait_event_interruptible_timeout(vrp->sendq,
>>>>>> +                                                (msg = get_a_tx_buf(vrp)),
>>>>>> +                                                msecs_to_jiffies(15000));
>>>>>> +                }
>>>>>>                  /* disable "tx-complete" interrupts if we're the last sleeper */
>>>>>>                  rpmsg_downref_sleepers(vrp);
>>>>>>                  /* timeout ? */
>>>>>>                  if (!err) {
>>>>
>>>
>>
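
Regarding the locking question above (folding the retry into
rpmsg_upref_sleepers() so that vrp->tx_lock is only taken once): a rough,
untested sketch of what that could look like, mirroring the existing
rpmsg_upref_sleepers()/get_a_tx_buf() bodies in virtio_rpmsg_bus.c as I read
them. Whether avoiding a second, likely uncontended mutex_lock() justifies the
extra complexity is a separate question.

static void *rpmsg_upref_sleepers_and_get_buf(struct virtproc_info *vrp)
{
        unsigned int len;
        void *msg;

        mutex_lock(&vrp->tx_lock);

        /* are we the first sleeping context waiting for tx buffers? */
        if (atomic_inc_return(&vrp->sleepers) == 1)
                /* enable "tx-complete" interrupts before dozing off */
                virtqueue_enable_cb(vrp->svq);

        /*
         * retry under the same lock: a buffer released between the failed
         * get_a_tx_buf() and this point is picked up here instead of being
         * missed until the 15 s timeout
         */
        if (vrp->last_sbuf < vrp->num_bufs / 2)
                msg = vrp->sbufs + vrp->buf_size * vrp->last_sbuf++;
        else
                msg = virtqueue_get_buf(vrp->svq, &len);

        mutex_unlock(&vrp->tx_lock);

        return msg;
}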