On 9/5/23 03:33, Tim Blechmann wrote: >>>> when we cannot get a tx buffer (`get_a_tx_buf`) `rpmsg_upref_sleepers` >>>> enables tx-complete interrupt. >>>> however if the interrupt is executed after `get_a_tx_buf` and before >>>> `rpmsg_upref_sleepers` we may mis the tx-complete interrupt and sleep >>>> for the full 15 seconds. >>> >>> >>> Is there any reason why your co-processor is unable to release the TX RPMSG >>> buffers for 15 seconds? If not, you should first determine the reason why it is >>> stalled. >> >> Arnaud's concern is valid. If the remote processor can't consume a buffer >> within 15 seconds, something is probably wrong. >> >> That said, I believe your assesment of the situation is correct. *If* the TX >> callback is disabled and there is no buffer available, there is a window of >> opportunity between calls to get_a_tx_buf() and rpmsg_upref_sleepers() for an >> interrupt to arrive in function rpmsg_send_offchannel_raw(). > > the remote processor certainly releases the tx buffer and according to my > tracing the `vring_interrupt` fires immediately before `rpmsg_send` enters the > `rpmsg_upref_sleepers`. If I well understood your point, the issue occur in following race condition - all the Tx buffers are used - in rpmsg_send_offchannel_raw() function, we try to get a buffer using get_a_tx_buf(vrp) that returns NULL - rpmsg_xmit_done is called as a Tx buffer is released by the remote processor and now free - in rpmsg_send_offchannel_raw() rpmsg_upref_sleepers is called At this point you are nothing happen until 15 second because rpmsg_xmit_done is never called again that would wake up the waitqueue to call get_a_tx_buf() I'm right? If yes what is not clear to me is that wait_event_interruptible_timeout() seems to test the condition (so call get_a_tx_buf()) before entering in sleep[1]. A free TX buffer should be found at this step. [1]https://elixir.bootlin.com/linux/latest/source/include/linux/wait.h#L534 Regards, Arnaud > > after applying this patch we haven't been able to reproduce the 15s timeout > anymore, whereas before we could easily reproduce it with certain workloads. > >> 3) This patch gets applied when rc1 comes out so that it has 6 or 7 weeks to >> soak. No error are locks are reported due to this patch during that time. > > mentioning locks: i was a bit uncertain about a good way to implement the retry, > since both `rpmsg_upref_sleepers` and `get_a_tx_buf` both acquire the same > mutex. i briefly considered to add `get_a_tx_buf` into `rpmsg_upref_sleepers` to > avoid locking the same mutex multiple times, though it adds a bit of complexity > to the implementation and harms readability a bit. > are there any recommendations on this topic or are (likely non-contended) locks > not expensive enough to justify the added complexity? > > thanks, > tim > > >> >>> >>> Regards, >>> Arnaud >>> >>>> >>>> in this case, so we re-try once before we really start to sleep >>>> >>>> Signed-off-by: Tim Blechmann <tim@xxxxxxxxxx> >>>> --- >>>> drivers/rpmsg/virtio_rpmsg_bus.c | 24 +++++++++++++++--------- >>>> 1 file changed, 15 insertions(+), 9 deletions(-) >>>> >>>> diff --git a/drivers/rpmsg/virtio_rpmsg_bus.c >>>> b/drivers/rpmsg/virtio_rpmsg_bus.c >>>> index 905ac7910c98..2a9d42225e60 100644 >>>> --- a/drivers/rpmsg/virtio_rpmsg_bus.c >>>> +++ b/drivers/rpmsg/virtio_rpmsg_bus.c >>>> @@ -587,21 +587,27 @@ static int rpmsg_send_offchannel_raw(struct >>>> rpmsg_device *rpdev, >>>> /* no free buffer ? wait for one (but bail after 15 seconds) */ >>>> while (!msg) { >>>> /* enable "tx-complete" interrupts, if not already enabled */ >>>> rpmsg_upref_sleepers(vrp); >>>> - /* >>>> - * sleep until a free buffer is available or 15 secs elapse. >>>> - * the timeout period is not configurable because there's >>>> - * little point in asking drivers to specify that. >>>> - * if later this happens to be required, it'd be easy to add. >>>> - */ >>>> - err = wait_event_interruptible_timeout(vrp->sendq, >>>> - (msg = get_a_tx_buf(vrp)), >>>> - msecs_to_jiffies(15000)); >>>> + /* make sure to retry to grab tx buffer before we start waiting */ >>>> + msg = get_a_tx_buf(vrp); >>>> + if (msg) { >>>> + err = 0; >>>> + } else { >>>> + /* >>>> + * sleep until a free buffer is available or 15 secs elapse. >>>> + * the timeout period is not configurable because there's >>>> + * little point in asking drivers to specify that. >>>> + * if later this happens to be required, it'd be easy to add. >>>> + */ >>>> + err = wait_event_interruptible_timeout(vrp->sendq, >>>> + (msg = get_a_tx_buf(vrp)), >>>> + msecs_to_jiffies(15000)); >>>> + } >>>> /* disable "tx-complete" interrupts if we're the last sleeper */ >>>> rpmsg_downref_sleepers(vrp); >>>> /* timeout ? */ >>>> if (!err) { >> >