On Mon, 4 Sept 2023 at 19:33, Tim Blechmann <tim@xxxxxxxxxx> wrote: > > >>> when we cannot get a tx buffer (`get_a_tx_buf`) `rpmsg_upref_sleepers` > >>> enables tx-complete interrupt. > >>> however if the interrupt is executed after `get_a_tx_buf` and before > >>> `rpmsg_upref_sleepers` we may mis the tx-complete interrupt and sleep > >>> for the full 15 seconds. > >> > >> > >> Is there any reason why your co-processor is unable to release the TX RPMSG > >> buffers for 15 seconds? If not, you should first determine the reason why it is > >> stalled. > > > > Arnaud's concern is valid. If the remote processor can't consume a buffer > > within 15 seconds, something is probably wrong. > > > > That said, I believe your assesment of the situation is correct. *If* the TX > > callback is disabled and there is no buffer available, there is a window of > > opportunity between calls to get_a_tx_buf() and rpmsg_upref_sleepers() for an > > interrupt to arrive in function rpmsg_send_offchannel_raw(). > > the remote processor certainly releases the tx buffer and according to > my tracing the `vring_interrupt` fires immediately before `rpmsg_send` > enters the `rpmsg_upref_sleepers`. > > after applying this patch we haven't been able to reproduce the 15s > timeout anymore, whereas before we could easily reproduce it with > certain workloads. > > > 3) This patch gets applied when rc1 comes out so that it has 6 or 7 weeks to > > soak. No error are locks are reported due to this patch during that time. > > mentioning locks: i was a bit uncertain about a good way to implement > the retry, since both `rpmsg_upref_sleepers` and `get_a_tx_buf` both > acquire the same mutex. i briefly considered to add `get_a_tx_buf` into > `rpmsg_upref_sleepers` to avoid locking the same mutex multiple times, > though it adds a bit of complexity to the implementation and harms > readability a bit. The current implementation looks good to me. > are there any recommendations on this topic or are (likely > non-contended) locks not expensive enough to justify the added complexity? > Since we are changing how buffers are delivered to applications in a highly concurrent environment, it is possible for this patch to introduce unintended side effects, something I want to avoid as much as possible. I am not worried about contentions for the vrp->tx_lock. > thanks, > tim > > > > > >> > >> Regards, > >> Arnaud > >> > >>> > >>> in this case, so we re-try once before we really start to sleep > >>> > >>> Signed-off-by: Tim Blechmann <tim@xxxxxxxxxx> > >>> --- > >>> drivers/rpmsg/virtio_rpmsg_bus.c | 24 +++++++++++++++--------- > >>> 1 file changed, 15 insertions(+), 9 deletions(-) > >>> > >>> diff --git a/drivers/rpmsg/virtio_rpmsg_bus.c b/drivers/rpmsg/virtio_rpmsg_bus.c > >>> index 905ac7910c98..2a9d42225e60 100644 > >>> --- a/drivers/rpmsg/virtio_rpmsg_bus.c > >>> +++ b/drivers/rpmsg/virtio_rpmsg_bus.c > >>> @@ -587,21 +587,27 @@ static int rpmsg_send_offchannel_raw(struct rpmsg_device *rpdev, > >>> > >>> /* no free buffer ? wait for one (but bail after 15 seconds) */ > >>> while (!msg) { > >>> /* enable "tx-complete" interrupts, if not already enabled */ > >>> rpmsg_upref_sleepers(vrp); > >>> > >>> - /* > >>> - * sleep until a free buffer is available or 15 secs elapse. > >>> - * the timeout period is not configurable because there's > >>> - * little point in asking drivers to specify that. > >>> - * if later this happens to be required, it'd be easy to add. > >>> - */ > >>> - err = wait_event_interruptible_timeout(vrp->sendq, > >>> - (msg = get_a_tx_buf(vrp)), > >>> - msecs_to_jiffies(15000)); > >>> + /* make sure to retry to grab tx buffer before we start waiting */ > >>> + msg = get_a_tx_buf(vrp); > >>> + if (msg) { > >>> + err = 0; > >>> + } else { > >>> + /* > >>> + * sleep until a free buffer is available or 15 secs elapse. > >>> + * the timeout period is not configurable because there's > >>> + * little point in asking drivers to specify that. > >>> + * if later this happens to be required, it'd be easy to add. > >>> + */ > >>> + err = wait_event_interruptible_timeout(vrp->sendq, > >>> + (msg = get_a_tx_buf(vrp)), > >>> + msecs_to_jiffies(15000)); > >>> + } > >>> > >>> /* disable "tx-complete" interrupts if we're the last sleeper */ > >>> rpmsg_downref_sleepers(vrp); > >>> > >>> /* timeout ? */ > >>> if (!err) { > > >