Re: [PATCH 1/1] rpmsg: virtio_rpmsg_bus - prevent possible race condition

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



when we cannot get a tx buffer (`get_a_tx_buf`) `rpmsg_upref_sleepers`
enables tx-complete interrupt.
however if the interrupt is executed after `get_a_tx_buf` and before
`rpmsg_upref_sleepers` we may mis the tx-complete interrupt and sleep
for the full 15 seconds.


Is there any reason why your co-processor is unable to release the TX RPMSG
buffers for 15 seconds? If not, you should first determine the reason why it is
stalled.

Arnaud's concern is valid.  If the remote processor can't consume a buffer
within 15 seconds, something is probably wrong.

That said, I believe your assesment of the situation is correct.  *If* the TX
callback is disabled and there is no buffer available, there is a window of
opportunity between calls to get_a_tx_buf() and rpmsg_upref_sleepers() for an
interrupt to arrive in function rpmsg_send_offchannel_raw().

the remote processor certainly releases the tx buffer and according to my
tracing the `vring_interrupt` fires immediately before `rpmsg_send` enters the
`rpmsg_upref_sleepers`.


If I well understood your point, the issue occur in following race condition

- all the Tx buffers are used
- in rpmsg_send_offchannel_raw() function, we try to get a buffer using
get_a_tx_buf(vrp) that returns NULL
- rpmsg_xmit_done is called as a Tx buffer is released by the remote processor
   and now free
- in rpmsg_send_offchannel_raw() rpmsg_upref_sleepers is called

At this point you are nothing happen until 15 second because rpmsg_xmit_done is
never called  again that would wake up the waitqueue to call get_a_tx_buf()

yes, this is the behaviour that i've been seeing:

virtio emits a few times:
[  404.XXXXXX] No more buffers in queue
and then
[  404.588722] remoteproc remoteproc0: vq index 1 is interrupted
[  404.597249] virtqueue interrupt with no work for 6d53f13a

If yes what is not clear to me is that wait_event_interruptible_timeout() seems
to test the condition (so call get_a_tx_buf()) before entering in sleep[1]. A
free TX buffer should be found at this step.

[1]https://elixir.bootlin.com/linux/latest/source/include/linux/wait.h#L534

hmm, interesting. but that would imply a timing issue where calling get_a_tx_buf twice somehow makes a difference as opposed to calling it only once.

would the "interrupt with no work" point to a different cause for the issue that we are seeing?


Regards,
Arnaud


after applying this patch we haven't been able to reproduce the 15s timeout
anymore, whereas before we could easily reproduce it with certain workloads.

3) This patch gets applied when rc1 comes out so that it has 6 or 7 weeks to
soak.  No error are locks are reported due to this patch during that time.

mentioning locks: i was a bit uncertain about a good way to implement the retry,
since both `rpmsg_upref_sleepers` and `get_a_tx_buf` both acquire the same
mutex. i briefly considered to add `get_a_tx_buf` into `rpmsg_upref_sleepers` to
avoid locking the same mutex multiple times, though it adds a bit of complexity
to the implementation and harms readability a bit.
are there any recommendations on this topic or are (likely non-contended) locks
not expensive enough to justify the added complexity?

thanks,
tim




Regards,
Arnaud


in this case, so we re-try once before we really start to sleep

Signed-off-by: Tim Blechmann <tim@xxxxxxxxxx>
---
   drivers/rpmsg/virtio_rpmsg_bus.c | 24 +++++++++++++++---------
   1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/drivers/rpmsg/virtio_rpmsg_bus.c
b/drivers/rpmsg/virtio_rpmsg_bus.c
index 905ac7910c98..2a9d42225e60 100644
--- a/drivers/rpmsg/virtio_rpmsg_bus.c
+++ b/drivers/rpmsg/virtio_rpmsg_bus.c
@@ -587,21 +587,27 @@ static int rpmsg_send_offchannel_raw(struct
rpmsg_device *rpdev,
         /* no free buffer ? wait for one (but bail after 15 seconds) */
       while (!msg) {
           /* enable "tx-complete" interrupts, if not already enabled */
           rpmsg_upref_sleepers(vrp);
   -        /*
-         * sleep until a free buffer is available or 15 secs elapse.
-         * the timeout period is not configurable because there's
-         * little point in asking drivers to specify that.
-         * if later this happens to be required, it'd be easy to add.
-         */
-        err = wait_event_interruptible_timeout(vrp->sendq,
-                    (msg = get_a_tx_buf(vrp)),
-                    msecs_to_jiffies(15000));
+        /* make sure to retry to grab tx buffer before we start waiting */
+        msg = get_a_tx_buf(vrp);
+        if (msg) {
+            err = 0;
+        } else {
+            /*
+             * sleep until a free buffer is available or 15 secs elapse.
+             * the timeout period is not configurable because there's
+             * little point in asking drivers to specify that.
+             * if later this happens to be required, it'd be easy to add.
+             */
+            err = wait_event_interruptible_timeout(vrp->sendq,
+                        (msg = get_a_tx_buf(vrp)),
+                        msecs_to_jiffies(15000));
+        }
             /* disable "tx-complete" interrupts if we're the last sleeper */
           rpmsg_downref_sleepers(vrp);
             /* timeout ? */
           if (!err) {



Attachment: OpenPGP_signature
Description: OpenPGP digital signature


[Index of Archives]     [Linux Sound]     [ALSA Users]     [ALSA Devel]     [Linux Audio Users]     [Linux Media]     [Kernel]     [Photo Sharing]     [Gimp]     [Yosemite News]     [Linux Media]

  Powered by Linux