Re: [PATCH 1/1] rpmsg: virtio_rpmsg_bus - prevent possible race condition

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On 9/9/23 08:28, Tim Blechmann wrote:
>>>>>> when we cannot get a tx buffer (`get_a_tx_buf`) `rpmsg_upref_sleepers`
>>>>>> enables tx-complete interrupt.
>>>>>> however if the interrupt is executed after `get_a_tx_buf` and before
>>>>>> `rpmsg_upref_sleepers` we may mis the tx-complete interrupt and sleep
>>>>>> for the full 15 seconds.
>>>>>
>>>>>
>>>>> Is there any reason why your co-processor is unable to release the TX RPMSG
>>>>> buffers for 15 seconds? If not, you should first determine the reason why
>>>>> it is
>>>>> stalled.
>>>>
>>>> Arnaud's concern is valid.  If the remote processor can't consume a buffer
>>>> within 15 seconds, something is probably wrong.
>>>>
>>>> That said, I believe your assesment of the situation is correct.  *If* the TX
>>>> callback is disabled and there is no buffer available, there is a window of
>>>> opportunity between calls to get_a_tx_buf() and rpmsg_upref_sleepers() for an
>>>> interrupt to arrive in function rpmsg_send_offchannel_raw().
>>>
>>> the remote processor certainly releases the tx buffer and according to my
>>> tracing the `vring_interrupt` fires immediately before `rpmsg_send` enters the
>>> `rpmsg_upref_sleepers`.
>>
>>
>> If I well understood your point, the issue occur in following race condition
>>
>> - all the Tx buffers are used
>> - in rpmsg_send_offchannel_raw() function, we try to get a buffer using
>> get_a_tx_buf(vrp) that returns NULL
>> - rpmsg_xmit_done is called as a Tx buffer is released by the remote processor
>>    and now free
>> - in rpmsg_send_offchannel_raw() rpmsg_upref_sleepers is called
>>
>> At this point you are nothing happen until 15 second because rpmsg_xmit_done is
>> never called  again that would wake up the waitqueue to call get_a_tx_buf()
> 
> yes, this is the behaviour that i've been seeing:
> 
> virtio emits a few times:
> [  404.XXXXXX] No more buffers in queue
> and then
> [  404.588722] remoteproc remoteproc0: vq index 1 is interrupted
> [  404.597249] virtqueue interrupt with no work for 6d53f13a

Difficult to understand the scheduling with your trace. Could you enable ftrace to
observe it (mailbox interrupts,rpmsg and virtio functions)
> 
>> If yes what is not clear to me is that wait_event_interruptible_timeout() seems
>> to test the condition (so call get_a_tx_buf()) before entering in sleep[1]. A
>> free TX buffer should be found at this step.
>>
>> [1]https://elixir.bootlin.com/linux/latest/source/include/linux/wait.h#L534
> 
> hmm, interesting. but that would imply a timing issue where calling get_a_tx_buf
> twice somehow makes a difference as opposed to calling it only once.
> 
> would the "interrupt with no work" point to a different cause for the issue that
> we are seeing?

Look like you have not released the buffer on remote side so no buffer in vring
used list.

Do you use the zero copy (rpmsg_hold_rx_buffer) on remote processor?
If yes, please check that you have following fix:
https://github.com/OpenAMP/open-amp/commit/2b6a38a9c6890869960ed41d6a1dc9930c079e0d

> 
>>
>> Regards,
>> Arnaud
>>
>>>
>>> after applying this patch we haven't been able to reproduce the 15s timeout
>>> anymore, whereas before we could easily reproduce it with certain workloads.
>>>
>>>> 3) This patch gets applied when rc1 comes out so that it has 6 or 7 weeks to
>>>> soak.  No error are locks are reported due to this patch during that time.
>>>
>>> mentioning locks: i was a bit uncertain about a good way to implement the retry,
>>> since both `rpmsg_upref_sleepers` and `get_a_tx_buf` both acquire the same
>>> mutex. i briefly considered to add `get_a_tx_buf` into `rpmsg_upref_sleepers` to
>>> avoid locking the same mutex multiple times, though it adds a bit of complexity
>>> to the implementation and harms readability a bit.
>>> are there any recommendations on this topic or are (likely non-contended) locks
>>> not expensive enough to justify the added complexity?
>>>
>>> thanks,
>>> tim
>>>
>>>
>>>>
>>>>>
>>>>> Regards,
>>>>> Arnaud
>>>>>
>>>>>>
>>>>>> in this case, so we re-try once before we really start to sleep
>>>>>>
>>>>>> Signed-off-by: Tim Blechmann <tim@xxxxxxxxxx>
>>>>>> ---
>>>>>>    drivers/rpmsg/virtio_rpmsg_bus.c | 24 +++++++++++++++---------
>>>>>>    1 file changed, 15 insertions(+), 9 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/rpmsg/virtio_rpmsg_bus.c
>>>>>> b/drivers/rpmsg/virtio_rpmsg_bus.c
>>>>>> index 905ac7910c98..2a9d42225e60 100644
>>>>>> --- a/drivers/rpmsg/virtio_rpmsg_bus.c
>>>>>> +++ b/drivers/rpmsg/virtio_rpmsg_bus.c
>>>>>> @@ -587,21 +587,27 @@ static int rpmsg_send_offchannel_raw(struct
>>>>>> rpmsg_device *rpdev,
>>>>>>          /* no free buffer ? wait for one (but bail after 15 seconds) */
>>>>>>        while (!msg) {
>>>>>>            /* enable "tx-complete" interrupts, if not already enabled */
>>>>>>            rpmsg_upref_sleepers(vrp);
>>>>>>    -        /*
>>>>>> -         * sleep until a free buffer is available or 15 secs elapse.
>>>>>> -         * the timeout period is not configurable because there's
>>>>>> -         * little point in asking drivers to specify that.
>>>>>> -         * if later this happens to be required, it'd be easy to add.
>>>>>> -         */
>>>>>> -        err = wait_event_interruptible_timeout(vrp->sendq,
>>>>>> -                    (msg = get_a_tx_buf(vrp)),
>>>>>> -                    msecs_to_jiffies(15000));
>>>>>> +        /* make sure to retry to grab tx buffer before we start waiting */
>>>>>> +        msg = get_a_tx_buf(vrp);
>>>>>> +        if (msg) {
>>>>>> +            err = 0;
>>>>>> +        } else {
>>>>>> +            /*
>>>>>> +             * sleep until a free buffer is available or 15 secs elapse.
>>>>>> +             * the timeout period is not configurable because there's
>>>>>> +             * little point in asking drivers to specify that.
>>>>>> +             * if later this happens to be required, it'd be easy to add.
>>>>>> +             */
>>>>>> +            err = wait_event_interruptible_timeout(vrp->sendq,
>>>>>> +                        (msg = get_a_tx_buf(vrp)),
>>>>>> +                        msecs_to_jiffies(15000));
>>>>>> +        }
>>>>>>              /* disable "tx-complete" interrupts if we're the last sleeper */
>>>>>>            rpmsg_downref_sleepers(vrp);
>>>>>>              /* timeout ? */
>>>>>>            if (!err) {
>>>>
>>>
>>



[Index of Archives]     [Linux Sound]     [ALSA Users]     [ALSA Devel]     [Linux Audio Users]     [Linux Media]     [Kernel]     [Photo Sharing]     [Gimp]     [Yosemite News]     [Linux Media]

  Powered by Linux