On 2/26/24 7:24 AM, Pavel Begunkov wrote:
> On 2/26/24 13:42, Jens Axboe wrote:
>> On 2/26/24 3:59 AM, Dylan Yudaken wrote:
>>> On Sun, Feb 25, 2024 at 12:46 AM Jens Axboe <axboe@xxxxxxxxx> wrote:
>>>>
>>>> If we have more data pending, we know we're going to do one more loop.
>>>> If that's the case, then set MSG_MORE to inform the networking stack
>>>> that there's more data coming shortly for this socket.
>>>>
>>>> Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
>>>> ---
>>>>  io_uring/net.c | 10 +++++++---
>>>>  1 file changed, 7 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/io_uring/net.c b/io_uring/net.c
>>>> index 240b8eff1a78..07307dd5a077 100644
>>>> --- a/io_uring/net.c
>>>> +++ b/io_uring/net.c
>>>> @@ -519,6 +519,10 @@ int io_sendmsg(struct io_kiocb *req, unsigned int issue_flags)
>>>>  	if (!io_check_multishot(req, issue_flags))
>>>>  		return io_setup_async_msg(req, kmsg, issue_flags);
>>>>
>>>> +	flags = sr->msg_flags;
>>>> +	if (issue_flags & IO_URING_F_NONBLOCK)
>>>> +		flags |= MSG_DONTWAIT;
>>>> +
>>>>  retry_multishot:
>>>>  	if (io_do_buffer_select(req)) {
>>>>  		void __user *buf;
>>>> @@ -528,12 +532,12 @@ int io_sendmsg(struct io_kiocb *req, unsigned int issue_flags)
>>>>  		if (!buf)
>>>>  			return -ENOBUFS;
>>>>
>>>> +		if ((req->flags & (REQ_F_BL_EMPTY|REQ_F_APOLL_MULTISHOT)) ==
>>>> +		    REQ_F_APOLL_MULTISHOT)
>>>> +			flags |= MSG_MORE;
>>>>  		iov_iter_ubuf(&kmsg->msg.msg_iter, ITER_SOURCE, buf, len);
>>>>  	}
>>>
>>> This feels racy. I don't have an exact sequence in mind, but I believe
>>> there are cases where, between the two calls to __sys_sendmsg_sock,
>>> another submission could be issued and drain the buffer list.
>>> I guess the result would be that the packet is never sent out, but I
>>> have not followed the codepaths of MSG_MORE.
>>
>> This is true, but that race always exists depending on who gets to go
>> first (the adding of the buffer, or the send itself). The way I see it,
>> when the send is issued we're making the guarantee that we're going to
>> at least deplete the queue as it looks when entered. If more is added
>> while it's being processed, we _may_ see it.
>>
>> Outside of that, we don't want it to potentially run in perpetuity. It
>> may actually be a good idea to make the rule of "just issue what was
>> there when first seen/issued" a hard one, though I don't think it's
>> really worth doing. But making any guarantees on buffers added in
>> parallel will be impossible. If you do that, then you have to deal with
>> figuring out what's left in the queue once you get a completion without
>> CQE_F_MORE.
>>
>>> The obvious other way to trigger this codepath is if the user messes
>>> with the ring by decrementing the buffer counter. I do not believe
>>> there are any nefarious outcomes - but just to point out that
>>> REQ_F_BL_EMPTY is essentially user controlled.
>>
>> The user may certainly shoot himself in the foot. As long as that
>> doesn't lead to a nefarious outcome, then that's not a concern. For this
>> case, the head is kernel local and the user can only write to the tail.
>> So we could have a case of the user fiddling with the tail, and when we
>> grab the next buffer (and the previous one did not have REQ_F_BL_EMPTY
>> set), the ring will indeed appear to be empty. At that point you get an
>> -ENOBUFS without CQE_F_MORE set.
>
> A side note, don't forget that there are other protocols apart
> from TCP. AFAIK UDP corking will pack it into a single datagram,
> which is not the same as two separate sends.
Yeah, I should really have labeled this one as a test/RFC kind of patch. I
wasn't even convinced we want to do this unconditionally for TCP. I'll just
leave it at the end for now; it's a separate discussion imho, and that's why
it was left as a separate patch rather than being bundled with the multishot
send support in general.
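Since the corking behavior is the crux of Pavel's point, here's a minimal
standalone userspace sketch of the difference (purely illustrative, not from
the patch; the loopback address, port, and payloads are made up). On UDP,
MSG_MORE merges consecutive sends into a single datagram, whereas on TCP it
only hints that transmission may be delayed:

#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>

int main(void)
{
	/* assumed test target: UDP to 127.0.0.1:9999 */
	int fd = socket(AF_INET, SOCK_DGRAM, 0);
	struct sockaddr_in dst = {
		.sin_family = AF_INET,
		.sin_port = htons(9999),
	};

	inet_pton(AF_INET, "127.0.0.1", &dst.sin_addr);
	connect(fd, (struct sockaddr *)&dst, sizeof(dst));

	/*
	 * With MSG_MORE, "hello" is corked and merged with the following
	 * send: the receiver sees one 10-byte datagram, not two 5-byte
	 * ones. Two plain send() calls would produce two datagrams.
	 */
	send(fd, "hello", 5, MSG_MORE);
	send(fd, "world", 5, 0);

	close(fd);
	return 0;
}

That kind of silent merging is exactly why this shouldn't be applied blindly
outside of TCP.

--
Jens Axboe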