Re: [PATCH 8/8] io_uring/net: set MSG_MORE if we're doing multishot send and have more

Pavel Begunkov <asml.silence@xxxxxxxxx> · Mon, 26 Feb 2024 14:24:50 +0000

On 2/26/24 13:42, Jens Axboe wrote:
On 2/26/24 3:59 AM, Dylan Yudaken wrote:
On Sun, Feb 25, 2024 at 12:46?AM Jens Axboe <axboe@xxxxxxxxx> wrote:

If we have more data pending, we know we're going to do one more loop.
If that's the case, then set MSG_MORE to inform the networking stack
that there's more data coming shortly for this socket.

Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
---
  io_uring/net.c | 10 +++++++---
  1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/io_uring/net.c b/io_uring/net.c
index 240b8eff1a78..07307dd5a077 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -519,6 +519,10 @@ int io_sendmsg(struct io_kiocb *req, unsigned int issue_flags)
         if (!io_check_multishot(req, issue_flags))
                 return io_setup_async_msg(req, kmsg, issue_flags);

+       flags = sr->msg_flags;
+       if (issue_flags & IO_URING_F_NONBLOCK)
+               flags |= MSG_DONTWAIT;
+
  retry_multishot:
         if (io_do_buffer_select(req)) {
                 void __user *buf;
@@ -528,12 +532,12 @@ int io_sendmsg(struct io_kiocb *req, unsigned int issue_flags)
                 if (!buf)
                         return -ENOBUFS;

+               if ((req->flags & (REQ_F_BL_EMPTY|REQ_F_APOLL_MULTISHOT)) ==
+                                  REQ_F_APOLL_MULTISHOT)
+                       flags |= MSG_MORE;
                 iov_iter_ubuf(&kmsg->msg.msg_iter, ITER_SOURCE, buf, len);
         }

This feels racy. I don't have an exact sequence in mind, but I believe
there are cases where between
the two calls to __sys_sendmsg_sock, another submission could be
issued and drain the buffer list.
I guess the result would be that the packet is never sent out, but I
have not followed the codepaths of MSG_MORE.

This is true, but that race always exists depending on how gets to go
first (the adding of the buffer, or the send itself). The way I see it,
when the send is issued we're making the guarantee that we're going to
at least deplete the queue as it looks when entered. If more is added
while it's being processed, we _may_ see it.

Outside of that, we don't want it to potentially run in perpetuity. It
may actually be a good idea to make the rule of "just issue what was
there when first seen/issued" a hard one, though I don't think it's
really worth doing. But making any guarantees on buffers added in
parallel will be impossible. If you do that, then you have to deal with
figuring out what's left in the queue once you get a completion withou
CQE_F_MORE.

The obvious other way to trigger this codepath is if the user messes
with the ring by decrementing
the buffer counter. I do not believe there are any nefarious outcomes
- but just to point out that
REQ_F_BL_EMPTY is essentially user controlled.

The user may certainly shoot himself in the foot. As long as that
doesn't lead to a nefarious outcome, then that's not a concern. For this
case, the head is kernel local, user can only write to the tail. So we
could have a case of user fiddling with the tail and when we grab the
next buffer (and the previous one did not have REQ_F_BL_EMPTY set), the
ring will indeed appear to be empty. At that point you get an -ENOBUFS
without CQE_F_MORE set.

A side note, don't forget that there are other protocols apart
from TCP. AFAIK UDP corking will pack it into a single datagram,
which is not the same as two separate sends.

--
Pavel Begunkov