On 3/9/20 11:21 AM, Andres Freund wrote:
> Hi,
>
> On 2020-02-28 13:30:50 -0700, Jens Axboe wrote:
>> If a server process has tons of pending socket connections, generally
>> it uses epoll to wait for activity. When the socket is ready for reading
>> (or writing), the task can select a buffer and issue a recv/send on the
>> given fd.
>>
>> Now that we have fast (non-async thread) support, a task can have tons
>> of reads or writes pending. But that means they need buffers to back
>> that data, and if the number of connections is high enough, having them
>> preallocated for all possible connections is unfeasible.
>>
>> With IORING_OP_PROVIDE_BUFFERS, an application can register buffers to
>> use for any request. The request then sets IOSQE_BUFFER_SELECT in the
>> sqe, and a given group ID in sqe->buf_group. When the fd becomes ready,
>> a free buffer from the specified group is selected. If none are
>> available, the request is terminated with -ENOBUFS. If successful, the
>> CQE on completion will contain the buffer ID chosen in the cqe->flags
>> member, encoded as:
>>
>>   (buffer_id << IORING_CQE_BUFFER_SHIFT) | IORING_CQE_F_BUFFER;
>>
>> Once a buffer has been consumed by a request, it is no longer available
>> and must be registered again with IORING_OP_PROVIDE_BUFFERS.
>>
>> Requests need to support this feature. For now, IORING_OP_READ and
>> IORING_OP_RECV support it. This is checked on SQE submission; a CQE with
>> res == -EINVAL will be posted if attempted on unsupported requests.
>
> Why not EOPNOTSUPP or such? Makes it more feasible for applications to
> handle the case separately.

Good point, I can make that change.

>> +static int io_rw_common_cflags(struct io_kiocb *req)
>> +{
>> +        struct io_buffer *kbuf = (struct io_buffer *) req->rw.addr;
>> +        int cflags;
>> +
>> +        cflags = kbuf->bid << IORING_CQE_BUFFER_SHIFT;
>> +        cflags |= IORING_CQE_F_BUFFER;
>> +        req->rw.addr = 0;
>> +        kfree(kbuf);
>> +        return cflags;
>> +}
>
>>          if (refcount_dec_and_test(&req->refs) &&
>> @@ -1819,13 +1860,16 @@ static inline void req_set_fail_links(struct io_kiocb *req)
>>  static void io_complete_rw_common(struct kiocb *kiocb, long res)
>>  {
>>          struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw.kiocb);
>> +        int cflags = 0;
>>
>>          if (kiocb->ki_flags & IOCB_WRITE)
>>                  kiocb_end_write(req);
>>
>>          if (res != req->result)
>>                  req_set_fail_links(req);
>> -        io_cqring_add_event(req, res);
>> +        if (req->flags & REQ_F_BUFFER_SELECTED)
>> +                cflags = io_rw_common_cflags(req);
>> +        __io_cqring_add_event(req, res, cflags);
>>  }
>
> Besides the naming already commented upon by Pavel, I'm also wondering
> if it's the right thing to call this unconditionally from
> io_complete_*rw*_common() - hard to see how this feature would ever be
> used in the write path...

Doesn't really matter I think, I'd rather have that dead branch for writes
than need a separate handler. I did change the naming; this posting is
almost two weeks old. I'll change the other little bits from here and post
a new series so we're all on the same page.
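For reference, the application side of that cflags encoding is meant to
look roughly like the below. Untested sketch: handle_cqe() is just a
made-up name, and it assumes the uapi additions from this series
(IORING_CQE_F_BUFFER and IORING_CQE_BUFFER_SHIFT) plus <errno.h>:

static void handle_cqe(struct io_uring_cqe *cqe)
{
        if (cqe->res == -ENOBUFS) {
                /* group ran dry, re-provide with IORING_OP_PROVIDE_BUFFERS */
                return;
        }
        if (cqe->flags & IORING_CQE_F_BUFFER) {
                unsigned int bid = cqe->flags >> IORING_CQE_BUFFER_SHIFT;

                /*
                 * cqe->res bytes landed in the buffer with ID 'bid'. Once
                 * the data has been consumed, the buffer has to be handed
                 * back via IORING_OP_PROVIDE_BUFFERS before it can be
                 * selected again.
                 */
        }
}

Since a selected buffer is gone from the group once the request completes,
the re-provide step isn't optional if the application wants to keep
recycling the same pool.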
>> +static struct io_buffer *io_buffer_select(struct io_kiocb *req, int gid,
>> +                                          struct io_buffer *kbuf,
>> +                                          bool needs_lock)
>> +{
>> +        struct list_head *list;
>> +
>> +        if (req->flags & REQ_F_BUFFER_SELECTED)
>> +                return kbuf;
>> +
>> +        /*
>> +         * "Normal" inline submissions always hold the uring_lock, since we
>> +         * grab it from the system call. Same is true for the SQPOLL offload.
>> +         * The only exception is when we've detached the request and issue it
>> +         * from an async worker thread, grab the lock for that case.
>> +         */
>> +        if (needs_lock)
>> +                mutex_lock(&req->ctx->uring_lock);
>> +
>> +        lockdep_assert_held(&req->ctx->uring_lock);
>
> This comment is in a few places, perhaps there's a way to unify by
> placing the conditional acquisition into a helper?

We could have an io_lock_ring(ctx, force_nonblock) helper and just put it
in there, ditto for the unlock.

>> +        list = idr_find(&req->ctx->io_buffer_idr, gid);
>> +        if (list && !list_empty(list)) {
>> +                kbuf = list_first_entry(list, struct io_buffer, list);
>> +                list_del(&kbuf->list);
>> +        } else {
>> +                kbuf = ERR_PTR(-ENOBUFS);
>> +        }
>> +
>> +        if (needs_lock)
>> +                mutex_unlock(&req->ctx->uring_lock);
>> +
>> +        return kbuf;
>> +}
>> +
>>  static ssize_t io_import_iovec(int rw, struct io_kiocb *req,
>> -                               struct iovec **iovec, struct iov_iter *iter)
>> +                               struct iovec **iovec, struct iov_iter *iter,
>> +                               bool needs_lock)
>>  {
>>          void __user *buf = u64_to_user_ptr(req->rw.addr);
>>          size_t sqe_len = req->rw.len;
>> @@ -2140,12 +2219,30 @@ static ssize_t io_import_iovec(int rw, struct io_kiocb *req,
>>                  return io_import_fixed(req, rw, iter);
>>          }
>>
>> -        /* buffer index only valid with fixed read/write */
>> -        if (req->rw.kiocb.private)
>> +        /* buffer index only valid with fixed read/write, or buffer select */
>> +        if (req->rw.kiocb.private && !(req->flags & REQ_F_BUFFER_SELECT))
>>                  return -EINVAL;
>>
>>          if (opcode == IORING_OP_READ || opcode == IORING_OP_WRITE) {
>>                  ssize_t ret;
>> +
>> +                if (req->flags & REQ_F_BUFFER_SELECT) {
>> +                        struct io_buffer *kbuf = (struct io_buffer *) req->rw.addr;
>> +                        int gid;
>> +
>> +                        gid = (int) (unsigned long) req->rw.kiocb.private;
>> +                        kbuf = io_buffer_select(req, gid, kbuf, needs_lock);
>> +                        if (IS_ERR(kbuf)) {
>> +                                *iovec = NULL;
>> +                                return PTR_ERR(kbuf);
>> +                        }
>> +                        req->rw.addr = (u64) kbuf;
>> +                        if (sqe_len > kbuf->len)
>> +                                sqe_len = kbuf->len;
>> +                        req->flags |= REQ_F_BUFFER_SELECTED;
>> +                        buf = u64_to_user_ptr(kbuf->addr);
>> +                }
>
> Feels a bit dangerous to have addr sometimes pointing to the user
> specified data, and sometimes to kernel data. Even if indicated by
> REQ_F_BUFFER_SELECTED.

It's not ideal, but it's either that or have the struct io_rw blow over the
cacheline, which I don't want. So the tradeoff seemed like the right one to
me. All the initial io_kiocb per-request-type unions are 64 bytes or less.

>> +static struct io_buffer *io_recv_buffer_select(struct io_kiocb *req,
>> +                                               int *cflags, bool needs_lock)
>> +{
>> +        struct io_sr_msg *sr = &req->sr_msg;
>> +        struct io_buffer *kbuf;
>> +
>> +        if (!(req->flags & REQ_F_BUFFER_SELECT))
>> +                return NULL;
>> +
>> +        kbuf = io_buffer_select(req, sr->gid, sr->kbuf, needs_lock);
>> +        if (IS_ERR(kbuf))
>> +                return kbuf;
>> +
>> +        sr->kbuf = kbuf;
>> +        if (sr->len > kbuf->len)
>> +                sr->len = kbuf->len;
>> +        req->flags |= REQ_F_BUFFER_SELECTED;
>> +
>> +        *cflags = kbuf->bid << IORING_CQE_BUFFER_SHIFT;
>> +        *cflags |= IORING_CQE_F_BUFFER;
>> +        return kbuf;
>> +}
>
> Could more of this be moved into io_buffer_select? Looks like every
> REQ_F_BUFFER_SELECT supporting op is going to need most of it?

Probably could be, I'll take a look and see if we can move more of that
logic in there.

>>  static int io_recvmsg_prep(struct io_kiocb *req,
>>                             const struct io_uring_sqe *sqe)
>>  {
>
> Looks like this would be unused if !defined(CONFIG_NET)?

Also fixed a week ago or so.

-- 
Jens Axboe