On 3/9/20 11:21 AM, Andres Freund wrote:
> Hi,
>
> On 2020-02-28 13:30:50 -0700, Jens Axboe wrote:
>> If a server process has tons of pending socket connections, generally
>> it uses epoll to wait for activity. When the socket is ready for reading
>> (or writing), the task can select a buffer and issue a recv/send on the
>> given fd.
>>
>> Now that we have fast (non-async thread) support, a task can have tons
>> of reads or writes pending. But that means they need buffers to back
>> that data, and if the number of connections is high enough, having them
>> preallocated for all possible connections is unfeasible.
>>
>> With IORING_OP_PROVIDE_BUFFERS, an application can register buffers to
>> use for any request. The request then sets IOSQE_BUFFER_SELECT in the
>> sqe, and a given group ID in sqe->buf_group. When the fd becomes ready,
>> a free buffer from the specified group is selected. If none are
>> available, the request is terminated with -ENOBUFS. If successful, the
>> CQE on completion will contain the buffer ID chosen in the cqe->flags
>> member, encoded as:
>>
>>   (buffer_id << IORING_CQE_BUFFER_SHIFT) | IORING_CQE_F_BUFFER;
>>
>> Once a buffer has been consumed by a request, it is no longer available
>> and must be registered again with IORING_OP_PROVIDE_BUFFERS.
>>
>> Requests need to support this feature. For now, IORING_OP_READ and
>> IORING_OP_RECV support it. This is checked on SQE submission; a CQE with
>> res == -EINVAL will be posted if attempted on unsupported requests.
>
> Why not EOPNOTSUPP or such? Makes it more feasible for applications to
> handle the case separately.

Good point, I can make that change.

>> +static int io_rw_common_cflags(struct io_kiocb *req)
>> +{
>> +        struct io_buffer *kbuf = (struct io_buffer *) req->rw.addr;
>> +        int cflags;
>> +
>> +        cflags = kbuf->bid << IORING_CQE_BUFFER_SHIFT;
>> +        cflags |= IORING_CQE_F_BUFFER;
>> +        req->rw.addr = 0;
>> +        kfree(kbuf);
>> +        return cflags;
>> +}
>
>>          if (refcount_dec_and_test(&req->refs) &&
>> @@ -1819,13 +1860,16 @@ static inline void req_set_fail_links(struct io_kiocb *req)
>>  static void io_complete_rw_common(struct kiocb *kiocb, long res)
>>  {
>>          struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw.kiocb);
>> +        int cflags = 0;
>>
>>          if (kiocb->ki_flags & IOCB_WRITE)
>>                  kiocb_end_write(req);
>>
>>          if (res != req->result)
>>                  req_set_fail_links(req);
>> -        io_cqring_add_event(req, res);
>> +        if (req->flags & REQ_F_BUFFER_SELECTED)
>> +                cflags = io_rw_common_cflags(req);
>> +        __io_cqring_add_event(req, res, cflags);
>>  }
>
> Besides the naming already commented upon by Pavel, I'm also wondering
> if it's the right thing to call this unconditionally from
> io_complete_*rw*_common() - hard to see how this feature would ever be
> used in the write path...

Doesn't really matter I think, I'd rather have that dead branch for writes
than need a separate handler. I did change the naming; this posting is
almost two weeks old. I'll change the other little bits from here and post
a new series so we're all on the same page.
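For reference, the application side of that cflags encoding is meant to
look roughly like the below. Untested sketch: handle_cqe() is just a
made-up name, and it assumes the uapi additions from this series
(IORING_CQE_F_BUFFER and IORING_CQE_BUFFER_SHIFT) plus <errno.h>:

static void handle_cqe(struct io_uring_cqe *cqe)
{
        if (cqe->res == -ENOBUFS) {
                /* group ran dry, re-provide with IORING_OP_PROVIDE_BUFFERS */
                return;
        }
        if (cqe->flags & IORING_CQE_F_BUFFER) {
                unsigned int bid = cqe->flags >> IORING_CQE_BUFFER_SHIFT;

                /*
                 * cqe->res bytes landed in the buffer with ID 'bid'. Once
                 * the data has been consumed, the buffer has to be handed
                 * back via IORING_OP_PROVIDE_BUFFERS before it can be
                 * selected again.
                 */
        }
}

Since a selected buffer is gone from the group once the request completes,
the re-provide step isn't optional if the application wants to keep
recycling the same pool.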
>> +static struct io_buffer *io_buffer_select(struct io_kiocb *req, int gid,
>> +                                          struct io_buffer *kbuf,
>> +                                          bool needs_lock)
>> +{
>> +        struct list_head *list;
>> +
>> +        if (req->flags & REQ_F_BUFFER_SELECTED)
>> +                return kbuf;
>> +
>> +        /*
>> +         * "Normal" inline submissions always hold the uring_lock, since we
>> +         * grab it from the system call. Same is true for the SQPOLL offload.
>> +         * The only exception is when we've detached the request and issue it
>> +         * from an async worker thread, grab the lock for that case.
>> +         */
>> +        if (needs_lock)
>> +                mutex_lock(&req->ctx->uring_lock);
>> +
>> +        lockdep_assert_held(&req->ctx->uring_lock);
>
> This comment is in a few places, perhaps there's a way to unify by
> placing the conditional acquisition into a helper?

We could have an io_lock_ring(ctx, force_nonblock) helper and just put it
in there, ditto for the unlock.

>> +        list = idr_find(&req->ctx->io_buffer_idr, gid);
>> +        if (list && !list_empty(list)) {
>> +                kbuf = list_first_entry(list, struct io_buffer, list);
>> +                list_del(&kbuf->list);
>> +        } else {
>> +                kbuf = ERR_PTR(-ENOBUFS);
>> +        }
>> +
>> +        if (needs_lock)
>> +                mutex_unlock(&req->ctx->uring_lock);
>> +
>> +        return kbuf;
>> +}
>> +
>>  static ssize_t io_import_iovec(int rw, struct io_kiocb *req,
>> -                               struct iovec **iovec, struct iov_iter *iter)
>> +                               struct iovec **iovec, struct iov_iter *iter,
>> +                               bool needs_lock)
>>  {
>>          void __user *buf = u64_to_user_ptr(req->rw.addr);
>>          size_t sqe_len = req->rw.len;
>> @@ -2140,12 +2219,30 @@ static ssize_t io_import_iovec(int rw, struct io_kiocb *req,
>>                  return io_import_fixed(req, rw, iter);
>>          }
>>
>> -        /* buffer index only valid with fixed read/write */
>> -        if (req->rw.kiocb.private)
>> +        /* buffer index only valid with fixed read/write, or buffer select */
>> +        if (req->rw.kiocb.private && !(req->flags & REQ_F_BUFFER_SELECT))
>>                  return -EINVAL;
>>
>>          if (opcode == IORING_OP_READ || opcode == IORING_OP_WRITE) {
>>                  ssize_t ret;
>> +
>> +                if (req->flags & REQ_F_BUFFER_SELECT) {
>> +                        struct io_buffer *kbuf = (struct io_buffer *) req->rw.addr;
>> +                        int gid;
>> +
>> +                        gid = (int) (unsigned long) req->rw.kiocb.private;
>> +                        kbuf = io_buffer_select(req, gid, kbuf, needs_lock);
>> +                        if (IS_ERR(kbuf)) {
>> +                                *iovec = NULL;
>> +                                return PTR_ERR(kbuf);
>> +                        }
>> +                        req->rw.addr = (u64) kbuf;
>> +                        if (sqe_len > kbuf->len)
>> +                                sqe_len = kbuf->len;
>> +                        req->flags |= REQ_F_BUFFER_SELECTED;
>> +                        buf = u64_to_user_ptr(kbuf->addr);
>> +                }
>
> Feels a bit dangerous to have addr sometimes pointing to the user
> specified data, and sometimes to kernel data. Even if indicated by
> REQ_F_BUFFER_SELECTED.

It's not ideal, but it's either that or have the struct io_rw blow over the
cacheline, which I don't want. So the tradeoff seemed like the right one to
me. All the initial io_kiocb per-request-type unions are 64 bytes or less.

>> +static struct io_buffer *io_recv_buffer_select(struct io_kiocb *req,
>> +                                               int *cflags, bool needs_lock)
>> +{
>> +        struct io_sr_msg *sr = &req->sr_msg;
>> +        struct io_buffer *kbuf;
>> +
>> +        if (!(req->flags & REQ_F_BUFFER_SELECT))
>> +                return NULL;
>> +
>> +        kbuf = io_buffer_select(req, sr->gid, sr->kbuf, needs_lock);
>> +        if (IS_ERR(kbuf))
>> +                return kbuf;
>> +
>> +        sr->kbuf = kbuf;
>> +        if (sr->len > kbuf->len)
>> +                sr->len = kbuf->len;
>> +        req->flags |= REQ_F_BUFFER_SELECTED;
>> +
>> +        *cflags = kbuf->bid << IORING_CQE_BUFFER_SHIFT;
>> +        *cflags |= IORING_CQE_F_BUFFER;
>> +        return kbuf;
>> +}
>
> Could more of this be moved into io_buffer_select? Looks like every
> REQ_F_BUFFER_SELECT supporting op is going to need most of it?

Probably could be, I'll take a look and see if we can move more of that
logic in there.

>>  static int io_recvmsg_prep(struct io_kiocb *req,
>>                             const struct io_uring_sqe *sqe)
>>  {
>
> Looks like this would be unused if !defined(CONFIG_NET)?

Also fixed a week ago or so.

-- 
Jens Axboe