Re: [PATCH 3/6] io_uring: add support for kernel registered bvecs

On Fri, Feb 07, 2025 at 02:08:23PM +0000, Pavel Begunkov wrote:
> On 2/3/25 15:45, Keith Busch wrote:
> >   		struct io_rsrc_node *node;
> >   		u64 tag = 0;
> > +		i = array_index_nospec(up->offset + done, ctx->buf_table.nr);
> > +		node = io_rsrc_node_lookup(&ctx->buf_table, i);
> > +		if (node && node->type != IORING_RSRC_BUFFER) {
> 
> We might need to rethink how it's unregistered. The next patch
> does it as a ublk command, but what happens if it gets ejected
> by someone else?  get_page might protect from kernel corruption
> and here you try to forbid ejections, but there is io_rsrc_data_free()
> and the io_uring ctx can die as well and it will have to drop it.

We prevent clearing a kernel registered index through the typical user
register update call. The expected way for a well-functioning program to
clear one is through the kernel interfaces.

Other than that, there's nothing special about kernel buffers here. You
can kill the ring or tear down the registered buffer table, but that
same scenario exists for user registered buffers. The only thing io_uring
needs to ensure is that nothing gets corrupted. User registered buffers
hold a pin on the user pages while the node is referenced. Kernel
registered buffers hold a page reference while the node is referenced.
Nothing special.
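
To make the lifetime rule above concrete, here is a minimal userspace
sketch. It is only a model, not the io_uring code: struct model_page,
struct model_node and the model_node_*() helpers are hypothetical
stand-ins, and plain ints replace the real reference counting. It shows
the pages staying referenced until the last node reference goes away,
even if the table slot is cleared while IO is still in flight.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct page: only the refcount matters here. */
struct model_page {
	int refcount;
};

/* Hypothetical stand-in for a registered buffer node. */
struct model_node {
	int refs;			/* table slot plus in-flight requests */
	size_t nr_pages;
	struct model_page **pages;
};

/* Register: take one reference per page, node starts with the table's ref. */
static struct model_node *model_node_register(struct model_page **pages,
					      size_t nr)
{
	struct model_node *node = malloc(sizeof(*node));
	size_t i;

	if (!node)
		return NULL;
	node->pages = malloc(nr * sizeof(*node->pages));
	if (!node->pages) {
		free(node);
		return NULL;
	}
	for (i = 0; i < nr; i++) {
		pages[i]->refcount++;
		node->pages[i] = pages[i];
	}
	node->nr_pages = nr;
	node->refs = 1;
	return node;
}

/* A request that uses the buffer takes its own node reference. */
static void model_node_get(struct model_node *node)
{
	node->refs++;
}

/* Drop a node reference; the page references go only with the last one. */
static void model_node_put(struct model_node *node)
{
	size_t i;

	assert(node->refs > 0);
	if (--node->refs)
		return;
	for (i = 0; i < node->nr_pages; i++)
		node->pages[i]->refcount--;
	free(node->pages);
	free(node);
}
```

In this model, clearing the table slot is just one model_node_put(); a
request that took model_node_get() earlier keeps the pages referenced
until it completes and drops its own reference.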

> And then you don't really have clear ownership rules. Does ublk
> release the block request and "return ownership" over pages to
> its user while io_uring is still dying and potentially has some
> IO inflight against it?
> 
> That's why I liked more the option to allow removing buffers from
> the table as per usual io_uring api / rules instead of a separate
> unregister ublk cmd. 

ublk is the only entity that knows about the struct request that
provides the bvec we want to use for zero-copy, so it has to be ublk
that handles the registration. Moving the unregister outside of that
breaks the symmetry and requires an indirect call.

> And inside, when all node refs are dropped,
> it'd call back to ublk. This way you have a single mechanism of
> how buffers are dropped from io_uring perspective. Thoughts?
>
> > +			err = -EBUSY;
> > +			break;
> > +		}
> > +
> >   		uvec = u64_to_user_ptr(user_data);
> >   		iov = iovec_from_user(uvec, 1, 1, &fast_iov, ctx->compat);
> >   		if (IS_ERR(iov)) {
> > @@ -258,6 +268,7 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
> >   			err = PTR_ERR(node);
> >   			break;
> >   		}
> ...
> > +int io_buffer_register_bvec(struct io_ring_ctx *ctx, const struct request *rq,
> > +			    unsigned int index)
> > +{
> > +	struct io_rsrc_data *data = &ctx->buf_table;
> > +	u16 nr_bvecs = blk_rq_nr_phys_segments(rq);
> > +	struct req_iterator rq_iter;
> > +	struct io_rsrc_node *node;
> > +	struct bio_vec bv;
> > +	int i = 0;
> > +
> > +	lockdep_assert_held(&ctx->uring_lock);
> > +
> > +	if (WARN_ON_ONCE(!data->nr))
> > +		return -EINVAL;
> 
> IIUC you can trigger all these from the user space, so they
> can't be warnings. Likely same goes for unregister*()

It helped with debugging, but sure, the warns don't need to be there.
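
As a rough illustration of the same checks without the warnings, the
userspace sketch below returns the error codes quietly. struct
model_table and model_check_slot() are hypothetical names for this
example only; the fields merely mirror the nr/nodes usage in the quoted
hunk.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the registered buffer table. */
struct model_table {
	unsigned int nr;	/* number of slots */
	void **nodes;		/* NULL means the slot is free */
};

/*
 * Validate a slot for registration. Every failure here can be caused
 * by user space, so each one is a plain error return, not a warning.
 */
static int model_check_slot(const struct model_table *data,
			    unsigned int index)
{
	if (!data->nr)
		return -EINVAL;		/* no table registered */
	if (index >= data->nr)
		return -EINVAL;		/* index out of range */
	if (data->nodes[index])
		return -EBUSY;		/* slot already occupied */
	return 0;
}
```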

> > +	if (WARN_ON_ONCE(index >= data->nr))
> > +		return -EINVAL;
> > +
> > +	node = data->nodes[index];
> > +	if (WARN_ON_ONCE(node))
> > +		return -EBUSY;
> > +
> > +	node = io_buffer_alloc_node(ctx, nr_bvecs, blk_rq_bytes(rq));
> > +	if (!node)
> > +		return -ENOMEM;
> > +
> > +	rq_for_each_bvec(bv, rq, rq_iter) {
> > +		get_page(bv.bv_page);
> > +		node->buf->bvec[i].bv_page = bv.bv_page;
> > +		node->buf->bvec[i].bv_len = bv.bv_len;
> > +		node->buf->bvec[i].bv_offset = bv.bv_offset;
> 
> bvec_set_page() should be more convenient

Indeed.
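
For reference, the loop would then look roughly like the userspace model
below. The kernel helper takes the bvec, the page, the length and the
offset; everything prefixed model_ here is a hypothetical stand-in so
the sketch compiles outside the kernel, and the refcount increment
stands in for get_page().

```c
#include <stddef.h>

/* Hypothetical stand-in for struct page. */
struct model_page {
	int refcount;
};

/* Hypothetical stand-in with the same three fields the patch fills in. */
struct model_bio_vec {
	struct model_page *bv_page;
	unsigned int bv_len;
	unsigned int bv_offset;
};

/* Same argument order as the kernel helper: bvec, page, len, offset. */
static void model_bvec_set_page(struct model_bio_vec *bv,
				struct model_page *page,
				unsigned int len, unsigned int offset)
{
	bv->bv_page = page;
	bv->bv_len = len;
	bv->bv_offset = offset;
}

/* Copy nr segments, taking one page reference per segment. */
static size_t model_copy_bvecs(struct model_bio_vec *dst,
			       const struct model_bio_vec *src, size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		src[i].bv_page->refcount++;	/* models get_page() */
		model_bvec_set_page(&dst[i], src[i].bv_page,
				    src[i].bv_len, src[i].bv_offset);
	}
	return i;
}
```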

> > +		i++;
> > +	}
> > +	data->nodes[index] = node;
> > +
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(io_buffer_register_bvec);
> > +
> 
> ...
> >   			unsigned long seg_skip;
> > diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
> > index abd0d5d42c3e1..d1d90d9cd2b43 100644
> > --- a/io_uring/rsrc.h
> > +++ b/io_uring/rsrc.h
> > @@ -13,6 +13,7 @@
> >   enum {
> >   	IORING_RSRC_FILE		= 0,
> >   	IORING_RSRC_BUFFER		= 1,
> > +	IORING_RSRC_KBUF		= 2,
> 
> The name "kbuf" is already used, to avoid confusion let's rename it.
> Ming called it leased buffers before, I think it's a good name.

These are just fixed buffers, like the user space ones. The only
difference is where the buffer comes from: the kernel or userspace. I
don't see what the term "lease" has to do with this.



