Re: [PATCH V8 0/8] io_uring: support sqe group and leased group kbuf

On 10/29/24 11:04 AM, Jens Axboe wrote:
>> To make my position clear, I think the table approach will turn out
>> much better API-wise if the performance suffices, and we can only know
>> that experimentally. I tried that idea with sockets back then, and it
>> was looking good. It'd be great if someone tried to implement and
>> compare it, though I don't believe I should be the one trying it, so
>> maybe Ming or Jens can, especially since Jens already posted a couple
>> of series for problems standing in the way, i.e. global rsrc nodes and
>> late buffer binding. In any case, I'm not opposed to the series if
>> Jens decides to merge it.
> 
> With the rsrc node stuff sorted out, I was thinking last night that I
> should take another look at this. While that work was (mostly) done
> because of the lingering closes, it does nicely enable ephemeral buffers
> too.
> 
> I'll take a stab at it... While I would love to make progress on this
> feature proposed in this series, it's arguably more important to do it
> in such a way that we can live with it, long term.

Ming, here's another stab at this, see attached patch. It adds a
LOCAL_BUF opcode, which maps a user provided buffer to a io_rsrc_node
that opcodes can then use. The buffer is visible ONLY within a given
submission - in other words, only within a single io_uring_submit()
call. The buffer is provided at prep time, which means you don't need
to serialize with the LOCAL_BUF op itself. You can do:

sqe = io_uring_get_sqe(ring);
io_uring_prep_local_buf(sqe, buffer, length, tag);

sqe = io_uring_get_sqe(ring);
io_uring_prep_whatever_op_fixed(sqe, buffer, length, foo);

and have 'whatever' rely on the buffer either being there to use, or the
import failing with -EFAULT. No IOSQE_IO_LINK or similar is needed.
Obviously if you do:

sqe = io_uring_get_sqe(ring);
io_uring_prep_local_buf(sqe, buffer, length, tag);

sqe = io_uring_get_sqe(ring);
io_uring_prep_read_thing_fixed(sqe, buffer, length, foo);
sqe->flags |= IOSQE_IO_LINK;

sqe = io_uring_get_sqe(ring);
io_uring_prep_write_thing_fixed(sqe, buffer, length, foo);

then the filling of the buffer and whoever uses the filled buffer will
need to be serialized, to ensure the buffer content is valid for the
write.

Any opcode using the ephemeral/local buffer will need to grab a
reference to it, just like what is done for normal registered buffers.
If assigned to req->rsrc_node, then it'll be put as part of normal
completion. Hence no special handling is needed for this. The reference
that submit holds is assigned by LOCAL_BUF, and will be put when
submission ends. Hence there's no requirement that opcodes finish before
submit ends; they have their own ref.

All of that should make sense, I think. I'm attaching the most basic of
test apps I wrote to test this, as well as using:

diff --git a/io_uring/rw.c b/io_uring/rw.c
index 30448f343c7f..89662f305342 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -338,7 +338,10 @@ static int io_prep_rw_fixed(struct io_kiocb *req, const struct io_uring_sqe *sqe
 	if (unlikely(ret))
 		return ret;
 
-	node = io_rsrc_node_lookup(&ctx->buf_table, req->buf_index);
+	if (ctx->submit_state.rsrc_node != rsrc_empty_node)
+		node = ctx->submit_state.rsrc_node;
+	else
+		node = io_rsrc_node_lookup(&ctx->buf_table, req->buf_index);
 	if (!node)
 		return -EFAULT;
 	io_req_assign_rsrc_node(req, node);

just so I could test it with normal read/write fixed and do a zero copy
read/write operation where the write file ends up with the data from the
read file.

When you run the test app, you should see:

axboe@m2max-kvm ~/g/liburing (reg-wait)> test/local-buf.t
buffer 0xaaaada34a000
res=0, ud=0x1
res=4096, ud=0x2
res=4096, ud=0x3
res=0, ud=0xaaaada34a000

which shows LOCAL_BUF completing first, then a 4k read, then a 4k write,
and finally the notification for the buffer being done. The test app
sets the tag to the buffer address; it could obviously be anything you
want.

Now, this implementation requires a user buffer, and as far as I'm told,
you currently have kernel buffers on the ublk side. There's absolutely
no reason why kernel buffers cannot work, we'd most likely just need to
add a IORING_RSRC_KBUFFER type to handle that. My question here is how
hard is this requirement? The reason I ask is that it's much simpler to
work with userspace buffers. Yes, the current implementation maps them
every time; we could certainly change that, but I don't see it being an
issue. It's really no different than O_DIRECT, and you only
need to map them once for a read + whatever number of writes you'd need
to do. If a 'tag' is provided for LOCAL_BUF, it'll post a CQE whenever
that buffer is unmapped. This is a notification for the application that
it's done using the buffer. For a pure kernel buffer, we'd either need
to be able to reference it (so that we KNOW it's not going away) and/or
have a callback associated with the buffer.

Would it be possible for ublk to require the user side to register a
range of memory that should be used for the write buffers, such that
they could be mapped in the kernel instead? Maybe this memory is already
registered as such? I don't know all the details of the ublk zero copy,
but I would imagine there's some flexibility here in terms of how it
gets setup.

ublk would then need to add opcodes that utilize LOCAL_BUF for this,
obviously. As it stands, with the patch, nobody can access these
buffers, we'd need a READ_LOCAL_FIXED etc to have opcodes be able to
access them. But that should be fine, you need specific opcodes for zero
copy anyway. You can probably even reuse existing opcodes, and just add
something like IORING_URING_CMD_LOCAL as a flag rather than
IORING_URING_CMD_FIXED that is used now for registered buffers.

Let me know what you think. Like I mentioned, this is just a rough
patch. It does work, though, and it is safe, but it obviously only
handles userspace memory right now. It sits on top of my io_uring-rsrc
branch, which rewrites the rsrc handling.

-- 
Jens Axboe
From 01591be7d66879618fb8f8965141ac24e9334068 Mon Sep 17 00:00:00 2001
From: Jens Axboe <axboe@xxxxxxxxx>
Date: Tue, 29 Oct 2024 12:00:48 -0600
Subject: [PATCH] io_uring: add support for an ephemeral per-submit buffer

Use the rewritten rsrc node management to provide a buffer that's local
to this submission only, it'll get put when done. Opcodes will need
special support for utilizing the buffer, rather than grabbing a
registered buffer from the normal ring buffer table.

The buffer is purely submission wide, it only exists within that
submission. It is provided at prep time, so users of the buffer need
not use serializing IOSQE_IO_LINK to rely on being able to use it.
Obviously, if multiple requests use the same buffer and need
serialization between them, those dependencies must be expressed.

Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
---
 include/linux/io_uring_types.h |  1 +
 include/uapi/linux/io_uring.h  |  1 +
 io_uring/io_uring.c            |  2 ++
 io_uring/opdef.c               |  7 +++++++
 io_uring/rsrc.c                | 30 ++++++++++++++++++++++++++++++
 io_uring/rsrc.h                |  3 +++
 6 files changed, 44 insertions(+)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index c283179b0c89..0ce155374016 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -208,6 +208,7 @@ struct io_submit_state {
 	bool			need_plug;
 	bool			cq_flush;
 	unsigned short		submit_nr;
+	struct io_rsrc_node	*rsrc_node;
 	struct blk_plug		plug;
 };
 
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index ce58c4590de6..a7d0aaf6daf5 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -259,6 +259,7 @@ enum io_uring_op {
 	IORING_OP_FTRUNCATE,
 	IORING_OP_BIND,
 	IORING_OP_LISTEN,
+	IORING_OP_LOCAL_BUF,
 
 	/* this goes last, obviously */
 	IORING_OP_LAST,
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 3a535e9e8ac3..d517d6a0fd39 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2205,6 +2205,7 @@ static void io_submit_state_end(struct io_ring_ctx *ctx)
 		io_queue_sqe_fallback(state->link.head);
 	/* flush only after queuing links as they can generate completions */
 	io_submit_flush_completions(ctx);
+	io_put_rsrc_node(state->rsrc_node);
 	if (state->plug_started)
 		blk_finish_plug(&state->plug);
 }
@@ -2220,6 +2221,7 @@ static void io_submit_state_start(struct io_submit_state *state,
 	state->submit_nr = max_ios;
 	/* set only head, no need to init link_last in advance */
 	state->link.head = NULL;
+	state->rsrc_node = rsrc_empty_node;
 }
 
 static void io_commit_sqring(struct io_ring_ctx *ctx)
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index 3de75eca1c92..ae18e403a7bc 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -515,6 +515,10 @@ const struct io_issue_def io_issue_defs[] = {
 		.prep			= io_eopnotsupp_prep,
 #endif
 	},
+	[IORING_OP_LOCAL_BUF] = {
+		.prep			= io_local_buf_prep,
+		.issue			= io_local_buf,
+	},
 };
 
 const struct io_cold_def io_cold_defs[] = {
@@ -744,6 +748,9 @@ const struct io_cold_def io_cold_defs[] = {
 	[IORING_OP_LISTEN] = {
 		.name			= "LISTEN",
 	},
+	[IORING_OP_LOCAL_BUF] = {
+		.name			= "LOCAL_BUF",
+	},
 };
 
 const char *io_uring_get_opcode(u8 opcode)
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 6e30679175aa..9621ba533b35 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -1069,3 +1069,33 @@ int io_register_clone_buffers(struct io_ring_ctx *ctx, void __user *arg)
 		fput(file);
 	return ret;
 }
+
+int io_local_buf_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+	struct io_ring_ctx *ctx = req->ctx;
+	struct io_submit_state *state = &ctx->submit_state;
+	struct page *last_hpage = NULL;
+	struct io_rsrc_node *node;
+	struct iovec iov;
+	__u64 tag;
+
+	if (state->rsrc_node != rsrc_empty_node)
+		return -EBUSY;
+
+	iov.iov_base = u64_to_user_ptr(READ_ONCE(sqe->addr));
+	iov.iov_len = READ_ONCE(sqe->len);
+	tag = READ_ONCE(sqe->addr2);
+
+	node = io_sqe_buffer_register(ctx, &iov, &last_hpage);
+	if (IS_ERR(node))
+		return PTR_ERR(node);
+
+	node->tag = tag;
+	state->rsrc_node = node;
+	return 0;
+}
+
+int io_local_buf(struct io_kiocb *req, unsigned int issue_flags)
+{
+	return IOU_OK;
+}
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index c9f42491c747..be9b490c400e 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -141,4 +141,7 @@ static inline void __io_unaccount_mem(struct user_struct *user,
 	atomic_long_sub(nr_pages, &user->locked_vm);
 }
 
+int io_local_buf_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
+int io_local_buf(struct io_kiocb *req, unsigned int issue_flags);
+
 #endif
-- 
2.45.2

/* SPDX-License-Identifier: MIT */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#include "liburing.h"
#include "helpers.h"

int main(int argc, char *argv[])
{
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	struct io_uring ring;
	void *buffer;
	int ret, in_fd, out_fd, i;
	char bufr[64], bufw[64];

	if (posix_memalign(&buffer, 4096, 4096))
		return 1;

	printf("buffer %p\n", buffer);

	sprintf(bufr, ".local-buf-read.%d", getpid());
	t_create_file_pattern(bufr, 4096, 0x5a);

	sprintf(bufw, ".local-buf-write.%d", getpid());
	t_create_file_pattern(bufw, 4096, 0x00);

	in_fd = open(bufr, O_RDONLY | O_DIRECT);
	if (in_fd < 0) {
		perror("open");
		return 1;
	}

	out_fd = open(bufw, O_WRONLY | O_DIRECT);
	if (out_fd < 0) {
		perror("open");
		return 1;
	}

	io_uring_queue_init(8, &ring, IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN);

	/* add local buf */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_rw(IORING_OP_LOCAL_BUF, sqe, 0, buffer, 4096, (unsigned long) buffer);
	sqe->user_data = 1;

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_read_fixed(sqe, in_fd, buffer, 4096, 0, 0);
	sqe->flags |= IOSQE_IO_LINK;
	sqe->user_data = 2;

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_write_fixed(sqe, out_fd, buffer, 4096, 0, 0);
	sqe->user_data = 3;
	
	ret = io_uring_submit(&ring);
	if (ret != 3) {
		fprintf(stderr, "submit: %d\n", ret);
		return 1;
	}

	for (i = 0; i < 4; i++) {
		ret = io_uring_wait_cqe(&ring, &cqe);
		if (ret) {
			fprintf(stderr, "wait: %d\n", ret);
			return 1;
		}
		printf("res=%d, ud=0x%lx\n", cqe->res, (long) cqe->user_data);
		io_uring_cqe_seen(&ring, cqe);
	}

	io_uring_queue_exit(&ring);
	close(in_fd);
	close(out_fd);
	unlink(bufr);
	unlink(bufw);
	free(buffer);
	return 0;
}
