On 10/23/24 8:38 PM, hexue wrote:
> On 9/25/2024 12:12, Pavel Begunkov wrote:
>> I don't have a strong opinion on the feature, but the open question
>> we should get some decision on is whether it's really well applicable to
>> a good enough set of apps / workloads, if it'll even be useful in the
>> future and/or for other vendors, and if the merit outweighs extra
>> 8 bytes + 1 flag per io_kiocb and the overhead of 1-2 static key'able
>> checks in hot paths.
>
> IMHO, releasing some of the CPU resources during the polling
> process may be appropriate for some performance bottlenecks
> caused by CPU resource constraints, such as in database
> applications: in addition to completing IO operations, the CPU
> also needs to process data, e.g. compression and decompression.
> In a high-concurrency state, not only does polling take up a lot
> of CPU time, but operations like calculation and processing also
> need to compete for CPU time. In this case, the performance of
> the application may be difficult to improve.
>
> The MultiRead interface of RocksDB has been adapted to io_uring,
> so I used db_bench to construct a situation with high CPU pressure
> and compared the performance. The test configuration is as follows:
>
> -------------------------------------------------------------------
> CPU Model    Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
> CPU Cores    8
> Memory       16G
> SSD          Samsung PM9A3
> -------------------------------------------------------------------
>
> Test case:
> ./db_bench --benchmarks=multireadrandom,stats
> --duration=60
> --threads=4/8/16
> --use_direct_reads=true
> --db=/mnt/rocks/test_db
> --wal_dir=/mnt/rocks/test_db
> --key_size=4
> --value_size=4096
> -cache_size=0
> -use_existing_db=1
> -batch_size=256
> -multiread_batched=true
> -multiread_stride=0
>
> ------------------------------------------------------
> Test result:
>            original    optimized
> threads    ops/sec     ops/sec      CPU Utilization
> 16         139300      189075       100%*8
> 8          138639      133191       90%*8
> 4          71475       68361        90%*8
> ------------------------------------------------------
>
> When the number of threads exceeds the number of CPU cores, the
> database throughput does not increase significantly. However,
> hybrid polling can release some CPU resources during the polling
> process, so that part of the CPU time can be used for data
> processing and other operations, which speeds up the reading
> process, thereby improving throughput and optimizing database
> performance. I tried different compression strategies and got
> results similar to the above table (~30% throughput improvement).
>
> As more database applications adapt to the io_uring engine, I
> think hybrid poll may have potential in some scenarios.

Thanks for posting some numbers on that part, that's useful. I do think
the feature is useful as well, but I still have some issues with the
implementation. Below is an incremental patch on top of yours to
resolve some of those, potentially. Issues:

1) The patch still reads a bit like a hack, in that it doesn't seem to
care about following the current style. This reads a bit lazy/sloppy or
unfinished. I've fixed that up.

2) Appropriate member and function naming.

3) Same as above, it doesn't care about proper placement of additions
to structs. Again this is a bit lazy and wasteful - attention should be
paid to where additions are placed, so as not to needlessly bloat
structures or place members in cache-unfortunate locations. For
example, available_time is just placed at the end of io_ring_ctx, why?
It's a submission side member, and there's room with other related
members. Not only is the placement now where you'd want it to be,
memory wise, it also doesn't add 8 bytes to io_ring_ctx.

4) Like the above, the io_kiocb bloat is, by far, the worst. Seems to
me that we can share space with the polling hash_node (see the
standalone sketch of the union trick below). This obviously needs
careful vetting, I haven't done that yet - IOPOLL setups should not be
using poll at all, but this needs extra checking. The poll_state can go
with cancel_seq_set, as there's a hole there anyway. And just like
that, rather than add 24 bytes to io_kiocb, it doesn't take any extra
space at all.

5) HY_POLL is a terrible name. It's associated with IOPOLL, so let's
please use a name related to that. And require IOPOLL to be set along
with HYBRID_IOPOLL, as it's really a variant of that. This makes it
clear that HYBRID_IOPOLL is just a mode of operation for IOPOLL, and
that it can't exist without it.
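To make the space math in 3) and 4) concrete, here is a minimal
standalone userspace sketch of the union trick. It's illustrative only:
the fake_* type names are made-up stand-ins for the kernel structs, and
the sizes assume a 64-bit build. The two hybrid poll timestamps overlay
the 16 bytes already reserved for the poll hash_node, and poll_state
slots into the padding hole after cancel_seq_set:

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* stand-in for the kernel's struct hlist_node: two pointers, 16 bytes
 * on 64-bit */
struct fake_hlist_node {
	void	*next;
	void	**pprev;
};

struct fake_req {
	bool	cancel_seq_set;
	bool	poll_state;	/* fits in the existing hole, costs nothing */
	union {
		/* classic poll requests sit on a hash bucket */
		struct fake_hlist_node	hash_node;
		/* hybrid IOPOLL requests reuse the same 16 bytes */
		struct {
			uint64_t	iopoll_start;
			uint64_t	iopoll_end;
		};
	};
};

int main(void)
{
	/* both union members are 16 bytes, so the union adds no size */
	static_assert(sizeof(struct fake_hlist_node) == 2 * sizeof(uint64_t),
		      "timestamps must fit in the hash node");
	printf("union member size: %zu\n", sizeof(struct fake_hlist_node));
	printf("struct size:       %zu\n", sizeof(struct fake_req));
	return 0;
}

The sharing is only safe if a request can never be hashed for poll and
doing hybrid IOPOLL at the same time, which is exactly the vetting
mentioned in 4).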
Please take a look at this incremental and test it, and then post a v9
that looks a lot more finished. Caveat - I haven't tested this one at
all.

Thanks!

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index c79ee9fe86d4..6cf6a45835e5 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -238,6 +238,8 @@ struct io_ring_ctx {
 	struct io_rings			*rings;
 	struct percpu_ref		refs;
 
+	u64				poll_available_time;
+
 	clockid_t			clockid;
 	enum tk_offsets			clock_offset;
 
@@ -433,9 +435,6 @@ struct io_ring_ctx {
 	struct page			**sqe_pages;
 
 	struct page			**cq_wait_page;
-
-	/* for io_uring hybrid poll*/
-	u64				available_time;
 };
 
 struct io_tw_state {
@@ -647,9 +646,22 @@ struct io_kiocb {
 	atomic_t			refs;
 	bool				cancel_seq_set;
+	bool				poll_state;
 	struct io_task_work		io_task_work;
-	/* for polled requests, i.e. IORING_OP_POLL_ADD and async armed poll */
-	struct hlist_node		hash_node;
+	union {
+		/*
+		 * for polled requests, i.e. IORING_OP_POLL_ADD and async armed
+		 * poll
+		 */
+		struct hlist_node	hash_node;
+		/*
+		 * For IOPOLL setup queues, with hybrid polling
+		 */
+		struct {
+			u64		iopoll_start;
+			u64		iopoll_end;
+		};
+	};
 	/* internal polling, see IORING_FEAT_FAST_POLL */
 	struct async_poll		*apoll;
 	/* opcode allocated if it needs to store data for async defer */
@@ -665,10 +677,6 @@ struct io_kiocb {
 		u64			extra1;
 		u64			extra2;
 	} big_cqe;
-	/* for io_uring hybrid iopoll */
-	bool				poll_state;
-	u64				iopoll_start;
-	u64				iopoll_end;
 };
 
 struct io_overflow_cqe {
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 034997a1e507..5a290a56af6c 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -199,7 +199,7 @@ enum io_uring_sqe_flags_bit {
  * Removes indirection through the SQ index array.
  */
 #define IORING_SETUP_NO_SQARRAY		(1U << 16)
-#define IORING_SETUP_HY_POLL		(1U << 17)
+#define IORING_SETUP_HYBRID_IOPOLL	(1U << 17)
 
 enum io_uring_op {
 	IORING_OP_NOP,
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 9631e10d681b..35071442fb70 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -307,7 +307,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 		goto err;
 
 	ctx->flags = p->flags;
-	ctx->available_time = LLONG_MAX;
+	ctx->poll_available_time = LLONG_MAX;
 	atomic_set(&ctx->cq_wait_nr, IO_CQ_WAKE_INIT);
 	init_waitqueue_head(&ctx->sqo_sq_wait);
 	INIT_LIST_HEAD(&ctx->sqd_list);
@@ -3646,6 +3646,11 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
 	ctx->clockid = CLOCK_MONOTONIC;
 	ctx->clock_offset = 0;
 
+	/* HYBRID_IOPOLL only valid with IOPOLL */
+	if ((ctx->flags & (IORING_SETUP_IOPOLL|IORING_SETUP_HYBRID_IOPOLL)) ==
+	    IORING_SETUP_HYBRID_IOPOLL)
+		return -EINVAL;
+
 	if (!(ctx->flags & IORING_SETUP_NO_SQARRAY))
 		static_branch_inc(&io_key_has_sqarray);
 
@@ -3808,7 +3813,7 @@ static long io_uring_setup(u32 entries, struct io_uring_params __user *params)
 			IORING_SETUP_SQE128 | IORING_SETUP_CQE32 |
 			IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN |
 			IORING_SETUP_NO_MMAP | IORING_SETUP_REGISTERED_FD_ONLY |
-			IORING_SETUP_NO_SQARRAY | IORING_SETUP_HY_POLL))
+			IORING_SETUP_NO_SQARRAY | IORING_SETUP_HYBRID_IOPOLL))
 		return -EINVAL;
 
 	return io_uring_create(entries, &p, params);
diff --git a/io_uring/rw.c b/io_uring/rw.c
index b86cef10ed72..6f7bf40df85a 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -782,13 +782,6 @@ static bool need_complete_io(struct io_kiocb *req)
 		S_ISBLK(file_inode(req->file)->i_mode);
 }
 
-static void init_hybrid_poll(struct io_kiocb *req)
-{
-	/* make sure every req only block once*/
-	req->poll_state = false;
-	req->iopoll_start = ktime_get_ns();
-}
-
 static int io_rw_init_file(struct io_kiocb *req, fmode_t mode, int rw_type)
 {
 	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
@@ -826,8 +819,11 @@ static int io_rw_init_file(struct io_kiocb *req, fmode_t mode, int rw_type)
 		kiocb->ki_flags |= IOCB_HIPRI;
 		kiocb->ki_complete = io_complete_rw_iopoll;
 		req->iopoll_completed = 0;
-		if (ctx->flags & IORING_SETUP_HY_POLL)
-			init_hybrid_poll(req);
+		if (ctx->flags & IORING_SETUP_HYBRID_IOPOLL) {
+			/* make sure every req only blocks once */
+			req->poll_state = false;
+			req->iopoll_start = ktime_get_ns();
+		}
 	} else {
 		if (kiocb->ki_flags & IOCB_HIPRI)
 			return -EINVAL;
@@ -1126,27 +1122,24 @@ void io_rw_fail(struct io_kiocb *req)
 	io_req_set_res(req, res, req->cqe.flags);
 }
 
-static int io_uring_classic_poll(struct io_kiocb *req,
-		struct io_comp_batch *iob, unsigned int poll_flags)
+static int io_uring_classic_poll(struct io_kiocb *req, struct io_comp_batch *iob,
+				 unsigned int poll_flags)
 {
-	int ret;
 	struct file *file = req->file;
 
 	if (req->opcode == IORING_OP_URING_CMD) {
 		struct io_uring_cmd *ioucmd;
 
 		ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
-		ret = file->f_op->uring_cmd_iopoll(ioucmd, iob,
-						poll_flags);
+		return file->f_op->uring_cmd_iopoll(ioucmd, iob, poll_flags);
 	} else {
 		struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
 
-		ret = file->f_op->iopoll(&rw->kiocb, iob, poll_flags);
+		return file->f_op->iopoll(&rw->kiocb, iob, poll_flags);
 	}
-	return ret;
 }
 
-static u64 io_delay(struct io_ring_ctx *ctx, struct io_kiocb *req)
+static u64 io_hybrid_iopoll_delay(struct io_ring_ctx *ctx, struct io_kiocb *req)
 {
 	struct hrtimer_sleeper timer;
 	enum hrtimer_mode mode;
@@ -1156,11 +1149,11 @@ static u64 io_delay(struct io_ring_ctx *ctx, struct io_kiocb *req)
 	if (req->poll_state)
 		return 0;
 
-	if (ctx->available_time == LLONG_MAX)
+	if (ctx->poll_available_time == LLONG_MAX)
 		return 0;
 
-	/* Using half running time to do schedul */
-	sleep_time = ctx->available_time / 2;
+	/* Sleep for half of the shortest observed completion time */
+	sleep_time = ctx->poll_available_time / 2;
 	kt = ktime_set(0, sleep_time);
 
 	req->poll_state = true;
@@ -1177,7 +1170,6 @@ static u64 io_delay(struct io_ring_ctx *ctx, struct io_kiocb *req)
 	hrtimer_cancel(&timer.timer);
 	__set_current_state(TASK_RUNNING);
 	destroy_hrtimer_on_stack(&timer.timer);
-
 	return sleep_time;
 }
 
@@ -1185,19 +1177,21 @@ static int io_uring_hybrid_poll(struct io_kiocb *req,
 				struct io_comp_batch *iob, unsigned int poll_flags)
 {
 	struct io_ring_ctx *ctx = req->ctx;
-	int ret;
 	u64 runtime, sleep_time;
+	int ret;
 
-	sleep_time = io_delay(ctx, req);
+	sleep_time = io_hybrid_iopoll_delay(ctx, req);
 	ret = io_uring_classic_poll(req, iob, poll_flags);
 	req->iopoll_end = ktime_get_ns();
 	runtime = req->iopoll_end - req->iopoll_start - sleep_time;
 
-	/* use minimize sleep time if there are different speed
-	 * drivers, it could get more completions from fast one
+	/*
+	 * Use minimum sleep time if we're polling devices with different
+	 * latencies. We could get more completions from the faster ones.
 	 */
-	if (ctx->available_time > runtime)
-		ctx->available_time = runtime;
+	if (ctx->poll_available_time > runtime)
+		ctx->poll_available_time = runtime;
+
 	return ret;
 }
 
@@ -1227,7 +1221,7 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
 		if (READ_ONCE(req->iopoll_completed))
 			break;
 
-		if (ctx->flags & IORING_SETUP_HY_POLL)
+		if (ctx->flags & IORING_SETUP_HYBRID_IOPOLL)
 			ret = io_uring_hybrid_poll(req, &iob, poll_flags);
 		else
 			ret = io_uring_classic_poll(req, &iob, poll_flags);
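For completeness, here is roughly how an application would opt in from
userspace once this lands. This is a hypothetical, untested liburing
snippet; it assumes liburing headers that know about the
IORING_SETUP_HYBRID_IOPOLL name from the incremental above:

#include <liburing.h>
#include <stdio.h>

int main(void)
{
	struct io_uring ring;
	int ret;

	/* hybrid polling is a mode of IOPOLL, so both flags are required */
	ret = io_uring_queue_init(8, &ring,
				  IORING_SETUP_IOPOLL |
				  IORING_SETUP_HYBRID_IOPOLL);
	if (ret < 0) {
		fprintf(stderr, "queue_init: %d\n", ret);
		return 1;
	}

	/*
	 * Passing IORING_SETUP_HYBRID_IOPOLL without IORING_SETUP_IOPOLL
	 * would instead fail with -EINVAL, per the io_uring_create() check
	 * added above.
	 */
	io_uring_queue_exit(&ring);
	return 0;
}

As with any IOPOLL ring, the actual I/O would still need to target a
file opened with O_DIRECT on a device configured for polled completions.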
-- 
Jens Axboe