On 10/23/24 8:38 PM, hexue wrote:
> On 9/25/2024 12:12, Pavel Begunkov wrote:
>> I don't have a strong opinion on the feature, but the open question
>> we should get some decision on is whether it's really well applicable to
>> a good enough set of apps / workloads, if it'll even be useful in the
>> future and/or for other vendors, and if the merit outweighs extra
>> 8 bytes + 1 flag per io_kiocb and the overhead of 1-2 static key'able
>> checks in hot paths.
>
> IMHO, releasing some of the CPU resources during the polling
> process may be appropriate for some performance bottlenecks
> caused by CPU resource constraints, such as in database
> applications: in addition to completing IO operations, the CPU
> also needs to process data, e.g. compression and decompression.
> In a high-concurrency state, not only does polling take up a lot
> of CPU time, but operations like calculation and processing also
> need to compete for CPU time. In this case, the performance of
> the application may be difficult to improve.
>
> The MultiRead interface of RocksDB has been adapted to io_uring,
> so I used db_bench to construct a situation with high CPU pressure
> and compared the performance. The test configuration is as follows:
>
> -------------------------------------------------------------------
> CPU Model    Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
> CPU Cores    8
> Memory       16G
> SSD          Samsung PM9A3
> -------------------------------------------------------------------
>
> Test case:
> ./db_bench --benchmarks=multireadrandom,stats
> --duration=60
> --threads=4/8/16
> --use_direct_reads=true
> --db=/mnt/rocks/test_db
> --wal_dir=/mnt/rocks/test_db
> --key_size=4
> --value_size=4096
> -cache_size=0
> -use_existing_db=1
> -batch_size=256
> -multiread_batched=true
> -multiread_stride=0
>
> ------------------------------------------------------
> Test result:
>            original    optimized
> threads    ops/sec     ops/sec      CPU Utilization
> 16         139300      189075       100%*8
> 8          138639      133191       90%*8
> 4          71475       68361        90%*8
> ------------------------------------------------------
>
> When the number of threads exceeds the number of CPU cores, the
> database throughput does not increase significantly. However,
> hybrid polling can release some CPU resources during the polling
> process, so that part of the CPU time can be used for data
> processing and other operations, which speeds up the reading
> process, thereby improving throughput and optimizing database
> performance. I tried different compression strategies and got
> results similar to the above table (~30% throughput improvement).
>
> As more database applications adapt to the io_uring engine, I
> think hybrid poll may have potential in some scenarios.

Thanks for posting some numbers on that part, that's useful. I do think
the feature is useful as well, but I still have some issues with the
implementation. Below is an incremental patch on top of yours to
resolve some of those, potentially. Issues:

1) The patch still reads a bit like a hack, in that it doesn't seem to
care about following the current style. This reads a bit lazy/sloppy or
unfinished. I've fixed that up.

2) Appropriate member and function naming.

3) Same as above, it doesn't care about proper placement of additions
to structs. Again this is a bit lazy and wasteful - attention should be
paid to where additions are placed, so as not to needlessly bloat
structures or place members in cache-unfortunate locations. For
example, available_time is just placed at the end of io_ring_ctx, why?
It's a submission side member, and there's room with other related
members. Not only is the placement now where you'd want it to be,
memory wise, it also doesn't add 8 bytes to io_ring_ctx.

4) Like the above, the io_kiocb bloat is, by far, the worst. Seems to
me that we can share space with the polling hash_node (see the
standalone sketch of the union trick below). This obviously needs
careful vetting, I haven't done that yet - IOPOLL setups should not be
using poll at all, but this needs extra checking. The poll_state can go
with cancel_seq_set, as there's a hole there anyway. And just like
that, rather than add 24 bytes to io_kiocb, it doesn't take any extra
space at all.

5) HY_POLL is a terrible name. It's associated with IOPOLL, so let's
please use a name related to that. And require IOPOLL to be set along
with HYBRID_IOPOLL, as it's really a variant of that. This makes it
clear that HYBRID_IOPOLL is just a mode of operation for IOPOLL, and
that it can't exist without it.
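To make the space math in 3) and 4) concrete, here is a minimal
standalone userspace sketch of the union trick. It's illustrative only:
the fake_* type names are made-up stand-ins for the kernel structs, and
the sizes assume a 64-bit build. The two hybrid poll timestamps overlay
the 16 bytes already reserved for the poll hash_node, and poll_state
slots into the padding hole after cancel_seq_set:

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* stand-in for the kernel's struct hlist_node: two pointers, 16 bytes
 * on 64-bit */
struct fake_hlist_node {
	void	*next;
	void	**pprev;
};

struct fake_req {
	bool	cancel_seq_set;
	bool	poll_state;	/* fits in the existing hole, costs nothing */
	union {
		/* classic poll requests sit on a hash bucket */
		struct fake_hlist_node	hash_node;
		/* hybrid IOPOLL requests reuse the same 16 bytes */
		struct {
			uint64_t	iopoll_start;
			uint64_t	iopoll_end;
		};
	};
};

int main(void)
{
	/* both union members are 16 bytes, so the union adds no size */
	static_assert(sizeof(struct fake_hlist_node) == 2 * sizeof(uint64_t),
		      "timestamps must fit in the hash node");
	printf("union member size: %zu\n", sizeof(struct fake_hlist_node));
	printf("struct size:       %zu\n", sizeof(struct fake_req));
	return 0;
}

The sharing is only safe if a request can never be hashed for poll and
doing hybrid IOPOLL at the same time, which is exactly the vetting
mentioned in 4).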
Please take a look at this incremental and test it, and then post a v9
that looks a lot more finished. Caveat - I haven't tested this one at
all.

Thanks!

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index c79ee9fe86d4..6cf6a45835e5 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -238,6 +238,8 @@ struct io_ring_ctx {
 	struct io_rings			*rings;
 	struct percpu_ref		refs;
 
+	u64				poll_available_time;
+
 	clockid_t			clockid;
 	enum tk_offsets			clock_offset;
 
@@ -433,9 +435,6 @@ struct io_ring_ctx {
 	struct page			**sqe_pages;
 
 	struct page			**cq_wait_page;
-
-	/* for io_uring hybrid poll*/
-	u64				available_time;
 };
 
 struct io_tw_state {
@@ -647,9 +646,22 @@ struct io_kiocb {
 	atomic_t			refs;
 	bool				cancel_seq_set;
+	bool				poll_state;
 	struct io_task_work		io_task_work;
-	/* for polled requests, i.e. IORING_OP_POLL_ADD and async armed poll */
-	struct hlist_node		hash_node;
+	union {
+		/*
+		 * for polled requests, i.e. IORING_OP_POLL_ADD and async armed
+		 * poll
+		 */
+		struct hlist_node	hash_node;
+		/*
+		 * For IOPOLL setup queues, with hybrid polling
+		 */
+		struct {
+			u64		iopoll_start;
+			u64		iopoll_end;
+		};
+	};
 	/* internal polling, see IORING_FEAT_FAST_POLL */
 	struct async_poll		*apoll;
 	/* opcode allocated if it needs to store data for async defer */
@@ -665,10 +677,6 @@ struct io_kiocb {
 		u64			extra1;
 		u64			extra2;
 	} big_cqe;
-	/* for io_uring hybrid iopoll */
-	bool				poll_state;
-	u64				iopoll_start;
-	u64				iopoll_end;
 };
 
 struct io_overflow_cqe {
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 034997a1e507..5a290a56af6c 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -199,7 +199,7 @@ enum io_uring_sqe_flags_bit {
  * Removes indirection through the SQ index array.
  */
 #define IORING_SETUP_NO_SQARRAY		(1U << 16)
-#define IORING_SETUP_HY_POLL		(1U << 17)
+#define IORING_SETUP_HYBRID_IOPOLL	(1U << 17)
 
 enum io_uring_op {
 	IORING_OP_NOP,
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 9631e10d681b..35071442fb70 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -307,7 +307,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 		goto err;
 
 	ctx->flags = p->flags;
-	ctx->available_time = LLONG_MAX;
+	ctx->poll_available_time = LLONG_MAX;
 	atomic_set(&ctx->cq_wait_nr, IO_CQ_WAKE_INIT);
 	init_waitqueue_head(&ctx->sqo_sq_wait);
 	INIT_LIST_HEAD(&ctx->sqd_list);
@@ -3646,6 +3646,11 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
 	ctx->clockid = CLOCK_MONOTONIC;
 	ctx->clock_offset = 0;
 
+	/* HYBRID_IOPOLL only valid with IOPOLL */
+	if ((ctx->flags & (IORING_SETUP_IOPOLL|IORING_SETUP_HYBRID_IOPOLL)) ==
+	    IORING_SETUP_HYBRID_IOPOLL)
+		return -EINVAL;
+
 	if (!(ctx->flags & IORING_SETUP_NO_SQARRAY))
 		static_branch_inc(&io_key_has_sqarray);
 
@@ -3808,7 +3813,7 @@ static long io_uring_setup(u32 entries, struct io_uring_params __user *params)
 			IORING_SETUP_SQE128 | IORING_SETUP_CQE32 |
 			IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN |
 			IORING_SETUP_NO_MMAP | IORING_SETUP_REGISTERED_FD_ONLY |
-			IORING_SETUP_NO_SQARRAY | IORING_SETUP_HY_POLL))
+			IORING_SETUP_NO_SQARRAY | IORING_SETUP_HYBRID_IOPOLL))
 		return -EINVAL;
 
 	return io_uring_create(entries, &p, params);
diff --git a/io_uring/rw.c b/io_uring/rw.c
index b86cef10ed72..6f7bf40df85a 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -782,13 +782,6 @@ static bool need_complete_io(struct io_kiocb *req)
 		S_ISBLK(file_inode(req->file)->i_mode);
 }
 
-static void init_hybrid_poll(struct io_kiocb *req)
-{
-	/* make sure every req only block once*/
-	req->poll_state = false;
-	req->iopoll_start = ktime_get_ns();
-}
-
 static int io_rw_init_file(struct io_kiocb *req, fmode_t mode, int rw_type)
 {
 	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
@@ -826,8 +819,11 @@ static int io_rw_init_file(struct io_kiocb *req, fmode_t mode, int rw_type)
 		kiocb->ki_flags |= IOCB_HIPRI;
 		kiocb->ki_complete = io_complete_rw_iopoll;
 		req->iopoll_completed = 0;
-		if (ctx->flags & IORING_SETUP_HY_POLL)
-			init_hybrid_poll(req);
+		if (ctx->flags & IORING_SETUP_HYBRID_IOPOLL) {
+			/* make sure every req only blocks once */
+			req->poll_state = false;
+			req->iopoll_start = ktime_get_ns();
+		}
 	} else {
 		if (kiocb->ki_flags & IOCB_HIPRI)
 			return -EINVAL;
@@ -1126,27 +1122,24 @@ void io_rw_fail(struct io_kiocb *req)
 	io_req_set_res(req, res, req->cqe.flags);
 }
 
-static int io_uring_classic_poll(struct io_kiocb *req,
-		struct io_comp_batch *iob, unsigned int poll_flags)
+static int io_uring_classic_poll(struct io_kiocb *req, struct io_comp_batch *iob,
+				 unsigned int poll_flags)
 {
-	int ret;
 	struct file *file = req->file;
 
 	if (req->opcode == IORING_OP_URING_CMD) {
 		struct io_uring_cmd *ioucmd;
 
 		ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
-		ret = file->f_op->uring_cmd_iopoll(ioucmd, iob,
-						poll_flags);
+		return file->f_op->uring_cmd_iopoll(ioucmd, iob, poll_flags);
 	} else {
 		struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
 
-		ret = file->f_op->iopoll(&rw->kiocb, iob, poll_flags);
+		return file->f_op->iopoll(&rw->kiocb, iob, poll_flags);
 	}
-	return ret;
 }
 
-static u64 io_delay(struct io_ring_ctx *ctx, struct io_kiocb *req)
+static u64 io_hybrid_iopoll_delay(struct io_ring_ctx *ctx, struct io_kiocb *req)
 {
 	struct hrtimer_sleeper timer;
 	enum hrtimer_mode mode;
@@ -1156,11 +1149,11 @@ static u64 io_delay(struct io_ring_ctx *ctx, struct io_kiocb *req)
 	if (req->poll_state)
 		return 0;
 
-	if (ctx->available_time == LLONG_MAX)
+	if (ctx->poll_available_time == LLONG_MAX)
 		return 0;
 
-	/* Using half running time to do schedul */
-	sleep_time = ctx->available_time / 2;
+	/* Sleep for half of the shortest observed completion time */
+	sleep_time = ctx->poll_available_time / 2;
 	kt = ktime_set(0, sleep_time);
 
 	req->poll_state = true;
@@ -1177,7 +1170,6 @@ static u64 io_delay(struct io_ring_ctx *ctx, struct io_kiocb *req)
 	hrtimer_cancel(&timer.timer);
 	__set_current_state(TASK_RUNNING);
 	destroy_hrtimer_on_stack(&timer.timer);
-
 	return sleep_time;
 }
 
@@ -1185,19 +1177,21 @@ static int io_uring_hybrid_poll(struct io_kiocb *req,
 				struct io_comp_batch *iob, unsigned int poll_flags)
 {
 	struct io_ring_ctx *ctx = req->ctx;
-	int ret;
 	u64 runtime, sleep_time;
+	int ret;
 
-	sleep_time = io_delay(ctx, req);
+	sleep_time = io_hybrid_iopoll_delay(ctx, req);
 	ret = io_uring_classic_poll(req, iob, poll_flags);
 	req->iopoll_end = ktime_get_ns();
 	runtime = req->iopoll_end - req->iopoll_start - sleep_time;
 
-	/* use minimize sleep time if there are different speed
-	 * drivers, it could get more completions from fast one
+	/*
+	 * Use minimum sleep time if we're polling devices with different
+	 * latencies. We could get more completions from the faster ones.
 	 */
-	if (ctx->available_time > runtime)
-		ctx->available_time = runtime;
+	if (ctx->poll_available_time > runtime)
+		ctx->poll_available_time = runtime;
+
 	return ret;
 }
 
@@ -1227,7 +1221,7 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
 		if (READ_ONCE(req->iopoll_completed))
 			break;
 
-		if (ctx->flags & IORING_SETUP_HY_POLL)
+		if (ctx->flags & IORING_SETUP_HYBRID_IOPOLL)
 			ret = io_uring_hybrid_poll(req, &iob, poll_flags);
 		else
 			ret = io_uring_classic_poll(req, &iob, poll_flags);
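For completeness, here is roughly how an application would opt in from
userspace once this lands. This is a hypothetical, untested liburing
snippet; it assumes liburing headers that know about the
IORING_SETUP_HYBRID_IOPOLL name from the incremental above:

#include <liburing.h>
#include <stdio.h>

int main(void)
{
	struct io_uring ring;
	int ret;

	/* hybrid polling is a mode of IOPOLL, so both flags are required */
	ret = io_uring_queue_init(8, &ring,
				  IORING_SETUP_IOPOLL |
				  IORING_SETUP_HYBRID_IOPOLL);
	if (ret < 0) {
		fprintf(stderr, "queue_init: %d\n", ret);
		return 1;
	}

	/*
	 * Passing IORING_SETUP_HYBRID_IOPOLL without IORING_SETUP_IOPOLL
	 * would instead fail with -EINVAL, per the io_uring_create() check
	 * added above.
	 */
	io_uring_queue_exit(&ring);
	return 0;
}

As with any IOPOLL ring, the actual I/O would still need to target a
file opened with O_DIRECT on a device configured for polled completions.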
-- 
Jens Axboe