On Mon, 2022-08-22 at 12:34 +0100, Pavel Begunkov wrote:
> On 8/19/22 13:19, Dylan Yudaken wrote:
> > Allow deferring async tasks until the user calls io_uring_enter(2)
> > with the IORING_ENTER_GETEVENTS flag. Enable this mode with a flag
> > at io_uring_setup time. This functionality requires that the later
> > io_uring_enter will be called from the same submission task, and
> > therefore restrict this flag to work only when
> > IORING_SETUP_SINGLE_ISSUER is also set.
>
> Looks ok, a couple of small comments below, but I don't see anything
> blocking it.
>
> > Being able to hand pick when task work is run prevents the problem
> > where there is current work to be done, yet task work runs anyway.
> >
> > For example, a common workload would obtain a batch of CQEs and
> > process each one. Interrupting this with additional task work would
> > add latency but not gain anything. If instead task work is deferred
> > to just before more CQEs are obtained, then no additional latency
> > is added.
> >
> > The way this is implemented is by trying to keep task work local to
> > an io_ring_ctx, rather than to the submission task. This is
> > required, as the application will want to wake up only a single
> > io_ring_ctx at a time to process work, and so the lists of work
> > have to be kept separate.
> >
> > This has some other benefits, like not having to check the task
> > continually in handle_tw_list (and potentially unlocking/locking
> > those), and reducing locks in the submit & process completions
> > path.
> >
> > There are networking cases where using this option can reduce
> > request latency by 50%. For example, a contrived example using [1]
> > where the client sends 2k data and receives the same data back
> > while doing some system calls (to trigger task work) shows this
> > reduction. The reason ends up being that if sending responses is
> > delayed by processing task work, then the client side sits idle.
> > Whereas reordering the sends first means that the client runs its
> > workload in parallel with the local task work.
>
> Quite contrived; for some it may cut latency in half, but for others
> it may just as easily increase it twofold. In any case, it's not a
> critique of the feature as it's optional, but it rather raises the
> question whether we need to add some fairness / scheduling here.
>
> > [1]:
> > Using https://github.com/DylanZA/netbench/tree/defer_run
> > Client:
> > ./netbench --client_only 1 --control_port 10000 --host <host> --tx
> > "epoll --threads 16 --per_thread 1 --size 2048 --resp 2048
> > --workload 1000"
> > Server:
> > ./netbench --server_only 1 --control_port 10000 --rx "io_uring
> > --defer_taskrun 0 --workload 100" --rx "io_uring --defer_taskrun 1
> > --workload 100"
> >
> > Signed-off-by: Dylan Yudaken <dylany@xxxxxx>
> > ---
> >
> > diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> > index 53696dd90626..6572d2276750 100644
> > --- a/io_uring/io_uring.c
> > +++ b/io_uring/io_uring.c
> [...]
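For reviewers who want to poke at this from userspace: a minimal,
untested sketch of the usage pattern the commit message describes,
assuming a liburing build that already defines the new setup flags
(error handling mostly omitted):

#include <liburing.h>
#include <stdio.h>

int main(void)
{
        struct io_uring ring;
        struct io_uring_cqe *cqes[16];
        unsigned i, n;
        int ret;

        /* Task work is deferred until completions are reaped; the
         * series requires SINGLE_ISSUER for this mode. */
        ret = io_uring_queue_init(64, &ring,
                                  IORING_SETUP_SINGLE_ISSUER |
                                  IORING_SETUP_DEFER_TASKRUN);
        if (ret < 0) {
                fprintf(stderr, "queue_init: %d\n", ret);
                return 1;
        }

        io_uring_prep_nop(io_uring_get_sqe(&ring));

        /* Enters the kernel with IORING_ENTER_GETEVENTS, which is
         * where the deferred task work now runs - just before CQEs
         * are gathered, so the batch loop below is not interrupted. */
        io_uring_submit_and_wait(&ring, 1);

        n = io_uring_peek_batch_cqe(&ring, cqes, 16);
        for (i = 0; i < n; i++)
                printf("cqe res=%d\n", cqes[i]->res);
        io_uring_cq_advance(&ring, n);

        io_uring_queue_exit(&ring);
        return 0;
}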
> > +int io_run_local_work(struct io_ring_ctx *ctx, bool locked)
> > +{
> > +	struct llist_node *node;
> > +	struct llist_node fake;
> > +	struct llist_node *current_final = NULL;
> > +	int ret;
> > +
> > +	if (unlikely(ctx->submitter_task != current)) {
> > +		if (locked)
> > +			mutex_unlock(&ctx->uring_lock);
> > +
> > +		/* maybe this is before any submissions */
> > +		if (!ctx->submitter_task)
> > +			return 0;
> > +
> > +		return -EEXIST;
> > +	}
> > +
> > +	if (!locked)
> > +		locked = mutex_trylock(&ctx->uring_lock);
> > +
> > +	node = io_llist_xchg(&ctx->work_llist, &fake);
> > +	ret = 0;
> > +again:
> > +	while (node != current_final) {
> > +		struct llist_node *next = node->next;
> > +		struct io_kiocb *req = container_of(node, struct io_kiocb,
> > +						    io_task_work.node);
> > +		prefetch(container_of(next, struct io_kiocb, io_task_work.node));
> > +		req->io_task_work.func(req, &locked);
> > +		ret++;
> > +		node = next;
> > +	}
> > +
> > +	if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
> > +		atomic_andnot(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);
> > +
> > +	node = io_llist_cmpxchg(&ctx->work_llist, &fake, NULL);
> > +	if (node != &fake) {
> > +		current_final = &fake;
> > +		node = io_llist_xchg(&ctx->work_llist, &fake);
> > +		goto again;
> > +	}
> > +
> > +	if (locked) {
> > +		io_submit_flush_completions(ctx);
> > +		mutex_unlock(&ctx->uring_lock);
> > +	}
> > +	return ret;
> > +}
>
> I was thinking about:
>
> int io_run_local_work(struct io_ring_ctx *ctx, bool *locked)
> {
> 	*locked = try_lock();
> }
>
> bool locked = false;
> io_run_local_work(ctx, &locked);
> if (locked)
> 	unlock();
>
> // or just as below when already holding it
> bool locked = true;
> io_run_local_work(ctx, &locked);
>
> Which would replace
>
> if (DEFER) {
> 	// we're assuming that it'll unlock
> 	io_run_local_work(true);
> } else {
> 	unlock();
> }
>
> with
>
> if (DEFER) {
> 	bool locked = true;
> 	io_run_local_work(&locked);
> }
> unlock();
>
> But anyway, it can be mulled later.

I think there is an easier way to clean it up if we allow an extra
unlock/lock in io_uring_enter (see below). Will do that in v4.

> > -int io_run_task_work_sig(void)
> > +int io_run_task_work_sig(struct io_ring_ctx *ctx)
> >  {
> > -	if (io_run_task_work())
> > +	if (io_run_task_work_ctx(ctx))
> >  		return 1;
> >  	if (task_sigpending(current))
> >  		return -EINTR;
> > @@ -2196,7 +2294,7 @@ static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx,
> >  	unsigned long check_cq;
> >
> >  	/* make sure we run task_work before checking for signals */
> > -	ret = io_run_task_work_sig();
> > +	ret = io_run_task_work_sig(ctx);
> >  	if (ret || io_should_wake(iowq))
> >  		return ret;
> >
> > @@ -2230,7 +2328,7 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
> >  		io_cqring_overflow_flush(ctx);
> >  		if (io_cqring_events(ctx) >= min_events)
> >  			return 0;
> > -		if (!io_run_task_work())
> > +		if (!io_run_task_work_ctx(ctx))
> >  			break;
> >  	} while (1);
> >
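As an aside, since the fake-node dance in io_run_local_work() is easy
to misread: the same drain pattern can be sketched in plain C11
atomics. All names below are invented for illustration; this is not
kernel API.

#include <stdatomic.h>
#include <stdio.h>

struct node {
        struct node *next;
        int val;
};

/* Lock-free push; producers may run concurrently with drain(). */
static void push(_Atomic(struct node *) *head, struct node *n)
{
        n->next = atomic_load(head);
        while (!atomic_compare_exchange_weak(head, &n->next, n))
                ;
}

/*
 * Drain with a stack-local sentinel, mirroring io_run_local_work():
 * swap the head with &fake and walk up to the previous sentinel
 * position, then try to cmpxchg &fake back to NULL. If that fails,
 * producers pushed new entries on top of &fake in the meantime, so
 * swap again and walk until the old sentinel is reached.
 */
static int drain(_Atomic(struct node *) *head)
{
        struct node fake;
        struct node *stop = NULL;       /* current_final in the patch */
        struct node *n = atomic_exchange(head, &fake);
        struct node *expected;
        int ret = 0;

again:
        while (n != stop) {
                struct node *next = n->next;

                printf("item %d\n", n->val);    /* req->io_task_work.func() */
                ret++;
                n = next;
        }

        expected = &fake;
        if (!atomic_compare_exchange_strong(head, &expected, NULL)) {
                stop = &fake;
                n = atomic_exchange(head, &fake);
                goto again;
        }
        return ret;
}

int main(void)
{
        static _Atomic(struct node *) head;
        struct node a = { .val = 1 }, b = { .val = 2 };

        push(&head, &a);
        push(&head, &b);
        printf("drained %d\n", drain(&head));
        return 0;
}

The second swap is what guarantees forward progress: anything pushed
while draining lands on top of the sentinel, gets picked up by the
next round, and iteration stops once the old sentinel is reached.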
> > @@ -2573,6 +2671,9 @@ static __cold void io_ring_exit_work(struct work_struct *work)
> >  	 * as nobody else will be looking for them.
> >  	 */
> >  	do {
> > +		if (ctx->flags & IORING_SETUP_DEFER_TASKRUN)
> > +			io_move_task_work_from_local(ctx);
> > +
> >  		while (io_uring_try_cancel_requests(ctx, NULL, true))
> >  			cond_resched();
> >
> > @@ -2768,6 +2869,8 @@ static __cold bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
> >  		}
> >  	}
> >
> > +	if (ctx->flags & IORING_SETUP_DEFER_TASKRUN)
> > +		ret |= io_run_local_work(ctx, false) > 0;
> >  	ret |= io_cancel_defer_files(ctx, task, cancel_all);
> >  	mutex_lock(&ctx->uring_lock);
> >  	ret |= io_poll_remove_all(ctx, task, cancel_all);
> > @@ -3057,10 +3160,20 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
> >  		}
> >  		if ((flags & IORING_ENTER_GETEVENTS) && ctx->syscall_iopoll)
> >  			goto iopoll_locked;
> > +		if ((flags & IORING_ENTER_GETEVENTS) &&
> > +		    (ctx->flags & IORING_SETUP_DEFER_TASKRUN)) {
> > +			int ret2 = io_run_local_work(ctx, true);
> > +
> > +			if (unlikely(ret2 < 0))
> > +				goto out;
>
> It's an optimisation and we don't have to handle errors here,
> let's ignore them and make it look a bit better.

I'm not convinced about that - as then there is no way the application
will know it is trying to complete events on the wrong thread. Work
will just silently pile up instead. That being said - with the changes
below I can just get rid of this code I think.

>
> > +			goto getevents_ran_local;
> > +		}
> >  		mutex_unlock(&ctx->uring_lock);
> >  	}
> > +
> >  	if (flags & IORING_ENTER_GETEVENTS) {
> >  		int ret2;
> > +
> >  		if (ctx->syscall_iopoll) {
> >  			/*
> >  			 * We disallow the app entering submit/complete with
> > @@ -3081,6 +3194,12 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
> >  			const sigset_t __user *sig;
> >  			struct __kernel_timespec __user *ts;
> >
> > +			if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
>
> I think it should be in io_cqring_wait(), which calls it anyway
> in the beginning. Instead of
>
> do {
> 	io_cqring_overflow_flush(ctx);
> 	if (io_cqring_events(ctx) >= min_events)
> 		return 0;
> 	if (!io_run_task_work())
> 		break;
> } while (1);
>
> Let's have
>
> do {
> 	ret = io_run_task_work_ctx(ctx);
> 	// handle ret
> 	io_cqring_overflow_flush(ctx);
> 	if (io_cqring_events(ctx) >= min_events)
> 		return 0;
> } while (1);

I think that is ok. The downside is that it adds an extra lock/unlock
of the ctx in some cases. I assume that will be negligible?

>
> > +				ret2 = io_run_local_work(ctx, false);
> > +				if (unlikely(ret2 < 0))
> > +					goto getevents_out;
> > +			}
> > +getevents_ran_local:
> >  			ret2 = io_get_ext_arg(flags, argp, &argsz, &ts, &sig);
> >  			if (likely(!ret2)) {
> >  				min_complete = min(min_complete,
> >  						   ctx->cq_entries);
> > @@ -3090,6 +3209,7 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
> >  			}
> >  		}
> >
> > +getevents_out:
> >  	if (!ret) {
> >  		ret = ret2;
> >
> > @@ -3289,17 +3409,29 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
> >  	if (ctx->flags & IORING_SETUP_SQPOLL) {
> >  		/* IPI related flags don't make sense with SQPOLL */
> >  		if (ctx->flags & (IORING_SETUP_COOP_TASKRUN |
> > -				  IORING_SETUP_TASKRUN_FLAG))
> > +				  IORING_SETUP_TASKRUN_FLAG |
> > +				  IORING_SETUP_DEFER_TASKRUN))
>
> Sounds like we should also fail if SQPOLL is set, especially with
> the task check on the waiting side.

That is what this code is doing, I think? Did I miss something?

>
> >  			goto err;
> >  		ctx->notify_method = TWA_SIGNAL_NO_IPI;
> >  	} else if (ctx->flags & IORING_SETUP_COOP_TASKRUN) {
> >  		ctx->notify_method = TWA_SIGNAL_NO_IPI;
> [...]
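To make the wrong-thread failure mode above concrete: a rough,
untested demo of what an application would observe, assuming a
liburing build that defines the new flags and that io_uring_wait_cqe()
surfaces the -EEXIST from io_uring_enter() as this series intends:

/* build: cc demo.c -luring -pthread */
#include <liburing.h>
#include <pthread.h>
#include <stdio.h>

static struct io_uring ring;

static void *wrong_thread_wait(void *arg)
{
        struct io_uring_cqe *cqe;

        /* Not the submitter task: with DEFER_TASKRUN the kernel
         * refuses to run the deferred work from here. */
        int ret = io_uring_wait_cqe(&ring, &cqe);

        printf("wait from other thread: %d (expect -EEXIST)\n", ret);
        return NULL;
}

int main(void)
{
        pthread_t t;

        if (io_uring_queue_init(8, &ring,
                                IORING_SETUP_SINGLE_ISSUER |
                                IORING_SETUP_DEFER_TASKRUN) < 0)
                return 1;

        io_uring_prep_nop(io_uring_get_sqe(&ring));
        io_uring_submit(&ring);

        pthread_create(&t, NULL, wrong_thread_wait, NULL);
        pthread_join(&t, NULL);

        io_uring_queue_exit(&ring);
        return 0;
}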
> >  	mutex_lock(&ctx->uring_lock);
> >  	ret = __io_uring_register(ctx, opcode, arg, nr_args);
> > diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
> > index 2f73f83af960..a9fb115234af 100644
> > --- a/io_uring/io_uring.h
> > +++ b/io_uring/io_uring.h
> > @@ -26,7 +26,8 @@ enum {
> [...]
> > +static inline int io_run_task_work_unlock_ctx(struct io_ring_ctx *ctx)
> > +{
> > +	int ret;
> > +
> > +	if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
> > +		ret = io_run_local_work(ctx, true);
> > +	} else {
> > +		mutex_unlock(&ctx->uring_lock);
> > +		ret = (int)io_run_task_work();
>
> Why do we need a cast? Let's keep the return type the same.

Ok, I'll update the return types here.

Dylan
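For what it's worth, one way the agreed cleanup might look, assuming
io_run_task_work() keeps returning bool; this is a guess at the v4
shape, not the actual follow-up patch:

/*
 * Hypothetical cleanup: io_run_task_work() returns bool, so let the
 * wrapper return bool as well and leave the "how much work ran"
 * accounting to the callers of io_run_local_work() that need it.
 */
static inline bool io_run_task_work_unlock_ctx(struct io_ring_ctx *ctx)
{
        bool ret;

        if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
                /* io_run_local_work() drops ctx->uring_lock itself */
                ret = io_run_local_work(ctx, true) > 0;
        } else {
                mutex_unlock(&ctx->uring_lock);
                ret = io_run_task_work();
        }

        return ret;
}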