Re: [PATCH for-next v3 4/7] io_uring: add IORING_SETUP_DEFER_TASKRUN

On Mon, 2022-08-22 at 12:34 +0100, Pavel Begunkov wrote:
> On 8/19/22 13:19, Dylan Yudaken wrote:
> > Allow deferring async tasks until the user calls io_uring_enter(2)
> > with the IORING_ENTER_GETEVENTS flag. Enable this mode with a flag
> > at io_uring_setup time. This functionality requires that the later
> > io_uring_enter will be called from the same submission task, and
> > therefore restrict this flag to work only when
> > IORING_SETUP_SINGLE_ISSUER is also set.
> 
> Looks ok, a couple of small comments below, but I don't see anything
> blocking it.
> 
> > Being able to hand-pick when tasks are run prevents the problem
> > where there is current work to be done, yet task work runs anyway.
> > 
> > For example, a common workload would obtain a batch of CQEs and
> > process each one. Interrupting this to run additional task work
> > would add latency but not gain anything. If instead task work is
> > deferred to just before more CQEs are obtained, then no additional
> > latency is added.
> > 
> > The way this is implemented is by trying to keep task work local to
> > an io_ring_ctx, rather than to the submission task. This is
> > required, as the application will want to wake up only a single
> > io_ring_ctx at a time to process work, and so the lists of work
> > have to be kept separate.
> > 
> > This has some other benefits, like not having to check the task
> > continually in handle_tw_list (and potentially unlocking/locking
> > those), and reducing locks in the submit & process completions
> > path.
> > 
> > There are networking cases where using this option can reduce
> > request latency by 50%. For example, a contrived benchmark using
> > [1], where the client sends 2k of data and receives the same data
> > back while doing some system calls (to trigger task work), shows
> > this reduction. The reason ends up being that if sending responses
> > is delayed by processing task work, then the client side sits idle.
> > Whereas reordering the sends first means that the client runs its
> > workload in parallel with the local task work.
> 
> Quite contrived - for some it may cut latency in half, but for others
> it could just as easily increase it twofold. In any case, it's not a
> critique of the feature as it's optional, but it rather raises the
> question of whether we need to add some fairness / scheduling here.
> 
> > [1]:
> > Using https://github.com/DylanZA/netbench/tree/defer_run
> > Client:
> > ./netbench --client_only 1 --control_port 10000 --host <host> \
> >   --tx "epoll --threads 16 --per_thread 1 --size 2048 --resp 2048 --workload 1000"
> > Server:
> > ./netbench --server_only 1 --control_port 10000 \
> >   --rx "io_uring --defer_taskrun 0 --workload 100" \
> >   --rx "io_uring --defer_taskrun 1 --workload 100"
> > 
> > Signed-off-by: Dylan Yudaken <dylany@xxxxxx>
> > ---
> 
> > diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> > index 53696dd90626..6572d2276750 100644
> > --- a/io_uring/io_uring.c
> > +++ b/io_uring/io_uring.c
> [...]
> 
> > +int io_run_local_work(struct io_ring_ctx *ctx, bool locked)
> > +{
> > +       struct llist_node *node;
> > +       struct llist_node fake;
> > +       struct llist_node *current_final = NULL;
> > +       int ret;
> > +
> > +       if (unlikely(ctx->submitter_task != current)) {
> > +               if (locked)
> > +                       mutex_unlock(&ctx->uring_lock);
> > +
> > +               /* maybe this is before any submissions */
> > +               if (!ctx->submitter_task)
> > +                       return 0;
> > +
> > +               return -EEXIST;
> > +       }
> > +
> > +       if (!locked)
> > +               locked = mutex_trylock(&ctx->uring_lock);
> > +
> > +       node = io_llist_xchg(&ctx->work_llist, &fake);
> > +       ret = 0;
> > +again:
> > +       while (node != current_final) {
> > +               struct llist_node *next = node->next;
> > +               struct io_kiocb *req = container_of(node, struct io_kiocb,
> > +                                                   io_task_work.node);
> > +               prefetch(container_of(next, struct io_kiocb, io_task_work.node));
> > +               req->io_task_work.func(req, &locked);
> > +               ret++;
> > +               node = next;
> > +       }
> > +
> > +       if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
> > +               atomic_andnot(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);
> > +
> > +       node = io_llist_cmpxchg(&ctx->work_llist, &fake, NULL);
> > +       if (node != &fake) {
> > +               current_final = &fake;
> > +               node = io_llist_xchg(&ctx->work_llist, &fake);
> > +               goto again;
> > +       }
> > +
> > +       if (locked) {
> > +               io_submit_flush_completions(ctx);
> > +               mutex_unlock(&ctx->uring_lock);
> > +       }
> > +       return ret;
> > +}
> 
> I was thinking about:
> 
> int io_run_local_work(struct io_ring_ctx *ctx, bool *locked)
> {
>         *locked = try_lock();
> }
> 
> bool locked = false;
> io_run_local_work(ctx, &locked);
> if (locked)
>         unlock();
> 
> // or just as below when already holding it
> bool locked = true;
> io_run_local_work(ctx, &locked);
> 
> Which would replace
> 
> if (DEFER) {
>         // we're assuming that it'll unlock
>         io_run_local_work(true);
> } else {
>         unlock();
> }
> 
> with
> 
> if (DEFER) {
>         bool locked = true;
>         io_run_local_work(&locked);
> }
> unlock();
> 
> But anyway, it can be mulled later.

I think there is an easier way to clean this up if we allow an extra
unlock/lock in io_uring_enter (see below). Will do that in v4.

> 
> 
> > -int io_run_task_work_sig(void)
> > +int io_run_task_work_sig(struct io_ring_ctx *ctx)
> >   {
> > -       if (io_run_task_work())
> > +       if (io_run_task_work_ctx(ctx))
> >                 return 1;
> >         if (task_sigpending(current))
> >                 return -EINTR;
> > @@ -2196,7 +2294,7 @@ static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx,
> >         unsigned long check_cq;
> >   
> >         /* make sure we run task_work before checking for signals */
> > -       ret = io_run_task_work_sig();
> > +       ret = io_run_task_work_sig(ctx);
> >         if (ret || io_should_wake(iowq))
> >                 return ret;
> >   
> > @@ -2230,7 +2328,7 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
> >                 io_cqring_overflow_flush(ctx);
> >                 if (io_cqring_events(ctx) >= min_events)
> >                         return 0;
> > -               if (!io_run_task_work())
> > +               if (!io_run_task_work_ctx(ctx))
> >                         break;
> >         } while (1);
> >   
> > @@ -2573,6 +2671,9 @@ static __cold void io_ring_exit_work(struct work_struct *work)
> >          * as nobody else will be looking for them.
> >          */
> >         do {
> > +               if (ctx->flags & IORING_SETUP_DEFER_TASKRUN)
> > +                       io_move_task_work_from_local(ctx);
> > +
> >                 while (io_uring_try_cancel_requests(ctx, NULL, true))
> >                         cond_resched();
> >   
> > @@ -2768,6 +2869,8 @@ static __cold bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
> >                 }
> >         }
> >   
> > +       if (ctx->flags & IORING_SETUP_DEFER_TASKRUN)
> > +               ret |= io_run_local_work(ctx, false) > 0;
> >         ret |= io_cancel_defer_files(ctx, task, cancel_all);
> >         mutex_lock(&ctx->uring_lock);
> >         ret |= io_poll_remove_all(ctx, task, cancel_all);
> > @@ -3057,10 +3160,20 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
> >                 }
> >                 if ((flags & IORING_ENTER_GETEVENTS) && ctx->syscall_iopoll)
> >                         goto iopoll_locked;
> > +               if ((flags & IORING_ENTER_GETEVENTS) &&
> > +                       (ctx->flags & IORING_SETUP_DEFER_TASKRUN)) {
> > +                       int ret2 = io_run_local_work(ctx, true);
> > +
> > +                       if (unlikely(ret2 < 0))
> > +                               goto out;
> 
> It's an optimisation and we don't have to handle errors here,
> let's ignore them and make it look a bit better.

I'm not convinced about that - then there is no way for the application
to know it is trying to complete events on the wrong thread; work will
just silently pile up instead.
That being said - with the changes below I think I can just get rid of
this code.

> 
> > +                       goto getevents_ran_local;
> > +               }
> >                 mutex_unlock(&ctx->uring_lock);
> >         }
> > +
> >         if (flags & IORING_ENTER_GETEVENTS) {
> >                 int ret2;
> > +
> >                 if (ctx->syscall_iopoll) {
> >                         /*
> >                          * We disallow the app entering submit/complete with
> > @@ -3081,6 +3194,12 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
> >                         const sigset_t __user *sig;
> >                         struct __kernel_timespec __user *ts;
> >   
> > +                       if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
> 
> I think it should be in io_cqring_wait(), which calls it anyway
> at the beginning. Instead of
> 
>         do {
>                 io_cqring_overflow_flush(ctx);
>                 if (io_cqring_events(ctx) >= min_events)
>                         return 0;
>                 if (!io_run_task_work())
>                         break;
>         } while (1);
> 
> Let's have
> 
>         do {
>                 ret = io_run_task_work_ctx(ctx);
>                 // handle ret
>                 io_cqring_overflow_flush(ctx);
>                 if (io_cqring_events(ctx) >= min_events)
>                         return 0;
>         } while (1);

I think that is OK.
The downside is that it adds an extra lock/unlock of the ctx in some
cases. I assume that will be negligible?
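
Concretely, I think the loop would end up looking something like this
(rough sketch, untested; it assumes io_run_task_work_ctx() reports how
much work it ran, or a negative error, which is roughly what the patch
does today):

        do {
                /* run (possibly ctx-local) task work before checking for CQEs */
                ret = io_run_task_work_ctx(ctx);
                if (unlikely(ret < 0))
                        return ret;
                io_cqring_overflow_flush(ctx);
                if (io_cqring_events(ctx) >= min_events)
                        return 0;
        } while (ret > 0);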

> 
> > +                               ret2 = io_run_local_work(ctx, false);
> > +                               if (unlikely(ret2 < 0))
> > +                                       goto getevents_out;
> > +                       }
> > +getevents_ran_local:
> >                         ret2 = io_get_ext_arg(flags, argp, &argsz, &ts, &sig);
> >                         if (likely(!ret2)) {
> >                                 min_complete = min(min_complete, ctx->cq_entries);
> > @@ -3090,6 +3209,7 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
> >                         }
> >                 }
> >   
> > +getevents_out:
> >                 if (!ret) {
> >                         ret = ret2;
> >   
> > @@ -3289,17 +3409,29 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
> >         if (ctx->flags & IORING_SETUP_SQPOLL) {
> >                 /* IPI related flags don't make sense with SQPOLL */
> >                 if (ctx->flags & (IORING_SETUP_COOP_TASKRUN |
> > -                                 IORING_SETUP_TASKRUN_FLAG))
> > +                                 IORING_SETUP_TASKRUN_FLAG |
> > +                                 IORING_SETUP_DEFER_TASKRUN))
> 
> Sounds like we should also fail if SQPOLL is set, especially with
> the task check on the waiting side.
> 

That is what this code is doing, I think? Did I miss something?


> >                         goto err;
> >                 ctx->notify_method = TWA_SIGNAL_NO_IPI;
> >         } else if (ctx->flags & IORING_SETUP_COOP_TASKRUN) {
> >                 ctx->notify_method = TWA_SIGNAL_NO_IPI;
> [...]
> >         mutex_lock(&ctx->uring_lock);
> >         ret = __io_uring_register(ctx, opcode, arg, nr_args);
> > diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
> > index 2f73f83af960..a9fb115234af 100644
> > --- a/io_uring/io_uring.h
> > +++ b/io_uring/io_uring.h
> > @@ -26,7 +26,8 @@ enum {
> [...]
> > +static inline int io_run_task_work_unlock_ctx(struct io_ring_ctx *ctx)
> > +{
> > +       int ret;
> > +
> > +       if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
> > +               ret = io_run_local_work(ctx, true);
> > +       } else {
> > +               mutex_unlock(&ctx->uring_lock);
> > +               ret = (int)io_run_task_work();
> 
> Why do we need a cast? Let's keep the return types the same.

OK, I'll update the return types here.
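
Roughly this (untested sketch; it assumes io_run_task_work() is also
changed to return an int, so the cast simply goes away):

static inline int io_run_task_work_unlock_ctx(struct io_ring_ctx *ctx)
{
        int ret;

        if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
                ret = io_run_local_work(ctx, true);
        } else {
                mutex_unlock(&ctx->uring_lock);
                ret = io_run_task_work();
        }

        return ret;
}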


Dylan



