On 8/30/22 10:54, Dylan Yudaken wrote:
On Mon, 2022-08-22 at 12:34 +0100, Pavel Begunkov wrote:
[...]
+
+	node = io_llist_cmpxchg(&ctx->work_llist, &fake, NULL);
+	if (node != &fake) {
+		current_final = &fake;
+		node = io_llist_xchg(&ctx->work_llist, &fake);
+		goto again;
+	}
+
+	if (locked) {
+		io_submit_flush_completions(ctx);
+		mutex_unlock(&ctx->uring_lock);
+	}
+	return ret;
+}
I was thinking about:

int io_run_local_work(struct io_ring_ctx *ctx, bool *locked)
{
	if (!*locked)
		*locked = try_lock();
	/* ... run the local work list ... */
}

bool locked = false;
io_run_local_work(ctx, &locked);
if (locked)
	unlock();

// or just as below when already holding it
bool locked = true;
io_run_local_work(ctx, &locked);
Which would replace

if (DEFER) {
	// we're assuming that it'll unlock
	io_run_local_work(true);
} else {
	unlock();
}

with

if (DEFER) {
	bool locked = true;
	io_run_local_work(&locked);
}
unlock();
But anyway, it can be mulled later.
I think there is an easier way to clean it up if we allow an extra
unlock/lock in io_uring_enter (see below). Will do that in v4
fwiw, I'm fine with the current code; the rest can be cleaned up later if you prefer.
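
For illustration, a minimal sketch of that calling convention; try_lock()/unlock() above are stand-ins for mutex_trylock()/mutex_unlock() on ctx->uring_lock, and __io_run_local_work() below is a hypothetical worker, not an existing helper:

static int io_run_local_work(struct io_ring_ctx *ctx, bool *locked)
{
	int ret;

	/* only try to take uring_lock if the caller doesn't hold it yet */
	if (!*locked)
		*locked = mutex_trylock(&ctx->uring_lock);

	ret = __io_run_local_work(ctx, *locked);	/* hypothetical: drain ctx->work_llist */
	if (*locked)
		io_submit_flush_completions(ctx);
	return ret;	/* the caller unlocks iff *locked is true */
}

A caller not holding the lock would then do:

	bool locked = false;

	ret = io_run_local_work(ctx, &locked);
	if (locked)
		mutex_unlock(&ctx->uring_lock);

while a caller already holding uring_lock passes locked = true and keeps its own unconditional unlock afterwards.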
[...]
@@ -3057,10 +3160,20 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 		}
 		if ((flags & IORING_ENTER_GETEVENTS) && ctx->syscall_iopoll)
 			goto iopoll_locked;
+		if ((flags & IORING_ENTER_GETEVENTS) &&
+		    (ctx->flags & IORING_SETUP_DEFER_TASKRUN)) {
+			int ret2 = io_run_local_work(ctx, true);
+
+			if (unlikely(ret2 < 0))
+				goto out;
It's an optimisation and we don't have to handle errors here;
let's ignore them and make it look a bit better.
I'm not convinced about that - as then there is no way the application
will know it is trying to complete events on the wrong thread. Work
will just silently pile up instead.
by optimisation I mean exactly this chunk right after submission.
If it's the wrong thread this call will be ignored, then control flow
will fall into cq_wait and fail there, returning an error. So the
userspace should get an error in the end, but the handling would be
consolidated in cq_wait.
That being said - with the changes below I can just get rid of this
code I think.
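
For illustration, the post-submission chunk could then shrink to something like the sketch below; this assumes io_run_local_work() fails with -EEXIST when called from a task other than the one recorded at ring setup, and that the same call in the cq_wait path would repeat and return that error to userspace:

+		/* Best effort while we still hold uring_lock; an error here
+		 * (e.g. wrong task) is ignored, the same call in the wait
+		 * path will fail and be returned to userspace.
+		 */
+		if ((flags & IORING_ENTER_GETEVENTS) &&
+		    (ctx->flags & IORING_SETUP_DEFER_TASKRUN))
+			io_run_local_work(ctx, true);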
+			goto getevents_ran_local;
+		}
 		mutex_unlock(&ctx->uring_lock);
 	}
+
 	if (flags & IORING_ENTER_GETEVENTS) {
 		int ret2;
+
 		if (ctx->syscall_iopoll) {
 			/*
 			 * We disallow the app entering submit/complete with
@@ -3081,6 +3194,12 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 			const sigset_t __user *sig;
 			struct __kernel_timespec __user *ts;
+			if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
I think it should be in io_cqring_wait(), which calls it anyway
in the beginning. Instead of
do {
	io_cqring_overflow_flush(ctx);
	if (io_cqring_events(ctx) >= min_events)
		return 0;
	if (!io_run_task_work())
		break;
} while (1);
Let's have
do {
	ret = io_run_task_work_ctx();
	// handle ret
	io_cqring_overflow_flush(ctx);
	if (io_cqring_events(ctx) >= min_events)
		return 0;
} while (1);
I think that is ok.
The downside is that it adds an extra lock/unlock of the ctx in some
cases. I assume that will be negligible?
Not sure there will be any extra locking. IIRC, it was about replacing

// io_uring_enter() -> GETEVENTS path
run_tw();
// io_cqring_wait()
while (cqes_ready() < needed)
	run_tw();

With:

// io_uring_enter()
do {
	run_tw();
} while (cqes_ready() < needed);
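
io_run_task_work_ctx() isn't spelled out here; just as a sketch, it could be a thin wrapper along these lines (the return convention below is purely illustrative):

static int io_run_task_work_ctx(struct io_ring_ctx *ctx)
{
	/* DEFER_TASKRUN rings keep their work on ctx->work_llist */
	if (ctx->flags & IORING_SETUP_DEFER_TASKRUN)
		return io_run_local_work(ctx, false);

	return io_run_task_work() ? 1 : 0;
}

with io_cqring_wait() checking the return and bailing out on a negative error before flushing overflow and re-checking io_cqring_events().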
+			ret2 = io_run_local_work(ctx, false);
+			if (unlikely(ret2 < 0))
+				goto getevents_out;
+			}
+getevents_ran_local:
 			ret2 = io_get_ext_arg(flags, argp, &argsz, &ts, &sig);
 			if (likely(!ret2)) {
 				min_complete = min(min_complete,
@@ -3090,6 +3209,7 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 			}
 		}
+getevents_out:
 		if (!ret) {
 			ret = ret2;
@@ -3289,17 +3409,29 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
 	if (ctx->flags & IORING_SETUP_SQPOLL) {
 		/* IPI related flags don't make sense with SQPOLL */
 		if (ctx->flags & (IORING_SETUP_COOP_TASKRUN |
-				  IORING_SETUP_TASKRUN_FLAG))
+				  IORING_SETUP_TASKRUN_FLAG |
+				  IORING_SETUP_DEFER_TASKRUN))
Sounds like we should also fail if SQPOLL is set, especially with
the task check on the waiting side.
That is what this code is doing I think? Did I miss something?
Ok, great then
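
For reference, "the task check on the waiting side" mentioned above would be something along these lines, assuming the series records the setup task as ctx->submitter_task (only a sketch):

	if ((ctx->flags & IORING_SETUP_DEFER_TASKRUN) &&
	    ctx->submitter_task != current)
		return -EEXIST;

With SQPOLL the ring is driven by a separate kernel thread, so rejecting the flag combination at setup, as the hunk above does, keeps that contract enforceable.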
--
Pavel Begunkov