On 8/30/22 10:54, Dylan Yudaken wrote:
On Mon, 2022-08-22 at 12:34 +0100, Pavel Begunkov wrote:
[...]
+
+	node = io_llist_cmpxchg(&ctx->work_llist, &fake, NULL);
+	if (node != &fake) {
+		current_final = &fake;
+		node = io_llist_xchg(&ctx->work_llist, &fake);
+		goto again;
+	}
+
+	if (locked) {
+		io_submit_flush_completions(ctx);
+		mutex_unlock(&ctx->uring_lock);
+	}
+	return ret;
+}
I was thinking about:

int io_run_local_work(struct io_ring_ctx *ctx, bool *locked)
{
	if (!*locked)
		*locked = try_lock();
	/* ... run the local work list ... */
}

bool locked = false;
io_run_local_work(ctx, &locked);
if (locked)
	unlock();

// or just as below when already holding it
bool locked = true;
io_run_local_work(ctx, &locked);
Which would replace

if (DEFER) {
	// we're assuming that it'll unlock
	io_run_local_work(true);
} else {
	unlock();
}

with

if (DEFER) {
	bool locked = true;
	io_run_local_work(&locked);
}
unlock();
But anyway, it can be mulled later.
I think there is an easier way to clean it up if we allow an extra
unlock/lock in io_uring_enter (see below). Will do that in v4
fwiw, I'm fine with the current code; the rest can be cleaned up later if you prefer.
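
For illustration, a minimal sketch of that calling convention; try_lock()/unlock() above are stand-ins for mutex_trylock()/mutex_unlock() on ctx->uring_lock, and __io_run_local_work() below is a hypothetical worker, not an existing helper:

static int io_run_local_work(struct io_ring_ctx *ctx, bool *locked)
{
	int ret;

	/* only try to take uring_lock if the caller doesn't hold it yet */
	if (!*locked)
		*locked = mutex_trylock(&ctx->uring_lock);

	ret = __io_run_local_work(ctx, *locked);	/* hypothetical: drain ctx->work_llist */
	if (*locked)
		io_submit_flush_completions(ctx);
	return ret;	/* the caller unlocks iff *locked is true */
}

A caller not holding the lock would then do:

	bool locked = false;

	ret = io_run_local_work(ctx, &locked);
	if (locked)
		mutex_unlock(&ctx->uring_lock);

while a caller already holding uring_lock passes locked = true and keeps its own unconditional unlock afterwards.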
[...]
@@ -3057,10 +3160,20 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 		}
 		if ((flags & IORING_ENTER_GETEVENTS) && ctx->syscall_iopoll)
 			goto iopoll_locked;
+		if ((flags & IORING_ENTER_GETEVENTS) &&
+		    (ctx->flags & IORING_SETUP_DEFER_TASKRUN)) {
+			int ret2 = io_run_local_work(ctx, true);
+
+			if (unlikely(ret2 < 0))
+				goto out;
It's an optimisation and we don't have to handle errors here;
let's ignore them and make it look a bit better.
I'm not convinced about that - as then there is no way the application
will know it is trying to complete events on the wrong thread. Work
will just silently pile up instead.
by optimisation I mean exactly this chunk right after submission.
If it's the wrong thread this call will be ignored, then control flow
will fall into cq_wait and fail there, returning an error. So the
userspace should get an error in the end, but the handling would be
consolidated in cq_wait.
That being said - with the changes below I can just get rid of this
code I think.
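
For illustration, the post-submission chunk could then shrink to something like the sketch below; this assumes io_run_local_work() fails with -EEXIST when called from a task other than the one recorded at ring setup, and that the same call in the cq_wait path would repeat and return that error to userspace:

+		/* Best effort while we still hold uring_lock; an error here
+		 * (e.g. wrong task) is ignored, the same call in the wait
+		 * path will fail and be returned to userspace.
+		 */
+		if ((flags & IORING_ENTER_GETEVENTS) &&
+		    (ctx->flags & IORING_SETUP_DEFER_TASKRUN))
+			io_run_local_work(ctx, true);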
+			goto getevents_ran_local;
+		}
 		mutex_unlock(&ctx->uring_lock);
 	}
+
 	if (flags & IORING_ENTER_GETEVENTS) {
 		int ret2;
+
 		if (ctx->syscall_iopoll) {
 			/*
 			 * We disallow the app entering submit/complete with
@@ -3081,6 +3194,12 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 			const sigset_t __user *sig;
 			struct __kernel_timespec __user *ts;
+			if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
I think it should be in io_cqring_wait(), which calls it anyway
in the beginning. Instead of
do {
	io_cqring_overflow_flush(ctx);
	if (io_cqring_events(ctx) >= min_events)
		return 0;
	if (!io_run_task_work())
		break;
} while (1);
Let's have
do {
	ret = io_run_task_work_ctx();
	// handle ret
	io_cqring_overflow_flush(ctx);
	if (io_cqring_events(ctx) >= min_events)
		return 0;
} while (1);
I think that is ok.
The downside is that it adds an extra lock/unlock of the ctx in some
cases. I assume that will be negligible?
Not sure there will be any extra locking. IIRC, it was about replacing

// io_uring_enter() -> GETEVENTS path
run_tw();
// io_cqring_wait()
while (cqes_ready() < needed)
	run_tw();

With:

// io_uring_enter()
do {
	run_tw();
} while (cqes_ready() < needed);
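
io_run_task_work_ctx() isn't spelled out here; just as a sketch, it could be a thin wrapper along these lines (the return convention below is purely illustrative):

static int io_run_task_work_ctx(struct io_ring_ctx *ctx)
{
	/* DEFER_TASKRUN rings keep their work on ctx->work_llist */
	if (ctx->flags & IORING_SETUP_DEFER_TASKRUN)
		return io_run_local_work(ctx, false);

	return io_run_task_work() ? 1 : 0;
}

with io_cqring_wait() checking the return and bailing out on a negative error before flushing overflow and re-checking io_cqring_events().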
+			ret2 = io_run_local_work(ctx, false);
+			if (unlikely(ret2 < 0))
+				goto getevents_out;
+			}
+getevents_ran_local:
 			ret2 = io_get_ext_arg(flags, argp, &argsz, &ts, &sig);
 			if (likely(!ret2)) {
 				min_complete = min(min_complete,
@@ -3090,6 +3209,7 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 			}
 		}
+getevents_out:
 		if (!ret) {
 			ret = ret2;
@@ -3289,17 +3409,29 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
 	if (ctx->flags & IORING_SETUP_SQPOLL) {
 		/* IPI related flags don't make sense with SQPOLL */
 		if (ctx->flags & (IORING_SETUP_COOP_TASKRUN |
-				  IORING_SETUP_TASKRUN_FLAG))
+				  IORING_SETUP_TASKRUN_FLAG |
+				  IORING_SETUP_DEFER_TASKRUN))
Sounds like we should also fail if SQPOLL is set, especially with
the task check on the waiting side.
That is what this code is doing I think? Did I miss something?
Ok, great then
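
For reference, "the task check on the waiting side" mentioned above would be something along these lines, assuming the series records the setup task as ctx->submitter_task (only a sketch):

	if ((ctx->flags & IORING_SETUP_DEFER_TASKRUN) &&
	    ctx->submitter_task != current)
		return -EEXIST;

With SQPOLL the ring is driven by a separate kernel thread, so rejecting the flag combination at setup, as the hunk above does, keeps that contract enforceable.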
--
Pavel Begunkov