On 9/19/24 4:22 AM, Pavel Begunkov wrote: > On 9/18/24 19:03, Jens Axboe wrote: >> io_cqring_wait() doesn't run normal task_work after the local work, and >> it's the only location to do it in that order. Normally this doesn't >> matter, except if: >> >> 1) The ring is setup with DEFER_TASKRUN >> 2) The local work item may generate normal task_work >> >> For condition 2, this can happen when closing a file and it's the final >> put of that file, for example. This can cause stalls where a task is >> waiting to make progress, but there's nothing else that will wake it up. > > TIF_NOTIFY_SIGNAL from normal task_work should prevent the task > from sleeping until it processes task works, that should make > the waiting loop make another iteration and get to the task work > execution again (if it continues to sleep). I don't understand how > the patch works, but if it's legit sounds we have a bigger problem, > e.g. what if someone else queue up a work right after that tw > execution block. It's not TIF_NOTIFY_SIGNAL, for that case it would've been fine. It would've just meant another loop around for waiting. As the likelihood of defer task_work generating normal task_work is infinitely higher than the other way around, I do think re-ordering makes sense regardless. The final fput will use TIF_NOTIFY_RESUME, as it should not be something that interrupts the task. Just needs to get done eventually when it exits to userspace. But for this case obviously that's a bit more problematic. We can also do something like the below which should fix it as well, which may be a better approach. At least, as it currently stands, TIF_NOTIFY_RESUME and TIF_NOTIFY_SIGNAL are the two signaling mechanisms for that. Hence checking for pending task_work and ensuring our task_work run handles both should be saner. I'd still swap the ordering of the task_work runs, however. diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c index 75f0087183e5..56097627eafc 100644 --- a/io_uring/io_uring.c +++ b/io_uring/io_uring.c @@ -2472,7 +2472,7 @@ static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx, return 1; if (unlikely(!llist_empty(&ctx->work_llist))) return 1; - if (unlikely(test_thread_flag(TIF_NOTIFY_SIGNAL))) + if (unlikely(task_work_pending(current))) return 1; if (unlikely(task_sigpending(current))) return -EINTR; diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h index 9d70b2cf7b1e..2fbf0ea9c171 100644 --- a/io_uring/io_uring.h +++ b/io_uring/io_uring.h @@ -308,15 +308,17 @@ static inline int io_run_task_work(void) */ if (test_thread_flag(TIF_NOTIFY_SIGNAL)) clear_notify_signal(); + + if (test_thread_flag(TIF_NOTIFY_RESUME)) { + __set_current_state(TASK_RUNNING); + resume_user_mode_work(NULL); + } + /* * PF_IO_WORKER never returns to userspace, so check here if we have * notify work that needs processing. */ if (current->flags & PF_IO_WORKER) { - if (test_thread_flag(TIF_NOTIFY_RESUME)) { - __set_current_state(TASK_RUNNING); - resume_user_mode_work(NULL); - } if (current->io_uring) { unsigned int count = 0; -- Jens Axboe