On 8/3/21 7:22 AM, Jens Axboe wrote:
> On 8/2/21 7:05 PM, Nadav Amit wrote:
>> Hello Jens,
>>
>> I encountered an issue, which appears to be a race between
>> io_wqe_worker() and io_wqe_wake_worker(). I am not sure how to address
>> this issue and whether I am missing something, since this seems to
>> occur in a common scenario. Your feedback (or fix ;-)) would be
>> appreciated.
>>
>> On 5.13, I run a workload that issues multiple async read operations
>> that should run concurrently. Some read operations cannot complete
>> for unbounded time (e.g., a read from a pipe that is never written
>> to). The problem is that occasionally another read operation that
>> should complete gets stuck. My understanding, based on debugging and
>> the code, is that the following race (or something similar) occurs:
>>
>>   cpu0                                  cpu1
>>   ----                                  ----
>>   io_wqe_worker()
>>    schedule_timeout()
>>     // timed out
>>                                         io_wqe_enqueue()
>>                                          io_wqe_wake_worker()
>>                                           // work_flags & IO_WQ_WORK_CONCURRENT
>>                                           io_wqe_activate_free_worker()
>>   io_worker_exit()
>>
>> Basically, io_wqe_wake_worker() can find a worker, but this worker is
>> about to exit and is not going to process further work. Once the
>> worker exits, the concurrency level decreases and async work might be
>> blocked behind other work. I had a look at 5.14, but did not see
>> anything that might address this issue.
>>
>> Am I missing something?
>>
>> If not, all my ideas for a solution are either complicated (track the
>> required concurrency level) or relaxed (spawn another worker in
>> io_worker_exit() if the work_list of unbounded work is not empty).
>>
>> As said, feedback would be appreciated.
>
> You are right that there's definitely a race here between checking the
> free list and finding a worker, where that worker is already exiting.
> Let me mull over this a bit, I'll post something for you to try later
> today.

Can you try something like this? Just consider it a first tester; I need
to spend a bit more time on it to ensure we fully close the gap.

diff --git a/fs/io-wq.c b/fs/io-wq.c
index cf086b01c6c6..e2da2042ee9e 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -42,6 +42,7 @@ struct io_worker {
 	refcount_t ref;
 	unsigned flags;
 	struct hlist_nulls_node nulls_node;
+	unsigned long exiting;
 	struct list_head all_list;
 	struct task_struct *task;
 	struct io_wqe *wqe;
@@ -214,15 +215,20 @@ static bool io_wqe_activate_free_worker(struct io_wqe *wqe)
 	struct hlist_nulls_node *n;
 	struct io_worker *worker;
 
-	n = rcu_dereference(hlist_nulls_first_rcu(&wqe->free_list));
-	if (is_a_nulls(n))
-		return false;
-
-	worker = hlist_nulls_entry(n, struct io_worker, nulls_node);
-	if (io_worker_get(worker)) {
-		wake_up_process(worker->task);
+	/*
+	 * Iterate free_list and see if we can find an idle worker to
+	 * activate. If a given worker is on the free_list but in the process
+	 * of exiting, keep trying.
+	 */
+	hlist_nulls_for_each_entry_rcu(worker, n, &wqe->free_list, nulls_node) {
+		if (!io_worker_get(worker))
+			continue;
+		if (!test_bit(0, &worker->exiting)) {
+			wake_up_process(worker->task);
+			io_worker_release(worker);
+			return true;
+		}
 		io_worker_release(worker);
-		return true;
 	}
 
 	return false;
@@ -560,8 +566,17 @@ static int io_wqe_worker(void *data)
 		if (ret)
 			continue;
 		/* timed out, exit unless we're the fixed worker */
-		if (!(worker->flags & IO_WORKER_F_FIXED))
+		if (!(worker->flags & IO_WORKER_F_FIXED)) {
+			/*
+			 * Someone elevated our refs, which could be trying
+			 * to re-activate for work. Loop one more time for
+			 * that case.
+			 */
+			if (refcount_read(&worker->ref) != 1)
+				continue;
+			set_bit(0, &worker->exiting);
 			break;
+		}
 	}
 
 	if (test_bit(IO_WQ_BIT_EXIT, &wq->state)) {

-- 
Jens Axboe
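
To see the shape of the fix outside the kernel, here is a minimal
userspace sketch of the same pattern, using C11 atomics in place of the
kernel's refcount and bitops. Every name here (fake_worker, worker_get,
wake_one, and so on) is made up for illustration and is not io-wq code;
the point is only the protocol: the waker walks the whole free list and
skips any worker that has already flagged itself as exiting, rather than
trusting the first entry it finds.

/*
 * Userspace sketch of the "skip exiting workers" pattern from the
 * patch above. Hypothetical names; not fs/io-wq.c.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define NR_WORKERS 4

struct fake_worker {
	atomic_int ref;		/* stands in for io_worker->ref */
	atomic_bool exiting;	/* stands in for the new ->exiting bit */
	bool woken;
};

static struct fake_worker workers[NR_WORKERS];

/* Mirrors io_worker_get(): succeeds only while refs are still live. */
static bool worker_get(struct fake_worker *w)
{
	int old = atomic_load(&w->ref);

	while (old > 0) {
		if (atomic_compare_exchange_weak(&w->ref, &old, old + 1))
			return true;
	}
	return false;
}

static void worker_release(struct fake_worker *w)
{
	atomic_fetch_sub(&w->ref, 1);
}

/*
 * Mirrors the patched io_wqe_activate_free_worker(): keep scanning
 * instead of returning after the first list entry, and skip anyone
 * who has already committed to exiting.
 */
static bool wake_one(void)
{
	for (int i = 0; i < NR_WORKERS; i++) {
		struct fake_worker *w = &workers[i];

		if (!worker_get(w))
			continue;
		if (!atomic_load(&w->exiting)) {
			w->woken = true;	/* stands in for wake_up_process() */
			worker_release(w);
			return true;
		}
		worker_release(w);
	}
	return false;
}

int main(void)
{
	for (int i = 0; i < NR_WORKERS; i++)
		atomic_init(&workers[i].ref, 1);

	/* Workers 0 and 1 timed out and flagged themselves as exiting. */
	atomic_store(&workers[0].exiting, true);
	atomic_store(&workers[1].exiting, true);

	/* The waker lands on worker 2 instead of a dying worker. */
	if (wake_one()) {
		for (int i = 0; i < NR_WORKERS; i++) {
			if (workers[i].woken)
				printf("woke worker %d\n", i);
		}
	}
	return 0;
}

The two-sided protocol is the same as in the patch: the worker publishes
its intent to exit (set_bit() on ->exiting) only after confirming nobody
holds an extra reference, and the waker refuses to count an exiting
worker as a successful wakeup.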