On 8/3/21 7:22 AM, Jens Axboe wrote:
> On 8/2/21 7:05 PM, Nadav Amit wrote:
>> Hello Jens,
>>
>> I encountered an issue, which appears to be a race between
>> io_wqe_worker() and io_wqe_wake_worker(). I am not sure how to address
>> this issue and whether I am missing something, since this seems to
>> occur in a common scenario. Your feedback (or fix ;-)) would be
>> appreciated.
>>
>> On 5.13, I run a workload that issues multiple async read operations
>> that should run concurrently. Some read operations cannot complete
>> for unbounded time (e.g., a read from a pipe that is never written
>> to). The problem is that occasionally another read operation that
>> should complete gets stuck. My understanding, based on debugging and
>> the code, is that the following race (or something similar) occurs:
>>
>>   cpu0                                  cpu1
>>   ----                                  ----
>>   io_wqe_worker()
>>    schedule_timeout()
>>     // timed out
>>                                         io_wqe_enqueue()
>>                                          io_wqe_wake_worker()
>>                                           // work_flags & IO_WQ_WORK_CONCURRENT
>>                                           io_wqe_activate_free_worker()
>>   io_worker_exit()
>>
>> Basically, io_wqe_wake_worker() can find a worker, but this worker is
>> about to exit and is not going to process further work. Once the
>> worker exits, the concurrency level decreases and async work might be
>> blocked behind other work. I had a look at 5.14, but did not see
>> anything that might address this issue.
>>
>> Am I missing something?
>>
>> If not, all my ideas for a solution are either complicated (track the
>> required concurrency level) or relaxed (spawn another worker in
>> io_worker_exit() if the work_list of unbounded work is not empty).
>>
>> As said, feedback would be appreciated.
>
> You are right that there's definitely a race here between checking the
> free list and finding a worker, where that worker is already exiting.
> Let me mull over this a bit, I'll post something for you to try later
> today.

Can you try something like this? Just consider it a first tester; I need
to spend a bit more time on it to ensure we fully close the gap.

diff --git a/fs/io-wq.c b/fs/io-wq.c
index cf086b01c6c6..e2da2042ee9e 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -42,6 +42,7 @@ struct io_worker {
 	refcount_t ref;
 	unsigned flags;
 	struct hlist_nulls_node nulls_node;
+	unsigned long exiting;
 	struct list_head all_list;
 	struct task_struct *task;
 	struct io_wqe *wqe;
@@ -214,15 +215,20 @@ static bool io_wqe_activate_free_worker(struct io_wqe *wqe)
 	struct hlist_nulls_node *n;
 	struct io_worker *worker;
 
-	n = rcu_dereference(hlist_nulls_first_rcu(&wqe->free_list));
-	if (is_a_nulls(n))
-		return false;
-
-	worker = hlist_nulls_entry(n, struct io_worker, nulls_node);
-	if (io_worker_get(worker)) {
-		wake_up_process(worker->task);
+	/*
+	 * Iterate free_list and see if we can find an idle worker to
+	 * activate. If a given worker is on the free_list but in the process
+	 * of exiting, keep trying.
+	 */
+	hlist_nulls_for_each_entry_rcu(worker, n, &wqe->free_list, nulls_node) {
+		if (!io_worker_get(worker))
+			continue;
+		if (!test_bit(0, &worker->exiting)) {
+			wake_up_process(worker->task);
+			io_worker_release(worker);
+			return true;
+		}
 		io_worker_release(worker);
-		return true;
 	}
 
 	return false;
@@ -560,8 +566,17 @@ static int io_wqe_worker(void *data)
 		if (ret)
 			continue;
 		/* timed out, exit unless we're the fixed worker */
-		if (!(worker->flags & IO_WORKER_F_FIXED))
+		if (!(worker->flags & IO_WORKER_F_FIXED)) {
+			/*
+			 * Someone elevated our refs, which could be trying
+			 * to re-activate for work. Loop one more time for
+			 * that case.
+			 */
+			if (refcount_read(&worker->ref) != 1)
+				continue;
+			set_bit(0, &worker->exiting);
 			break;
+		}
 	}
 
 	if (test_bit(IO_WQ_BIT_EXIT, &wq->state)) {

-- 
Jens Axboe
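
To see the shape of the fix outside the kernel, here is a minimal
userspace sketch of the same pattern, using C11 atomics in place of the
kernel's refcount and bitops. Every name here (fake_worker, worker_get,
wake_one, and so on) is made up for illustration and is not io-wq code;
the point is only the protocol: the waker walks the whole free list and
skips any worker that has already flagged itself as exiting, rather than
trusting the first entry it finds.

/*
 * Userspace sketch of the "skip exiting workers" pattern from the
 * patch above. Hypothetical names; not fs/io-wq.c.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define NR_WORKERS 4

struct fake_worker {
	atomic_int ref;		/* stands in for io_worker->ref */
	atomic_bool exiting;	/* stands in for the new ->exiting bit */
	bool woken;
};

static struct fake_worker workers[NR_WORKERS];

/* Mirrors io_worker_get(): succeeds only while refs are still live. */
static bool worker_get(struct fake_worker *w)
{
	int old = atomic_load(&w->ref);

	while (old > 0) {
		if (atomic_compare_exchange_weak(&w->ref, &old, old + 1))
			return true;
	}
	return false;
}

static void worker_release(struct fake_worker *w)
{
	atomic_fetch_sub(&w->ref, 1);
}

/*
 * Mirrors the patched io_wqe_activate_free_worker(): keep scanning
 * instead of returning after the first list entry, and skip anyone
 * who has already committed to exiting.
 */
static bool wake_one(void)
{
	for (int i = 0; i < NR_WORKERS; i++) {
		struct fake_worker *w = &workers[i];

		if (!worker_get(w))
			continue;
		if (!atomic_load(&w->exiting)) {
			w->woken = true;	/* stands in for wake_up_process() */
			worker_release(w);
			return true;
		}
		worker_release(w);
	}
	return false;
}

int main(void)
{
	for (int i = 0; i < NR_WORKERS; i++)
		atomic_init(&workers[i].ref, 1);

	/* Workers 0 and 1 timed out and flagged themselves as exiting. */
	atomic_store(&workers[0].exiting, true);
	atomic_store(&workers[1].exiting, true);

	/* The waker lands on worker 2 instead of a dying worker. */
	if (wake_one()) {
		for (int i = 0; i < NR_WORKERS; i++) {
			if (workers[i].woken)
				printf("woke worker %d\n", i);
		}
	}
	return 0;
}

The two-sided protocol is the same as in the patch: the worker publishes
its intent to exit (set_bit() on ->exiting) only after confirming nobody
holds an extra reference, and the waker refuses to count an exiting
worker as a successful wakeup.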