Re: [PATCH 1/4] io-wq: tweak return value of io_wqe_create_worker()

On 2021/9/13 5:34 AM, Jens Axboe wrote:
On 9/12/21 1:02 PM, Hao Xu wrote:
On 2021/9/13 2:10 AM, Jens Axboe wrote:
On 9/11/21 1:40 PM, Hao Xu wrote:
Going by this function's name, io_wqe_create_worker() should return false
when we cannot create a new worker.

Signed-off-by: Hao Xu <haoxu@xxxxxxxxxxxxxxxxx>
---
   fs/io-wq.c | 2 +-
   1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/io-wq.c b/fs/io-wq.c
index 382efca4812b..1b102494e970 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -267,7 +267,7 @@ static bool io_wqe_create_worker(struct io_wqe *wqe, struct io_wqe_acct *acct)
   		return create_io_worker(wqe->wq, wqe, acct->index);
   	}
- return true;
+	return false;
   }

I think this is just a bit confusing. It's not an error case, we just
didn't need to create a worker. So don't return failure, or the caller
will think that we failed while we did not.
Hmm, I think it is an error case: 'we failed to create a new worker
since nr_workers == max_workers'. nr_workers == max_workers doesn't mean
'no need'; we may hit the situation described in 4/4: max_workers is 1,

But that's not an error case in the sense of "uh oh, we need to handle
this as an error". If we're at the max worker count, the work simply has
to wait for another work to be done and process it.

currently 1 worker is running, and we return true here:

            did_create = io_wqe_create_worker(wqe, acct);

               //*******nr_workers changes******//

            if (unlikely(!did_create)) {
                    raw_spin_lock(&wqe->lock);
                    /* fatal condition, failed to create the first worker */
                    if (!acct->nr_workers) {
                            raw_spin_unlock(&wqe->lock);
                            goto run_cancel;
                    }
                    raw_spin_unlock(&wqe->lock);
            }

we will miss the next check, but we have to do that check, since the
number of workers may decrease to 0 at the //******// place.

If that happens, then the work that we have inserted has already been
run. Otherwise how else could we have dropped to zero workers?

Sorry, I see. I forgot the fix moved the place of the nr_workers decrement...
There is no problem now. Thanks for the explanation, Jens.

 io_wqe_enqueue                   worker1
                               no work there and timeout
                               nr_workers--(after fix)
                               unlock(wqe->lock)
 ->insert work

 ->io_wqe_create_worker

                               ->io_worker_exit
                                 ->nr_workers--(before fix)

