On 8/3/21 3:16 PM, Nadav Amit wrote:
>
>
>> On Aug 3, 2021, at 12:53 PM, Jens Axboe <axboe@xxxxxxxxx> wrote:
>>
>> How about this? I think this largely stems from the fact that we only
>> do a partial running decrement on exit. Left the previous checks in
>> place as well, as it will reduce the number of times that we do need
>> to hit that case.
>
> It did not apply cleanly on my 5.13, but after I cleaned it up, it
> still got stuck (more frequently than when I used your previous
> solution).
>
> I do not see how the problem is related to the partial running
> decrement. Thinking of it, I think that the problem might even happen
> if multiple calls to io_wqe_activate_free_worker() wake up the same
> worker, not realizing that they race (since __io_worker_busy() was
> still not called by io_worker_handle_work()).

That's actually by design for io-wq in general, we assume that the work
won't block, and in that case we only want to activate the one worker.

> Anyhow, I think there are a few problems in the patch you sent. Once I
> addressed a couple of them, my test passes, but I am not sure you
> actually want the final result, and I am not sure it is
> robust/correct.
>
> See my comments below for the changes I added and other questions I
> have (you can answer only if you have time).
>
>>
>>
>> diff --git a/fs/io-wq.c b/fs/io-wq.c
>> index cf086b01c6c6..f072995d382b 100644
>> --- a/fs/io-wq.c
>> +++ b/fs/io-wq.c
>> @@ -35,12 +35,17 @@ enum {
>>          IO_WQE_FLAG_STALLED     = 1,    /* stalled on hash */
>>  };
>>
>> +enum {
>> +        IO_WORKER_EXITING = 0,  /* worker is exiting */
>> +};
>> +
>>  /*
>>   * One for each thread in a wqe pool
>>   */
>>  struct io_worker {
>>          refcount_t ref;
>>          unsigned flags;
>> +        unsigned long state;
>>          struct hlist_nulls_node nulls_node;
>>          struct list_head all_list;
>>          struct task_struct *task;
>> @@ -130,6 +135,7 @@ struct io_cb_cancel_data {
>>  };
>>
>>  static void create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index);
>> +static void io_wqe_dec_running(struct io_worker *worker);
>>
>>  static bool io_worker_get(struct io_worker *worker)
>>  {
>> @@ -168,26 +174,21 @@ static void io_worker_exit(struct io_worker *worker)
>>  {
>>          struct io_wqe *wqe = worker->wqe;
>>          struct io_wqe_acct *acct = io_wqe_get_acct(worker);
>> -        unsigned flags;
>>
>>          if (refcount_dec_and_test(&worker->ref))
>>                  complete(&worker->ref_done);
>>          wait_for_completion(&worker->ref_done);
>>
>> -        preempt_disable();
>> -        current->flags &= ~PF_IO_WORKER;
>> -        flags = worker->flags;
>> -        worker->flags = 0;
>> -        if (flags & IO_WORKER_F_RUNNING)
>> -                atomic_dec(&acct->nr_running);
>> -        worker->flags = 0;
>> -        preempt_enable();
>> -
>>          raw_spin_lock_irq(&wqe->lock);
>> -        if (flags & IO_WORKER_F_FREE)
>> +        if (worker->flags & IO_WORKER_F_FREE)
>>                  hlist_nulls_del_rcu(&worker->nulls_node);
>>          list_del_rcu(&worker->all_list);
>>          acct->nr_workers--;
>> +        preempt_disable();
>> +        io_wqe_dec_running(worker);
>
> IIUC, in the scenario I encountered, acct->nr_running might be
> non-zero, but still a new worker would be needed. So the check in
> io_wqe_dec_running() is insufficient to spawn a new worker at this
> point, no?

If nr_running != 0, then we have active workers. They will either
complete the work they have without blocking, or if they block, then
we'll create a new one. So it really should be enough, I'm a bit
puzzled...
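To spell out that reasoning, here's a tiny stand-alone userspace model
of the accounting (illustrative only, not the io-wq source --
dec_running() and create_worker() are made-up stand-ins for
io_wqe_dec_running() and worker creation):

/* Toy model, not kernel code: io-wq-style nr_running accounting. */
#include <stdatomic.h>
#include <stdio.h>

static atomic_int nr_running = 1;       /* currently active workers */
static atomic_int work_queued = 1;      /* pending work items */

static void create_worker(void)
{
        atomic_fetch_add(&nr_running, 1);
        printf("spawned replacement worker\n");
}

/* A worker is about to block or exit: drop out of the running count. */
static void dec_running(void)
{
        /*
         * Whoever drops nr_running to zero while work is still queued
         * is responsible for creating a replacement. As long as the
         * count stays non-zero, some still-active worker will either
         * finish the work or hit this same path itself.
         */
        if (atomic_fetch_sub(&nr_running, 1) == 1 &&
            atomic_load(&work_queued) > 0)
                create_worker();
}

int main(void)
{
        dec_running();  /* last runner leaves with work pending */
        return 0;
}

That invariant is what I'm relying on above: nr_running can only hit
zero with work queued if someone takes the replacement path.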
>> +        worker->flags = 0;
>> +        current->flags &= ~PF_IO_WORKER;
>> +        preempt_enable();
>>          raw_spin_unlock_irq(&wqe->lock);
>>
>>          kfree_rcu(worker, rcu);
>> @@ -214,15 +215,20 @@ static bool io_wqe_activate_free_worker(struct io_wqe *wqe)
>>          struct hlist_nulls_node *n;
>>          struct io_worker *worker;
>>
>> -        n = rcu_dereference(hlist_nulls_first_rcu(&wqe->free_list));
>> -        if (is_a_nulls(n))
>> -                return false;
>> -
>> -        worker = hlist_nulls_entry(n, struct io_worker, nulls_node);
>> -        if (io_worker_get(worker)) {
>> -                wake_up_process(worker->task);
>> +        /*
>> +         * Iterate free_list and see if we can find an idle worker to
>> +         * activate. If a given worker is on the free_list but in the
>> +         * process of exiting, keep trying.
>> +         */
>> +        hlist_nulls_for_each_entry_rcu(worker, n, &wqe->free_list, nulls_node) {
>> +                if (!io_worker_get(worker))
>> +                        continue;
>
> Presumably you want to rely on the ordering between io_worker_get(),
> i.e. the refcount_inc_not_zero(), and the test_bit(). I guess no
> memory barrier is needed here (since refcount_inc_not_zero() returns a
> value), but documentation would help. Anyhow, I do not see how it
> helps.

Right, no extra barriers needed.

>> +                if (!test_bit(IO_WORKER_EXITING, &worker->state)) {
>> +                        wake_up_process(worker->task);
>
> So this might be the main problem. The worker might be in between
> waking and setting IO_WORKER_EXITING. One option (that I tried, and
> which works at least in limited testing) is to look at whether the
> process was actually woken, according to the return value of
> wake_up_process(), and not use workers that were not actually woken.
>
> So I changed it to:
>
>                 if (wake_up_process(worker->task)) {
>                         io_worker_release(worker);
>                         return true;
>                 }
>
>> +                        io_worker_release(worker);
>
> The refcount is decreased, so the refcount_read() in io_wqe_worker()
> would not see the elevated refcount. No?

That's probably not a bad idea, though I'm not quite sure that'd always
be safe. I'm going to need to look deeper, because we really should not
have a lot of concurrent activity here in terms of multiple issuers
looking up free workers and activating them.

Can you share a bit about what the workload looks like? That might help
create a reproducer, which would be handy going forward as well.

>> +                        return true;
>> +                }
>>                  io_worker_release(worker);
>> -                return true;
>>          }
>>
>>          return false;
>> @@ -560,8 +566,17 @@ static int io_wqe_worker(void *data)
>>                  if (ret)
>>                          continue;
>>                  /* timed out, exit unless we're the fixed worker */
>> -                if (!(worker->flags & IO_WORKER_F_FIXED))
>> +                if (!(worker->flags & IO_WORKER_F_FIXED)) {
>> +                        /*
>> +                         * Someone elevated our refs, which could be
>> +                         * trying to re-activate for work. Loop one
>> +                         * more time for that case.
>> +                         */
>> +                        if (refcount_read(&worker->ref) != 1)
>> +                                continue;
>
> I am not sure what this serves, as the refcount is decreased in
> io_wqe_activate_free_worker() right after wake_up_process().

It should just go away, I think; it'd be better to cut the patch down
to the functional part.

> Anyhow, presumably you need smp_mb__before_atomic() here, no? I added
> one. Yet, without the check on the wake_up_process() return value this
> still seems broken.

Yes, it would need that.
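And just so we're looking at the same approach, here's a stand-alone
userspace toy of that wake_up_process() return value idea -- not the
real io-wq code; try_wake() and activate_one() are made-up stand-ins
for wake_up_process() and the free_list scan:

/* Toy model, not kernel code: only claim a worker we actually woke. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

enum worker_state { SLEEPING, RUNNING };

struct worker {
        _Atomic enum worker_state state;
};

/* Analog of wake_up_process(): true only if we did the state change. */
static bool try_wake(struct worker *w)
{
        enum worker_state expected = SLEEPING;

        return atomic_compare_exchange_strong(&w->state, &expected,
                                              RUNNING);
}

/* Analog of the activation loop: skip workers we failed to wake. */
static bool activate_one(struct worker *workers, int nr)
{
        for (int i = 0; i < nr; i++) {
                if (try_wake(&workers[i]))
                        return true;    /* this worker owns the work */
        }
        return false;   /* caller must create a new worker instead */
}

int main(void)
{
        struct worker w[2] = {
                { RUNNING },    /* woke by itself, may be exiting */
                { SLEEPING },
        };

        /* w[0] can't be claimed, so w[1] gets woken instead */
        printf("activated: %d\n", activate_one(w, 2));
        return 0;
}

A failed wake tells the caller this worker can't be trusted to pick up
the work, so we keep scanning (or create a new worker) instead of
silently losing it.

-- 
Jens Axboe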