> On Aug 3, 2021, at 12:53 PM, Jens Axboe <axboe@xxxxxxxxx> wrote:
> 
> How about this? I think this largely stems from the fact that we only
> do a partial running decrement on exit. Left the previous checks in
> place as well, as it will reduce the amount of times that we do need
> to hit that case.

It did not apply cleanly on my 5.13, but after I cleaned it up, it
still got stuck (more frequently than when I used your previous
solution).

I do not see how the problem is related to the partial running
decrement. Thinking about it, I think the problem might even happen if
multiple calls to io_wqe_activate_free_worker() wake up the same worker
without realizing that they race (since __io_worker_busy() has still
not been called by io_worker_handle_work()).

Anyhow, I think there are a few problems in the patch you sent. Once I
addressed a couple of them, my test passes, but I am not sure the final
result is what you actually want, and I am not sure it is
robust/correct.

See my comments below for the changes I added and other questions I
have (answer only if you have time).

> 
> 
> diff --git a/fs/io-wq.c b/fs/io-wq.c
> index cf086b01c6c6..f072995d382b 100644
> --- a/fs/io-wq.c
> +++ b/fs/io-wq.c
> @@ -35,12 +35,17 @@ enum {
>         IO_WQE_FLAG_STALLED = 1, /* stalled on hash */
> };
> 
> +enum {
> +        IO_WORKER_EXITING = 0, /* worker is exiting */
> +};
> +
> /*
>  * One for each thread in a wqe pool
>  */
> struct io_worker {
>         refcount_t ref;
>         unsigned flags;
> +        unsigned long state;
>         struct hlist_nulls_node nulls_node;
>         struct list_head all_list;
>         struct task_struct *task;
> @@ -130,6 +135,7 @@ struct io_cb_cancel_data {
> };
> 
> static void create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index);
> +static void io_wqe_dec_running(struct io_worker *worker);
> 
> static bool io_worker_get(struct io_worker *worker)
> {
> @@ -168,26 +174,21 @@ static void io_worker_exit(struct io_worker *worker)
> {
>         struct io_wqe *wqe = worker->wqe;
>         struct io_wqe_acct *acct = io_wqe_get_acct(worker);
> -        unsigned flags;
> 
>         if (refcount_dec_and_test(&worker->ref))
>                 complete(&worker->ref_done);
>         wait_for_completion(&worker->ref_done);
> 
> -        preempt_disable();
> -        current->flags &= ~PF_IO_WORKER;
> -        flags = worker->flags;
> -        worker->flags = 0;
> -        if (flags & IO_WORKER_F_RUNNING)
> -                atomic_dec(&acct->nr_running);
> -        worker->flags = 0;
> -        preempt_enable();
> -
>         raw_spin_lock_irq(&wqe->lock);
> -        if (flags & IO_WORKER_F_FREE)
> +        if (worker->flags & IO_WORKER_F_FREE)
>                 hlist_nulls_del_rcu(&worker->nulls_node);
>         list_del_rcu(&worker->all_list);
>         acct->nr_workers--;
> +        preempt_disable();
> +        io_wqe_dec_running(worker);

IIUC, in the scenario I encountered, acct->nr_running might be
non-zero, yet a new worker would still be needed. So the check in
io_wqe_dec_running() is insufficient to spawn a new worker at this
point, no?

> +        worker->flags = 0;
> +        current->flags &= ~PF_IO_WORKER;
> +        preempt_enable();
>         raw_spin_unlock_irq(&wqe->lock);
> 
>         kfree_rcu(worker, rcu);
> @@ -214,15 +215,20 @@ static bool io_wqe_activate_free_worker(struct io_wqe *wqe)
>         struct hlist_nulls_node *n;
>         struct io_worker *worker;
> 
> -        n = rcu_dereference(hlist_nulls_first_rcu(&wqe->free_list));
> -        if (is_a_nulls(n))
> -                return false;
> -
> -        worker = hlist_nulls_entry(n, struct io_worker, nulls_node);
> -        if (io_worker_get(worker)) {
> -                wake_up_process(worker->task);
> +        /*
> +         * Iterate free_list and see if we can find an idle worker to
> +         * activate. If a given worker is on the free_list but in the process
> +         * of exiting, keep trying.
> +         */
> +        hlist_nulls_for_each_entry_rcu(worker, n, &wqe->free_list, nulls_node) {
> +                if (!io_worker_get(worker))
> +                        continue;

Presumably you want to rely on the ordering between io_worker_get(),
i.e. the refcount_inc_not_zero(), and the test_bit(). I guess no memory
barrier is needed here (since refcount_inc_not_zero() returns a value),
but a comment documenting that would help. Anyhow, I do not see how it
helps.

> +                if (!test_bit(IO_WORKER_EXITING, &worker->state)) {
> +                        wake_up_process(worker->task);

So this might be the main problem. The worker might be in between
waking up and setting IO_WORKER_EXITING.

One option (which I tried, and it works, at least in limited testing)
is to check whether the process was actually woken, based on the return
value of wake_up_process(), and not to use workers that were not
actually woken. So I changed it to:

                if (wake_up_process(worker->task)) {
                        io_worker_release(worker);
                        return true;
                }

> +                        io_worker_release(worker);

The refcount is decreased here, so the refcount_read() in
io_wqe_worker() would not see the elevated refcount, no?

> +                        return true;
> +                }
>                 io_worker_release(worker);
> -                return true;
>         }
> 
>         return false;
> @@ -560,8 +566,17 @@ static int io_wqe_worker(void *data)
>                 if (ret)
>                         continue;
>                 /* timed out, exit unless we're the fixed worker */
> -                if (!(worker->flags & IO_WORKER_F_FIXED))
> +                if (!(worker->flags & IO_WORKER_F_FIXED)) {
> +                        /*
> +                         * Someone elevated our refs, which could be trying
> +                         * to re-activate for work. Loop one more time for
> +                         * that case.
> +                         */
> +                        if (refcount_read(&worker->ref) != 1)
> +                                continue;

I am not sure what this achieves, as the refcount is decreased in
io_wqe_activate_free_worker() right after wake_up_process(). Anyhow,
presumably you need an smp_mb__before_atomic() here, no? I added one.
Yet, without checking the return value of wake_up_process(), this still
seems broken.

> +                        set_bit(IO_WORKER_EXITING, &worker->state);
>                         break;
> +                }
>         }
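
For reference, this is roughly what the loop in
io_wqe_activate_free_worker() looks like on my end with the
wake_up_process() return-value check folded in. Consider it a sketch on
top of your patch rather than a proper submission; everything outside
the snippet is unchanged:

        hlist_nulls_for_each_entry_rcu(worker, n, &wqe->free_list, nulls_node) {
                if (!io_worker_get(worker))
                        continue;
                if (!test_bit(IO_WORKER_EXITING, &worker->state)) {
                        /*
                         * Only claim this worker if it was actually
                         * woken; wake_up_process() returns 0 if the
                         * task was already runnable, which here means
                         * it may already be on its way out.
                         */
                        if (wake_up_process(worker->task)) {
                                io_worker_release(worker);
                                return true;
                        }
                }
                io_worker_release(worker);
        }

        return false;

With that, a waker that loses the race against a timing-out worker just
moves on to the next free worker (or returns false so a new worker gets
created), instead of claiming a worker that will never pick up the
work.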
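And this is roughly what the timeout path in io_wqe_worker() ends up
looking like with the barrier I mentioned. The placement, right before
the set_bit(), and whether smp_mb__before_atomic() is the right (or a
needed) barrier at all, is exactly the part I am unsure about:

                /* timed out, exit unless we're the fixed worker */
                if (!(worker->flags & IO_WORKER_F_FIXED)) {
                        /*
                         * Someone elevated our refs, which could be trying
                         * to re-activate for work. Loop one more time for
                         * that case.
                         */
                        if (refcount_read(&worker->ref) != 1)
                                continue;
                        /*
                         * Order the refcount check above against marking
                         * ourselves as exiting.
                         */
                        smp_mb__before_atomic();
                        set_bit(IO_WORKER_EXITING, &worker->state);
                        break;
                }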