On Wed, Aug 02, 2023 at 06:05:53PM +0200, Jan Kara wrote:
> On Fri 21-07-23 17:57:15, Ming Lei wrote:
> > From: David Jeffery <djeffery@xxxxxxxxxx>
> >
> > Current code supposes that it is enough to provide forward progress by just
> > waking up one wait queue after one completion batch is done.
> >
> > Unfortunately this way isn't enough, cause waiter can be added to
> > wait queue just after it is woken up.
> >
> > Follows one example(64 depth, wake_batch is 8)
> >
> > 1) all 64 tags are active
> >
> > 2) in each wait queue, there is only one single waiter
> >
> > 3) each time one completion batch(8 completions) wakes up just one waiter in each wait
> > queue, then immediately one new sleeper is added to this wait queue
> >
> > 4) after 64 completions, 8 waiters are wakeup, and there are still 8 waiters in each
> > wait queue
> >
> > 5) after another 8 active tags are completed, only one waiter can be wakeup, and the other 7
> > can't be waken up anymore.
> >
> > Turns out it isn't easy to fix this problem, so simply wakeup enough waiters for
> > single batch.
> >
> > Cc: David Jeffery <djeffery@xxxxxxxxxx>
> > Cc: Kemeng Shi <shikemeng@xxxxxxxxxxxxxxx>
> > Cc: Gabriel Krisman Bertazi <krisman@xxxxxxx>
> > Cc: Chengming Zhou <zhouchengming@xxxxxxxxxxxxx>
> > Cc: Jan Kara <jack@xxxxxxx>
> > Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx>
>
> I'm sorry for the delay - I was on vacation. I can see the patch got
> already merged and I'm not strictly against that (although I think Gabriel
> was experimenting with this exact wakeup scheme and as far as I remember
> the more eager waking up was causing performance decrease for some
> configurations). But let me challenge the analysis above a bit. For the
> sleeper to be added to a waitqueue in step 3), blk_mq_mark_tag_wait() must
> fail the blk_mq_get_driver_tag() call. Which means that all tags were used

Only request allocation via blk_mq_get_tag() is involved here; getting a
driver tag isn't.

> at that moment. To summarize, anytime we add any new waiter to the
> waitqueue, all tags are used and thus we should eventually receive enough
> wakeups to wake all of them. What am I missing?

By the time blk_mq_get_tag() runs its final retry (__blk_mq_get_tag())
before sleeping (io_schedule()), the sleeper has already been added to the
wait queue.

So when two completion batches arrive, both may wake up the same wq, because
both completion paths can observe the same ->wake_index. Both wake_up_nr()
calls can return > 0, because adding a sleeper to the wq can be interleaved
with wake_up_nr(). As a result, 16 completions wake up just 2 sleepers, both
added to the same wq.

If this happens on one wq with >= 8 sleepers while another two wqs still
have pending waiters, an I/O hang is triggered.

Thanks,
Ming
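
For illustration only, the interleaving described above can be sketched as a
toy model in Python. This is not the kernel code: the function names
(wake_first_active, wake_full_batch, race) are hypothetical, each wait queue
is reduced to a plain counter of sleepers, and the model assumes 8 wait queues
with wake_batch 8, as in the example in the commit message. The first function
stops at the first wait queue that yields a waiter, the second keeps walking
the queues until a full batch has been woken.

```python
SBQ_WAIT_QUEUES = 8   # number of wait queues in this model (assumption)
WAKE_BATCH = 8        # completions per wakeup batch, as in the example


def wake_first_active(queues, wake_index, nr):
    """Model of the earlier scheme: stop at the first queue that yields a waiter."""
    for _ in range(len(queues)):
        q = wake_index
        wake_index = (wake_index + 1) % len(queues)
        woken = min(queues[q], nr)      # wake up to nr sleepers from this queue
        queues[q] -= woken
        if woken:
            return woken, wake_index
    return 0, wake_index


def wake_full_batch(queues, wake_index, nr):
    """Model of the eager scheme: walk the queues until nr sleepers are woken."""
    total = 0
    for _ in range(len(queues)):
        q = wake_index
        wake_index = (wake_index + 1) % len(queues)
        woken = min(queues[q], nr - total)
        queues[q] -= woken
        total += woken
        if total == nr:
            break
    return total, wake_index


def race(wake_fn):
    """Two completion batches observe the same wake_index; a sleeper joins in between."""
    queues = [1] * SBQ_WAIT_QUEUES      # one sleeper in every wait queue
    stale_index = 0                     # both batches read this same value
    first, _ = wake_fn(queues, stale_index, WAKE_BATCH)
    queues[0] += 1                      # new sleeper lands on the wq just woken
    second, _ = wake_fn(queues, stale_index, WAKE_BATCH)
    return first + second, queues
```

Under these assumptions, race(wake_first_active) wakes only 2 sleepers for 16
completions and leaves one sleeper in each of the other seven queues, while
race(wake_full_batch) wakes all 9 sleepers.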