Re: [PATCH] blk-mq: run queue after issuing the last request of the plug list

Ming Lei <ming.lei@xxxxxxxxxx> · Tue, 26 Jul 2022 15:39:52 +0800

On Tue, Jul 26, 2022 at 01:01:41PM +0800, Yufen Yu wrote:
> 
> 
> On 2022/7/26 12:16, Ming Lei wrote:
> > On Tue, Jul 26, 2022 at 11:31:34AM +0800, Yu Kuai wrote:
> > > > On 2022/07/26 11:21, Ming Lei wrote:
> > > > On Tue, Jul 26, 2022 at 11:14:23AM +0800, Yu Kuai wrote:
> > > > > Hi, Ming
> > > > > 
> > > > > > On 2022/07/26 11:02, Ming Lei wrote:
> > > > > > On Tue, Jul 26, 2022 at 10:52:56AM +0800, Yu Kuai wrote:
> > > > > > > Hi, Ming
> > > > > > > > On 2022/07/26 10:32, Ming Lei wrote:
> > > > > > > > On Tue, Jul 26, 2022 at 10:08:13AM +0800, Yu Kuai wrote:
> > > > > > > > > > On 2022/07/26 9:46, Ming Lei wrote:
> > > > > > > > > > On Tue, Jul 26, 2022 at 09:08:19AM +0800, Yu Kuai wrote:
> > > > > > > > > > > Hi, Ming!
> > > > > > > > > > > 
> > > > > > > > > > > > On 2022/07/25 23:43, Ming Lei wrote:
> > > > > > > > > > > > On Sat, Jul 23, 2022 at 10:50:03AM +0800, Yu Kuai wrote:
> > > > > > > > > > > > > Hi, Ming!
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > On 2022/07/19 17:26, Ming Lei wrote:
> > > > > > > > > > > > > > On Mon, Jul 18, 2022 at 08:35:28PM +0800, Yufen Yu wrote:
> > > > > > > > > > > > > > > We do test on a virtio scsi device (/dev/sda) and the default mq
> > > > > > > > > > > > > > > > scheduler is 'none'. We found an IO hang as follows:
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > blk_finish_plug
> > > > > > > > > > > > > > >          blk_mq_plug_issue_direct
> > > > > > > > > > > > > > >              scsi_mq_get_budget
> > > > > > > > > > > > > > >              //get budget_token fail and sdev->restarts=1
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 			     	 scsi_end_request
> > > > > > > > > > > > > > > 				   scsi_run_queue_async
> > > > > > > > > > > > > > >                                           //sdev->restart=0 and run queue
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > >             blk_mq_request_bypass_insert
> > > > > > > > > > > > > > >                //add request to hctx->dispatch list
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Here the issue shouldn't be related with scsi's get budget or
> > > > > > > > > > > > > > scsi_run_queue_async.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > If blk-mq adds request into ->dispatch_list, it is blk-mq core's
> > > > > > > > > > > > > > responsibility to re-run queue for moving on. Can you investigate a
> > > > > > > > > > > > > > bit more why blk-mq doesn't run queue after adding request to
> > > > > > > > > > > > > > hctx dispatch list?
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > I think Yufen is probably thinking about the following concurrent
> > > > > > > > > > > > > > scenario:
> > > > > > > > > > > > > 
> > > > > > > > > > > > > blk_mq_flush_plug_list
> > > > > > > > > > > > > # assume there are three rq
> > > > > > > > > > > > >        blk_mq_plug_issue_direct
> > > > > > > > > > > > >         blk_mq_request_issue_directly
> > > > > > > > > > > > >         # dispatch rq1, succeed
> > > > > > > > > > > > >         blk_mq_request_issue_directly
> > > > > > > > > > > > >         # dispatch rq2
> > > > > > > > > > > > >          __blk_mq_try_issue_directly
> > > > > > > > > > > > >           blk_mq_get_dispatch_budget
> > > > > > > > > > > > >            scsi_mq_get_budget
> > > > > > > > > > > > >             atomic_inc(&sdev->restarts);
> > > > > > > > > > > > >             # rq2 failed to get budget
> > > > > > > > > > > > >             # restarts is 1 now
> > > > > > > > > > > > >                                               scsi_end_request
> > > > > > > > > > > > >                                               # rq1 is completed
> > > > > > > > > > > > >                                               ┊scsi_run_queue_async
> > > > > > > > > > > > >                                               ┊ atomic_cmpxchg(&sdev->restarts,
> > > > > > > > > > > > > old, 0) == old
> > > > > > > > > > > > >                                               ┊ # set restarts to 0
> > > > > > > > > > > > >                                               ┊ blk_mq_run_hw_queues
> > > > > > > > > > > > >                                               ┊ # hctx->dispatch list is empty
> > > > > > > > > > > > >         blk_mq_request_bypass_insert
> > > > > > > > > > > > >         # insert rq2 to hctx->dispatch list
> > > > > > > > > > > > 
> > > > > > > > > > > > After rq2 is added to ->dispatch_list in blk_mq_try_issue_list_directly(),
> > > > > > > > > > > > > whether the list is empty or not, the queue will be run either from
> > > > > > > > > > > > blk_mq_request_bypass_insert() or blk_mq_sched_insert_requests().
> > > > > > > > > > > 
> > > > > > > > > > > 1) while inserting rq2 to dispatch list, blk_mq_request_bypass_insert()
> > > > > > > > > > > is called from blk_mq_try_issue_list_directly(), list_empty() won't
> > > > > > > > > > > > pass, thus blk_mq_request_bypass_insert() won't run queue.
> > > > > > > > > > 
> > > > > > > > > > Yeah, but in blk_mq_try_issue_list_directly() after rq2 is inserted to dispatch
> > > > > > > > > > list, the loop is broken and blk_mq_try_issue_list_directly() returns to
> > > > > > > > > > blk_mq_sched_insert_requests() in which list_empty() is false, so
> > > > > > > > > > blk_mq_insert_requests() and blk_mq_run_hw_queue() are called, queue
> > > > > > > > > > is still run.
> > > > > > > > > > 
> > > > > > > > > > Also not sure why you make rq3 involved, since the list is local list on
> > > > > > > > > > stack, and it can be operated concurrently.
> > > > > > > > > 
> > > > > > > > > I make rq3 involved because there are some conditions that
> > > > > > > > > blk_mq_insert_requests() and blk_mq_run_hw_queue() won't be called from
> > > > > > > > > blk_mq_sched_insert_requests():
> > > > > > > > 
> > > > > > > > The two won't be called if list_empty() is true, and will be called if
> > > > > > > > !list_empty().
> > > > > > > > 
> > > > > > > > That is why I mentioned run queue has been done after rq2 is added to
> > > > > > > > ->dispatch_list.
> > > > > > > 
> > > > > > > I don't follow here, it's right after rq2 is inserted to dispatch list,
> > > > > > > list is not empty, and blk_mq_sched_insert_requests() will be called.
> > > > > > > However, do you think that it's impossible that
> > > > > > > blk_mq_sched_insert_requests() can dispatch rq in the list and list
> > > > > > > will become empty?
> > > > > > 
> > > > > > Please take a look at blk_mq_sched_insert_requests().
> > > > > > 
> > > > > > > When the code runs into blk_mq_sched_insert_requests(), the following
> > > > > > blk_mq_run_hw_queue() will be run always, how does list empty or not
> > > > > > make a difference there?
> > > > > 
> > > > > > This is strange: always calling blk_mq_run_hw_queue() is exactly what
> > > > > > Yufen tries to do in this patch. Are we looking at different code?
> > > > 
> > > > No.
> > > > 
> > > > > 
> > > > > I'm copying blk_mq_sched_insert_requests() here, the code is from
> > > > > latest linux-next:
> > > > > 
> > > > > 461 void blk_mq_sched_insert_requests(struct blk_mq_hw_ctx *hctx,
> > > > > 462                                 ┊ struct blk_mq_ctx *ctx,
> > > > > 463                                 ┊ struct list_head *list, bool
> > > > > run_queue_async)
> > > > > 464 {
> > > > > 465         struct elevator_queue *e;
> > > > > 466         struct request_queue *q = hctx->queue;
> > > > > 467
> > > > > 468         /*
> > > > > 469         ┊* blk_mq_sched_insert_requests() is called from flush plug
> > > > > 470         ┊* context only, and hold one usage counter to prevent queue
> > > > > 471         ┊* from being released.
> > > > > 472         ┊*/
> > > > > 473         percpu_ref_get(&q->q_usage_counter);
> > > > > 474
> > > > > 475         e = hctx->queue->elevator;
> > > > > 476         if (e) {
> > > > > 477                 e->type->ops.insert_requests(hctx, list, false);
> > > > > 478         } else {
> > > > > 479                 /*
> > > > > 480                 ┊* try to issue requests directly if the hw queue isn't
> > > > > 481                 ┊* busy in case of 'none' scheduler, and this way may
> > > > > save
> > > > > 482                 ┊* us one extra enqueue & dequeue to sw queue.
> > > > > 483                 ┊*/
> > > > > 484                 if (!hctx->dispatch_busy && !run_queue_async) {
> > > > > 485                         blk_mq_run_dispatch_ops(hctx->queue,
> > > > > 486                                 blk_mq_try_issue_list_directly(hctx,
> > > > > list));
> > > > > 487                         if (list_empty(list))
> > > > > 488                                 goto out;
> > > > > 489                 }
> > > > > 490                 blk_mq_insert_requests(hctx, ctx, list);
> > > > > 491         }
> > > > > 492
> > > > > 493         blk_mq_run_hw_queue(hctx, run_queue_async);
> > > > > 494  out:
> > > > > 495         percpu_ref_put(&q->q_usage_counter);
> > > > > 496 }
> > > > > 
> > > > > Here in line 487, if list_empty() is true, out label will skip
> > > > > run_queue().
> > > > 
> > > > If list_empty() is true, run queue is guaranteed to run
> > > > in blk_mq_try_issue_list_directly() in case that BLK_STS_*RESOURCE
> > > > is returned from blk_mq_request_issue_directly().
> > > > 
> > > > 		ret = blk_mq_request_issue_directly(rq, list_empty(list));
> > > > 		if (ret != BLK_STS_OK) {
> > > > 			if (ret == BLK_STS_RESOURCE ||
> > > > 					ret == BLK_STS_DEV_RESOURCE) {
> > > > 				blk_mq_request_bypass_insert(rq, false,
> > > > 							list_empty(list));	//run queue
> > > > 				break;
> > > > 			}
> > > > 			blk_mq_end_request(rq, ret);
> > > > 			errors++;
> > > > 		} else
> > > > 			queued++;
> > > > 
> > > > So why do you try to add one extra run queue?
> > > 
> > > Hi, Ming
> > > 
> > > > Perhaps I didn't explain the scenario clearly. Please note that a list
> > > > containing three rqs is required.
> > > 
> > > > 1) rq1 is dispatched successfully
> > > 2) rq2 failed to dispatch due to no budget, in this case
> > >     - rq2 will insert to dispatch list
> > > >     - list is not empty yet, so run queue won't be called
> > 
> > In the case, blk_mq_try_issue_list_directly() returns to
> > blk_mq_sched_insert_requests() immediately, then blk_mq_insert_requests()
> > and blk_mq_run_hw_queue() will be run from blk_mq_sched_insert_requests()
> > because the list isn't empty.
> > 
> > Right?
> > 
> 
> hi Ming,
> 
> Here rq2 fails in blk_mq_plug_issue_direct() in blk_mq_flush_plug_list(),
> not in blk_mq_sched_insert_requests().

OK, just wondering why Yufen's patch touches
blk_mq_sched_insert_requests().

Here the issue is in blk_mq_plug_issue_direct() itself: it is wrong to use the
last request of the plug list to decide whether a queue run is needed, since all
the remaining requests in the plug list may be from other hctxs. The simplest fix
could be to always pass run_queue as true to blk_mq_request_bypass_insert().
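
To illustrate the point, here is a rough userspace toy model, not kernel code.
All toy_* names are made up for this sketch. It ignores the race with
scsi_run_queue_async() and simply assumes that nothing else reruns the queue
after the request is parked. It also stops at the first budget failure and
leaves whatever happens to the rest of the plug list outside the model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a hardware queue context. */
struct toy_hctx {
	int budget;	/* requests the device can still accept */
	int parked;	/* requests sitting on the dispatch list */
	int runs;	/* queue runs requested when parking a request */
};

/* Hypothetical stand-in for a request bound to one hctx. */
struct toy_rq {
	struct toy_hctx *hctx;
};

/* Park the request; only ask for a queue run if the caller says so. */
static void toy_bypass_insert(struct toy_rq *rq, bool run_queue)
{
	rq->hctx->parked++;
	if (run_queue)
		rq->hctx->runs++;
}

/*
 * Issue the plug list directly. On the first budget failure the request
 * is parked and the loop stops. Returns the index of the first request
 * that was not looked at (n if the whole list was consumed).
 * With always_run == false, the run decision depends on "last", i.e.
 * on whether the plug list is empty, which is the behaviour questioned
 * above.
 */
static size_t toy_plug_issue_direct(struct toy_rq *rqs, size_t n,
				    bool always_run)
{
	for (size_t i = 0; i < n; i++) {
		bool last = (i + 1 == n);

		if (rqs[i].hctx->budget > 0) {
			rqs[i].hctx->budget--;	/* issued successfully */
			continue;
		}
		toy_bypass_insert(&rqs[i], always_run || last);
		return i + 1;
	}
	return n;
}

/* A parked request that nobody asked to run the queue for is stuck. */
static bool toy_hctx_stalled(const struct toy_hctx *h)
{
	return h->parked > 0 && h->runs == 0;
}
```

With a three-request list where rq1 and rq2 target one hctx that has budget
for a single request and rq3 targets another hctx, always_run == false leaves
the first hctx stalled in this model: rq2 is not the last entry of the plug
list, yet the only remaining entry belongs to a different hctx. Passing
always_run == true requests the run unconditionally.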

Thanks,
Ming