On Mon, Apr 29, 2019 at 11:09:39AM -0700, Bart Van Assche wrote: > On Sun, 2019-04-28 at 16:14 +0800, Ming Lei wrote: > > Just like aio/io_uring, we need to grab 2 refcount for queuing one > > request, one is for submission, another is for completion. > > > > If the request isn't queued from plug code path, the refcount grabbed > > in generic_make_request() serves for submission. In theroy, this > > refcount should have been released after the sumission(async run queue) > > is done. blk_freeze_queue() works with blk_sync_queue() together > > for avoiding race between cleanup queue and IO submission, given async > > run queue activities are canceled because hctx->run_work is scheduled with > > the refcount held, so it is fine to not hold the refcount when > > running the run queue work function for dispatch IO. > > > > However, if request is staggered into plug list, and finally queued > > from plug code path, the refcount in submission side is actually missed. > > And we may start to run queue after queue is removed because the queue's > > kobject refcount isn't guaranteed to be grabbed in flushing plug list > > context, then kernel oops is triggered, see the following race: > > > > blk_mq_flush_plug_list(): > > blk_mq_sched_insert_requests() > > insert requests to sw queue or scheduler queue > > blk_mq_run_hw_queue > > > > Because of concurrent run queue, all requests inserted above may be > > completed before calling the above blk_mq_run_hw_queue. Then queue can > > be freed during the above blk_mq_run_hw_queue(). > > > > Fixes the issue by grab .q_usage_counter before calling > > blk_mq_sched_insert_requests() in blk_mq_flush_plug_list(). This way is > > safe because the queue is absolutely alive before inserting request. > > > > Cc: Dongli Zhang <dongli.zhang@xxxxxxxxxx> > > Cc: James Smart <james.smart@xxxxxxxxxxxx> > > Cc: Bart Van Assche <bart.vanassche@xxxxxxx> > > Cc: linux-scsi@xxxxxxxxxxxxxxx, > > Cc: Martin K . Petersen <martin.petersen@xxxxxxxxxx>, > > Cc: Christoph Hellwig <hch@xxxxxx>, > > Cc: James E . J . Bottomley <jejb@xxxxxxxxxxxxxxxxxx>, > > Reviewed-by: Bart Van Assche <bvanassche@xxxxxxx> > > Reviewed-by: Johannes Thumshirn <jthumshirn@xxxxxxx> > > Reviewed-by: Hannes Reinecke <hare@xxxxxxxx> > > Tested-by: James Smart <james.smart@xxxxxxxxxxxx> > > Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx> > > I added my "Reviewed-by" to a previous version of this patch but not > to this version of this patch. Several "Reviewed-by" tags probably > should be removed. Fine. > > > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c > > index aa6bc5c02643..dfe83e7935d6 100644 > > --- a/block/blk-mq-sched.c > > +++ b/block/blk-mq-sched.c > > @@ -414,6 +414,13 @@ void blk_mq_sched_insert_requests(struct blk_mq_hw_ctx *hctx, > > { > > struct elevator_queue *e; > > > > + /* > > + * blk_mq_sched_insert_requests() is called from flush plug > > + * context only, and hold one usage counter to prevent queue > > + * from being released. > > + */ > > + percpu_ref_get(&hctx->queue->q_usage_counter); > > + > > e = hctx->queue->elevator; > > if (e && e->type->ops.insert_requests) > > e->type->ops.insert_requests(hctx, list, false); > > @@ -432,6 +439,8 @@ void blk_mq_sched_insert_requests(struct blk_mq_hw_ctx *hctx, > > } > > > > blk_mq_run_hw_queue(hctx, run_queue_async); > > + > > + percpu_ref_put(&hctx->queue->q_usage_counter); > > } > > I think that 'hctx' can disappear if all requests queued by this function > finish just before blk_mq_run_hw_queue() returns and if the number of hardware > queues is changed from another thread. updating nr_hw_queues needs to freeze queues in the same tagset first, and the added percpu_ref_get() will prevent that from happening. > Shouldn't the request queue pointer be > stored in a local variable instead of reading hctx->queue twice? Yeah, it is much cleaner, will do it in V9. BTW, one big problem in this patch(V8) is that the usage counter isn't put in the early return branch. thanks, Ming