Re: [PATCH 0/5] blk-mq: allow to run queue if queue refcount is held

On Wed, Apr 03, 2019 at 04:29:41PM +0800, Ming Lei wrote:
> On Wed, Apr 03, 2019 at 11:20:53AM +0800, Ming Lei wrote:
> > On Tue, Apr 02, 2019 at 10:53:00AM -0700, Bart Van Assche wrote:
> > > On Tue, 2019-04-02 at 19:05 +0800, Ming Lei wrote:
> > > > On Tue, Apr 02, 2019 at 04:07:04PM +0800, jianchao.wang wrote:
> > > > > percpu_ref is born for the fast path.
> > > > > Some drivers use it in the completion path, such as scsi; does it really
> > > > > matter for that kind of device? If yes, I guess we should remove
> > > > > blk_mq_run_hw_queues, which is the real bulk of the cost, and depend on the
> > > > > hctx restart mechanism.
> > > > 
> > > > Yes, it is designed for the fast path, but that doesn't mean percpu_ref
> > > > has no cost. blk_mq_run_hw_queues() is called for all blk-mq devices,
> > > > including fast NVMe ones.
> > > 
> > > I think the overhead of adding a percpu_ref_get/put pair is acceptable for
> > > SCSI drivers. The NVMe driver doesn't call blk_mq_run_hw_queues() directly.
> > > Additionally, I don't think that any of the blk_mq_run_hw_queues() calls from
> > > the block layer matter for the fast path code in the NVMe driver. In other
> > > words, adding a percpu_ref_get/put pair in blk_mq_run_hw_queues() shouldn't
> > > affect the performance of the NVMe driver.
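A minimal sketch of what that get/put pairing could look like, reusing the
existing q->q_usage_counter percpu_ref (illustration only, not an actual
patch):

void blk_mq_run_hw_queues(struct request_queue *q, bool async)
{
        struct blk_mq_hw_ctx *hctx;
        int i;

        /* bail out if the queue is already being torn down */
        if (!percpu_ref_tryget(&q->q_usage_counter))
                return;

        queue_for_each_hw_ctx(q, hctx, i) {
                if (blk_mq_hctx_stopped(hctx))
                        continue;
                blk_mq_run_hw_queue(hctx, async);
        }

        percpu_ref_put(&q->q_usage_counter);
}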
> > 
> > But that can be avoided easily and cleanly, so why abuse it for protecting
> > the hctx?
> > 
> > > 
> > > > Also:
> > > > 
> > > > It may not be enough to grab the percpu_ref for blk_mq_run_hw_queues
> > > > only, given that the idea is to use the percpu_ref to protect the hctx's
> > > > resources.
> > > > 
> > > > There are lots of users of 'hctx', such as other exported blk-mq APIs.
> > > > If this approach were chosen, we might have to audit the other blk-mq APIs
> > > > too, because they might be called after the queue is frozen.
> > > 
> > > The only blk_mq_hw_ctx user I have found so far that needs additional
> > > protection is the q->mq_ops->poll() call in blk_poll(). However, that is not
> > > a new issue. Functions like nvme_poll() access data structures (NVMe
> > > completion queue) that shouldn't be accessed while blk_cleanup_queue() is in
> > > progress. If blk_poll() is modified such that it becomes safe to call that
> > > function while blk_cleanup_queue() is in progress then blk_poll() won't
> > > access any hardware queue that it shouldn't access.
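If blk_poll() ever needed the same kind of protection, one way to express it
would be to hold a q_usage_counter reference across the poll, e.g. via a
hypothetical wrapper like the sketch below (blk_poll_guarded is a made-up
name, for illustration only):

static int blk_poll_guarded(struct request_queue *q, blk_qc_t cookie, bool spin)
{
        int ret;

        /* don't touch hctx/completion-queue state once cleanup has started */
        if (!percpu_ref_tryget(&q->q_usage_counter))
                return 0;

        ret = blk_poll(q, cookie, spin);

        percpu_ref_put(&q->q_usage_counter);
        return ret;
}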
> > 
> > There can be lots of cases like that:
> > 
> > 1) blk_mq_run_hw_queue() from blk_mq_flush_plug_list()
> > 
> > - requests can complete just after being added to the ctx queue or scheduler
> > queue, because there can be a concurrent run queue; queue freezing may then
> > finish
> > 
> > - the following blk_mq_run_hw_queue() in blk_mq_sched_insert_requests() may
> > then see freed hctx fields
> 
> Actually this one is a blk-mq internal race, and the queue's refcount isn't
> guaranteed to be held when blk_mq_run_hw_queue() is called.
> 
> We might have to address this one by grabbing .q_usage_counter in
> blk_mq_sched_insert_requests(), just like commit 8dc765d438f1 ("SCSI: fix
> queue cleanup race before queue initialization is done") does, but I really
> want to avoid that.
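For reference, applying the pattern of that commit to
blk_mq_sched_insert_requests() would look roughly like the sketch below (the
actual insert logic is elided; this is not a tested patch):

void blk_mq_sched_insert_requests(struct blk_mq_hw_ctx *hctx,
                                  struct blk_mq_ctx *ctx,
                                  struct list_head *list,
                                  bool run_queue_async)
{
        struct request_queue *q = hctx->queue;

        /*
         * The requests may complete (and drop their queue references) as
         * soon as they are queued, so pin the queue until the run-queue
         * call below has returned.
         */
        percpu_ref_get(&q->q_usage_counter);

        /* ... existing insert into scheduler/ctx queues ... */

        blk_mq_run_hw_queue(hctx, run_queue_async);

        percpu_ref_put(&q->q_usage_counter);
}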

The issue is very similar to the kiocb refcount in aio/io_uring: we might
need to grab two references to handle one request, one for the submission
side and another for the completion side. The problem now is that we don't
grab a ref on the submission side.
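
A rough illustration of that two-reference pattern (hypothetical names, not
actual aio/io_uring code):

struct sketch_req {
        refcount_t ref;                 /* pins the request */
};

static void sketch_req_put(struct sketch_req *req)
{
        /* whichever side drops the last reference frees the request */
        if (refcount_dec_and_test(&req->ref))
                kfree(req);
}

static void sketch_submit(struct sketch_req *req)
{
        /* one reference for the submission side, one for completion */
        refcount_set(&req->ref, 2);

        /* ... issue the request to the driver ... */

        sketch_req_put(req);            /* submission side is done */
}

static void sketch_complete(struct sketch_req *req)
{
        /* ... finish the request ... */

        sketch_req_put(req);            /* completion side is done */
}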

Thanks,
Ming


