On Tue, Apr 02, 2019 at 10:53:00AM -0700, Bart Van Assche wrote:
> On Tue, 2019-04-02 at 19:05 +0800, Ming Lei wrote:
> > On Tue, Apr 02, 2019 at 04:07:04PM +0800, jianchao.wang wrote:
> > > percpu_ref is born for the fast path.
> > > There are some drivers that use it in the completion path, such as
> > > scsi. Does it really matter for this kind of device? If yes, I guess
> > > we should remove blk_mq_run_hw_queues(), which is the really bulky
> > > part, and depend on the hctx restart mechanism.
> > 
> > Yes, it is designed for the fast path, but that doesn't mean percpu_ref
> > doesn't have any cost. blk_mq_run_hw_queues() is called for all blk-mq
> > devices, including fast NVMe ones.
> 
> I think the overhead of adding a percpu_ref_get/put pair is acceptable for
> SCSI drivers. The NVMe driver doesn't call blk_mq_run_hw_queues() directly.
> Additionally, I don't think that any of the blk_mq_run_hw_queues() calls
> from the block layer matter for the fast path code in the NVMe driver. In
> other words, adding a percpu_ref_get/put pair in blk_mq_run_hw_queues()
> shouldn't affect the performance of the NVMe driver.

But it can be avoided easily and cleanly, so why abuse it for protecting
hctx? (A sketch of that get/put pairing is appended at the end of this mail,
for reference.)

> 
> > Also:
> > 
> > It may not be enough to just grab the percpu_ref for
> > blk_mq_run_hw_queues() only, given the idea is to use the percpu_ref to
> > protect hctx's resources.
> > 
> > There are lots of uses of 'hctx', such as other exported blk-mq APIs.
> > If this approach were chosen, we might have to audit the other blk-mq
> > APIs, because they might be called after the queue is frozen too.
> 
> The only blk_mq_hw_ctx user I have found so far that needs additional
> protection is the q->mq_ops->poll() call in blk_poll(). However, that is
> not a new issue. Functions like nvme_poll() access data structures (the
> NVMe completion queue) that shouldn't be accessed while blk_cleanup_queue()
> is in progress. If blk_poll() is modified such that it becomes safe to call
> that function while blk_cleanup_queue() is in progress, then blk_poll()
> won't access any hardware queue that it shouldn't access.

There can be lots of such cases:

1) blk_mq_run_hw_queue() from blk_mq_flush_plug_list()
- requests can be completed just after being added to the ctx queue or the
  scheduler queue, because there can be a concurrent run queue; then queue
  freezing may be done
- then the following blk_mq_run_hw_queue() in blk_mq_sched_insert_requests()
  may see freed hctx fields

2) blk_mq_delay_run_hw_queue()
- what if it is called after blk_sync_queue() is done in blk_cleanup_queue()?
- but the caller follows the old rule by holding the request queue's refcount

3) blk_mq_quiesce_queue()
- called after blk_mq_free_queue() is done, then use-after-free on hctx->srcu
- but the caller follows the old rule by holding the request queue's ...

(A sketch of such an "old rule" caller is appended at the end of this mail.)

Thanks,
Ming
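
P.S. To make the percpu_ref_get/put pairing discussed above concrete, here is
a minimal sketch of what it could look like inside blk_mq_run_hw_queues() in
block/blk-mq.c. It assumes q->q_usage_counter is the percpu_ref that would
guard the hctx resources; that choice, and the exact placement, are only
assumptions for illustration, not Bart's actual proposal or a tested patch:

void blk_mq_run_hw_queues(struct request_queue *q, bool async)
{
	struct blk_mq_hw_ctx *hctx;
	int i;

	/* assumption: skip running the hw queues once the ref is dead */
	if (!percpu_ref_tryget(&q->q_usage_counter))
		return;

	queue_for_each_hw_ctx(q, hctx, i) {
		if (blk_mq_hctx_stopped(hctx))
			continue;
		blk_mq_run_hw_queue(hctx, async);
	}

	/* drop the ref once all hw queues have been kicked */
	percpu_ref_put(&q->q_usage_counter);
}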
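
P.P.S. For cases 2) and 3) above, an "old rule" caller would look roughly
like the hypothetical sketch below; example_quiesce_user() is a made-up name,
only blk_get_queue()/blk_put_queue() and blk_mq_quiesce_queue()/
blk_mq_unquiesce_queue() are real APIs. Pinning the queue's refcount keeps
the request_queue object itself alive, but it doesn't stop blk_cleanup_queue()
from tearing down hctx resources such as hctx->srcu underneath the caller:

/* hypothetical caller, not taken from any in-tree driver */
static void example_quiesce_user(struct request_queue *q)
{
	/* old rule: hold the request queue's refcount while using it */
	if (!blk_get_queue(q))
		return;

	/* may run after blk_mq_free_queue(): use-after-free on hctx->srcu */
	blk_mq_quiesce_queue(q);

	/* ... access the quiesced queue ... */

	blk_mq_unquiesce_queue(q);
	blk_put_queue(q);
}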