On Wed, Apr 03, 2019 at 11:20:53AM +0800, Ming Lei wrote: > On Tue, Apr 02, 2019 at 10:53:00AM -0700, Bart Van Assche wrote: > > On Tue, 2019-04-02 at 19:05 +0800, Ming Lei wrote: > > > On Tue, Apr 02, 2019 at 04:07:04PM +0800, jianchao.wang wrote: > > > > percpu_ref is born for fast path. > > > > There are some drivers use it in completion path, such as scsi, does it really > > > > matter for this kind of device ? If yes, I guess we should remove blk_mq_run_hw_queues > > > > which is the really bulk and depend on hctx restart mechanism. > > > > > > Yes, it is designed for fast path, but it doesn't mean percpu_ref > > > hasn't any cost. blk_mq_run_hw_queues() is called for all blk-mq devices, > > > includes the fast NVMe. > > > > I think the overhead of adding a percpu_ref_get/put pair is acceptable for > > SCSI drivers. The NVMe driver doesn't call blk_mq_run_hw_queues() directly. > > Additionally, I don't think that any of the blk_mq_run_hw_queues() calls from > > the block layer matter for the fast path code in the NVMe driver. In other > > words, adding a percpu_ref_get/put pair in blk_mq_run_hw_queues() shouldn't > > affect the performance of the NVMe driver. > > But it can be avoided easily and cleanly, why abuse it for protecting hctx? > > > > > > Also: > > > > > > It may not be enough to just grab the percpu_ref for blk_mq_run_hw_queues > > > only, given the idea is to use the percpu_ref to protect hctx's resources. > > > > > > There are lots of uses on 'hctx', such as other exported blk-mq APIs. > > > If this approach were chosen, we may have to audit other blk-mq APIs, > > > cause they might be called after queue is frozen too. > > > > The only blk_mq_hw_ctx user I have found so far that needs additional > > protection is the q->mq_ops->poll() call in blk_poll(). However, that is not > > a new issue. Functions like nvme_poll() access data structures (NVMe > > completion queue) that shouldn't be accessed while blk_cleanup_queue() is in > > progress. If blk_poll() is modified such that it becomes safe to call that > > function while blk_cleanup_queue() is in progress then blk_poll() won't > > access any hardware queue that it shouldn't access. > > There can be lots of such case: > > 1) blk_mq_run_hw_queue() from blk_mq_flush_plug_list() > - requests can be completed just after added to ctx queue or scheduler queue > becasue there can be concurrent run queue, then queue freezing may be done > > - then the following blk_mq_run_hw_queue() in blk_mq_sched_insert_requests() > may see freed hctx fields Actually this one is blk-mq internal race, and queue's refcount isn't guaranteed to be held when blk_mq_run_hw_queue is called. We might have to address this one by grabbing .q_usage_count in blk_mq_sched_insert_requests just like commit 8dc765d438f1 ("SCSI: fix queue cleanup race before queue initialization is done"), but I do want to avoid it. Thanks, Ming