On 4/1/19 3:15 PM, Ming Lei wrote: > On Mon, Apr 1, 2019 at 1:27 PM Dongli Zhang <dongli.zhang@xxxxxxxxxx> wrote: >> >> >> >> On 4/1/19 1:16 PM, Ming Lei wrote: >>> Hi Dongli, >>> >>> On Mon, Apr 01, 2019 at 01:05:46PM +0800, Dongli Zhang wrote: >>>> >>>> >>>> On 4/1/19 10:52 AM, Ming Lei wrote: >>>>> On Sun, Mar 31, 2019 at 07:39:17PM -0700, Bart Van Assche wrote: >>>>>> On 3/31/19 7:00 PM, Ming Lei wrote: >>>>>>> On Sun, Mar 31, 2019 at 08:27:35AM -0700, Bart Van Assche wrote: >>>>>>>> I'm not sure the approach of this patch series is really the direction we >>>>>>>> should pursue. There are many block driver that free resources immediately >>>>>>> >>>>>>> Please see scsi_run_queue(), and the queue refcount is always held >>>>>>> before run queue. >>>>>> >>>>>> That's not correct. There is no guarantee that q->q_usage_counter > 0 when >>>>>> scsi_run_queue() is called from inside scsi_requeue_run_queue(). >>>>> >>>>> We don't need the guarantee of 'q->q_usage_counter > 0', I mean the >>>>> queue's kobj reference counter. >>>>> >>>>> What we need is to allow run queue to work correctly after queue is frozen >>>>> or cleaned up. >>>>> >>>>>> >>>>>>>> I'd like to avoid having to modify all block drivers that free resources >>>>>>>> immediately after blk_cleanup_queue() has returned. Have you considered to >>>>>>>> modify blk_mq_run_hw_queues() such that it becomes safe to call that >>>>>>>> function while blk_cleanup_queue() is in progress, e.g. by inserting a >>>>>>>> percpu_ref_tryget_live(&q->q_usage_counter) / >>>>>>>> percpu_ref_put(&q->q_usage_counter) pair? >>>>>>> >>>>>>> It can't work because blk_mq_run_hw_queues may happen after >>>>>>> percpu_ref_exit() is done. >>>>>>> >>>>>>> However, if we move percpu_ref_exit() into queue's release handler, we >>>>>>> don't need to grab q->q_usage_counter any more in blk_mq_run_hw_queues(), >>>>>>> and we still have to free hw queue resources in queue's release handler, >>>>>>> that is exactly what this patchset is doing. >>>>>>> >>>>>>> In short, getting q->q_usage_counter doesn't make a difference on this >>>>>>> issue. >>>>>> >>>>>> percpu_ref_tryget_live() fails if a per-cpu counter is in the "dead" state. >>>>>> percpu_ref_kill() changes the state of a per-cpu counter to the "dead" >>>>>> state. blk_freeze_queue_start() calls percpu_ref_kill(). blk_cleanup_queue() >>>>>> already calls blk_set_queue_dying() and that last function calls >>>>>> blk_freeze_queue_start(). So I think that what you wrote is not correct and >>>>>> that inserting a percpu_ref_tryget_live()/percpu_ref_put() pair in >>>>>> blk_mq_run_hw_queues() or blk_mq_run_hw_queue() would make a difference and >>>>>> also that moving the percpu_ref_exit() call into blk_release_queue() makes >>>>>> sense. >>>>> >>>>> If percpu_ref_exit() is moved to blk_release_queue(), we still need to >>>>> move freeing of hw queue's resource into blk_release_queue() like what >>>>> the patchset is doing. >>>> >>>> Hi Ming, >>>> >>>> Would you mind help explain why we still need to move freeing of hw queue's >>>> resource into blk_release_queue() like what the patchset is doing? >>>> >>>> Let's assume there is no deadlock when percpu_ref_tryget_live() is used, >>> >>> Could you explain why the assumption is true? >>> >>> We have to run queue after starting to freeze queue for draining >>> allocated requests and making forward progress. Inside blk_freeze_queue_start(), >>> percpu_ref_kill() marks this ref as DEAD, then percpu_ref_tryget_live() returns >>> false, then queue won't be run. >> >> Hi Ming, >> >> I understand the assumption is invalid and there is issue when using >> percpu_ref_tryget_live. And I also understand we have to run queue after >> starting to freeze queue for draining allocated requests and making forward >> progress. > > OK. > >> >> >> I am just wondering specifically on why "If percpu_ref_exit() is moved to >> blk_release_queue(), we still need to move freeing of hw queue's resource into >> blk_release_queue() like what the patchset is doing." based on below Bart's >> statement: >> >> "percpu_ref_tryget_live() fails if a per-cpu counter is in the "dead" state. >> percpu_ref_kill() changes the state of a per-cpu counter to the "dead" state. >> blk_freeze_queue_start() calls percpu_ref_kill(). blk_cleanup_queue() already >> calls blk_set_queue_dying() and that last function call >> blk_freeze_queue_start(). So I think that what you wrote is not correct and that >> inserting a percpu_ref_tryget_live()/percpu_ref_put() pair in >> blk_mq_run_hw_queues() or blk_mq_run_hw_queue() would make a difference and also >> that moving the percpu_ref_exit() call into blk_release_queue() makes sense." > > As you mentioned, percpu_ref_tryget_live() can't be used to avoid run queue > during cleanup, then run queue can come when queue is cleaned up. > >> >> That's is, what is penalty if we do not move freeing of hw queue's resource >> into blk_release_queue() like what the patchset is doing in above situation? > > kernel oops as reported by James, because some fields of hctx will be used > by run queue, and they can be freed by blk_mq_free_queue() in > blk_cleanup_queue(). > >> >> I ask this question just because I would like to better understand the source >> code. Does "hw queue's resource" indicate the below? >> >> + if (hctx->flags & BLK_MQ_F_BLOCKING) >> + cleanup_srcu_struct(hctx->srcu); >> + blk_free_flush_queue(hctx->fq); >> + sbitmap_free(&hctx->ctx_map); > > Right. Hi Ming, Thank you very much for the detailed explanation. I think maybe I misunderstood your message in the email. In another direction posted by Bart as below, regardless about which direction is better, that implementation does not move freeing of hw queue's resource into blk_release_queue(), although percpu_ref_exit() is moved to blk_release_queue(). That's why I would like to confirm. https://lore.kernel.org/linux-block/20190401212014.192753-1-bvanassche@xxxxxxx/ In that direction, the more friendly percpu_ref_tryget(), which is suggested by Jianchao, is used. I would like just to confirm that there is no need to move freeing of hw queue's resource into blk_release_queue() when the get/put method is friendly and fair enough. I think I should not just focus on only part of your message. Sorry for spamming you. Dongli Zhang