On Tue, Apr 2, 2019 at 10:06 AM Dongli Zhang <dongli.zhang@xxxxxxxxxx> wrote: > > > > On 4/1/19 3:15 PM, Ming Lei wrote: > > On Mon, Apr 1, 2019 at 1:27 PM Dongli Zhang <dongli.zhang@xxxxxxxxxx> wrote: > >> > >> > >> > >> On 4/1/19 1:16 PM, Ming Lei wrote: > >>> Hi Dongli, > >>> > >>> On Mon, Apr 01, 2019 at 01:05:46PM +0800, Dongli Zhang wrote: > >>>> > >>>> > >>>> On 4/1/19 10:52 AM, Ming Lei wrote: > >>>>> On Sun, Mar 31, 2019 at 07:39:17PM -0700, Bart Van Assche wrote: > >>>>>> On 3/31/19 7:00 PM, Ming Lei wrote: > >>>>>>> On Sun, Mar 31, 2019 at 08:27:35AM -0700, Bart Van Assche wrote: > >>>>>>>> I'm not sure the approach of this patch series is really the direction we > >>>>>>>> should pursue. There are many block driver that free resources immediately > >>>>>>> > >>>>>>> Please see scsi_run_queue(), and the queue refcount is always held > >>>>>>> before run queue. > >>>>>> > >>>>>> That's not correct. There is no guarantee that q->q_usage_counter > 0 when > >>>>>> scsi_run_queue() is called from inside scsi_requeue_run_queue(). > >>>>> > >>>>> We don't need the guarantee of 'q->q_usage_counter > 0', I mean the > >>>>> queue's kobj reference counter. > >>>>> > >>>>> What we need is to allow run queue to work correctly after queue is frozen > >>>>> or cleaned up. > >>>>> > >>>>>> > >>>>>>>> I'd like to avoid having to modify all block drivers that free resources > >>>>>>>> immediately after blk_cleanup_queue() has returned. Have you considered to > >>>>>>>> modify blk_mq_run_hw_queues() such that it becomes safe to call that > >>>>>>>> function while blk_cleanup_queue() is in progress, e.g. by inserting a > >>>>>>>> percpu_ref_tryget_live(&q->q_usage_counter) / > >>>>>>>> percpu_ref_put(&q->q_usage_counter) pair? > >>>>>>> > >>>>>>> It can't work because blk_mq_run_hw_queues may happen after > >>>>>>> percpu_ref_exit() is done. > >>>>>>> > >>>>>>> However, if we move percpu_ref_exit() into queue's release handler, we > >>>>>>> don't need to grab q->q_usage_counter any more in blk_mq_run_hw_queues(), > >>>>>>> and we still have to free hw queue resources in queue's release handler, > >>>>>>> that is exactly what this patchset is doing. > >>>>>>> > >>>>>>> In short, getting q->q_usage_counter doesn't make a difference on this > >>>>>>> issue. > >>>>>> > >>>>>> percpu_ref_tryget_live() fails if a per-cpu counter is in the "dead" state. > >>>>>> percpu_ref_kill() changes the state of a per-cpu counter to the "dead" > >>>>>> state. blk_freeze_queue_start() calls percpu_ref_kill(). blk_cleanup_queue() > >>>>>> already calls blk_set_queue_dying() and that last function calls > >>>>>> blk_freeze_queue_start(). So I think that what you wrote is not correct and > >>>>>> that inserting a percpu_ref_tryget_live()/percpu_ref_put() pair in > >>>>>> blk_mq_run_hw_queues() or blk_mq_run_hw_queue() would make a difference and > >>>>>> also that moving the percpu_ref_exit() call into blk_release_queue() makes > >>>>>> sense. > >>>>> > >>>>> If percpu_ref_exit() is moved to blk_release_queue(), we still need to > >>>>> move freeing of hw queue's resource into blk_release_queue() like what > >>>>> the patchset is doing. > >>>> > >>>> Hi Ming, > >>>> > >>>> Would you mind help explain why we still need to move freeing of hw queue's > >>>> resource into blk_release_queue() like what the patchset is doing? > >>>> > >>>> Let's assume there is no deadlock when percpu_ref_tryget_live() is used, > >>> > >>> Could you explain why the assumption is true? > >>> > >>> We have to run queue after starting to freeze queue for draining > >>> allocated requests and making forward progress. Inside blk_freeze_queue_start(), > >>> percpu_ref_kill() marks this ref as DEAD, then percpu_ref_tryget_live() returns > >>> false, then queue won't be run. > >> > >> Hi Ming, > >> > >> I understand the assumption is invalid and there is issue when using > >> percpu_ref_tryget_live. And I also understand we have to run queue after > >> starting to freeze queue for draining allocated requests and making forward > >> progress. > > > > OK. > > > >> > >> > >> I am just wondering specifically on why "If percpu_ref_exit() is moved to > >> blk_release_queue(), we still need to move freeing of hw queue's resource into > >> blk_release_queue() like what the patchset is doing." based on below Bart's > >> statement: > >> > >> "percpu_ref_tryget_live() fails if a per-cpu counter is in the "dead" state. > >> percpu_ref_kill() changes the state of a per-cpu counter to the "dead" state. > >> blk_freeze_queue_start() calls percpu_ref_kill(). blk_cleanup_queue() already > >> calls blk_set_queue_dying() and that last function call > >> blk_freeze_queue_start(). So I think that what you wrote is not correct and that > >> inserting a percpu_ref_tryget_live()/percpu_ref_put() pair in > >> blk_mq_run_hw_queues() or blk_mq_run_hw_queue() would make a difference and also > >> that moving the percpu_ref_exit() call into blk_release_queue() makes sense." > > > > As you mentioned, percpu_ref_tryget_live() can't be used to avoid run queue > > during cleanup, then run queue can come when queue is cleaned up. > > > >> > >> That's is, what is penalty if we do not move freeing of hw queue's resource > >> into blk_release_queue() like what the patchset is doing in above situation? > > > > kernel oops as reported by James, because some fields of hctx will be used > > by run queue, and they can be freed by blk_mq_free_queue() in > > blk_cleanup_queue(). > > > >> > >> I ask this question just because I would like to better understand the source > >> code. Does "hw queue's resource" indicate the below? > >> > >> + if (hctx->flags & BLK_MQ_F_BLOCKING) > >> + cleanup_srcu_struct(hctx->srcu); > >> + blk_free_flush_queue(hctx->fq); > >> + sbitmap_free(&hctx->ctx_map); > > > > Right. > > Hi Ming, > > Thank you very much for the detailed explanation. > > I think maybe I misunderstood your message in the email. > > In another direction posted by Bart as below, regardless about which direction > is better, that implementation does not move freeing of hw queue's resource into > blk_release_queue(), although percpu_ref_exit() is moved to blk_release_queue(). > That's why I would like to confirm. > > https://lore.kernel.org/linux-block/20190401212014.192753-1-bvanassche@xxxxxxx/ > > In that direction, the more friendly percpu_ref_tryget(), which is suggested by > Jianchao, is used. I would like just to confirm that there is no need to move > freeing of hw queue's resource into blk_release_queue() when the get/put method > is friendly and fair enough. I don't think it is friendly to add such unnecessary stuff in the very fast path. Thanks, Ming Lei