[PATCH 0/5] blk-mq: allow to run queue if queue refcount is held

"jianchao.wang" <jianchao.w.wang@xxxxxxxxxx> · Tue, 2 Apr 2019 16:07:04 +0800

Hi Ming

On 4/2/19 10:55 AM, Ming Lei wrote:
> On Tue, Apr 02, 2019 at 10:02:43AM +0800, jianchao.wang wrote:
>> Hi Ming
>>
>> On 4/1/19 6:03 PM, Ming Lei wrote:
>>> On Mon, Apr 01, 2019 at 05:19:01PM +0800, jianchao.wang wrote:
>>>> Hi Ming
>>>>
>>>> On 4/1/19 11:28 AM, Ming Lei wrote:
>>>>> On Mon, Apr 01, 2019 at 11:25:50AM +0800, jianchao.wang wrote:
>>>>>> Hi Ming
>>>>>>
>>>>>> On 4/1/19 10:52 AM, Ming Lei wrote:
>>>>>>>> percpu_ref_tryget_live() fails if a per-cpu counter is in the "dead" state.
>>>>>>>> percpu_ref_kill() changes the state of a per-cpu counter to the "dead"
>>>>>>>> state. blk_freeze_queue_start() calls percpu_ref_kill(). blk_cleanup_queue()
>>>>>>>> already calls blk_set_queue_dying() and that last function calls
>>>>>>>> blk_freeze_queue_start(). So I think that what you wrote is not correct and
>>>>>>>> that inserting a percpu_ref_tryget_live()/percpu_ref_put() pair in
>>>>>>>> blk_mq_run_hw_queues() or blk_mq_run_hw_queue() would make a difference and
>>>>>>>> also that moving the percpu_ref_exit() call into blk_release_queue() makes
>>>>>>>> sense.
>>>>>>> If percpu_ref_exit() is moved to blk_release_queue(), we still need to
>>>>>>> move freeing of hw queue's resource into blk_release_queue() like what
>>>>>>> the patchset is doing.
>>>>>>>
>>>>>>> Then we don't need to get/put q_usage_counter in blk_mq_run_hw_queues() any more,
>>>>>>> do we?
>>>>>>
>>>>>> IMO, if we could find a way to prevent any attempt to run the queue, it would be
>>>>>> better and clearer.
>>>>>
>>>>> It is hard to do that way, and not necessary.
>>>>>
>>>>> I will post V2 soon for review.
>>>>>
>>>>
>>>> Putting a percpu_ref_tryget/put pair into blk_mq_run_hw_queues could stop queue runs after the
>>>> request_queue is frozen and drained (running the queue is also unnecessary then, because there are no
>>>> entered requests). percpu_ref_tryget could also avoid the I/O hang issue you mentioned.
>>>> We have a similar one in blk_mq_timeout_work.
>>>
>>> If percpu_ref_tryget() is used, percpu_ref_exit() has to be moved into
>>> queue's release handler.
>>>
>>> Then we still have to move freeing hctx's resource into hctx or queue's
>>> release handler, that is exactly what this patch is doing. Then
>>> percpu_ref_tryget() becomes unnecessary again, right?
>>
>> I'm not sure about the percpu_ref_exit. Perhaps I have some misunderstanding about it.
>>
>> From its code, it frees the percpu_count and sets ref->percpu_count_ptr to __PERCPU_REF_ATOMIC_DEAD.
>> The comment says 'the caller is responsible for ensuring that @ref is no longer in active use'.
>> But if we use it after kill, does that count as an active use?
>> Based on the code, __ref_is_percpu is always false during this, and percpu_ref_tryget will not
>> touch the freed percpu counter but only the atomic ref->count.
>>
>> It looks safe.
> 
> OK, you are right.
> 
> However, I still think it isn't necessary to hold the percpu_ref in the
> very fast I/O path.

percpu_ref was designed for the fast path.
Some drivers, such as SCSI, use it in the completion path. Does it really
matter for this kind of device? If it does, I guess we should remove blk_mq_run_hw_queues,
which is the really bulky part, and depend on the hctx restart mechanism.
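
To make the percpu_ref behaviour discussed above concrete, here is a rough
userspace sketch, not kernel code. The model_ref names are made up for
illustration, the model is single-threaded, and it only mimics how I read
percpu_ref_tryget(), percpu_ref_tryget_live(), percpu_ref_kill() and
percpu_ref_exit(): after kill, tryget only looks at the plain counter, so it
never touches the freed per-cpu storage.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical single-threaded model, NOT the kernel's struct percpu_ref. */
struct model_ref {
	long *percpu;      /* stands in for the per-cpu counters; NULL once freed */
	long count;        /* stands in for the atomic ref->count */
	bool atomic_mode;  /* models __PERCPU_REF_ATOMIC */
	bool dead;         /* models __PERCPU_REF_DEAD */
};

static inline void model_ref_init(struct model_ref *ref)
{
	ref->percpu = calloc(1, sizeof(*ref->percpu));
	assert(ref->percpu != NULL);
	ref->count = 1;            /* initial reference, dropped by kill */
	ref->atomic_mode = false;
	ref->dead = false;
}

/* Like percpu_ref_tryget(): fails only once the count has reached zero. */
static inline bool model_ref_tryget(struct model_ref *ref)
{
	if (!ref->atomic_mode) {
		(*ref->percpu)++;
		return true;
	}
	if (ref->count == 0)
		return false;
	ref->count++;              /* atomic mode: per-cpu storage is not touched */
	return true;
}

/* Like percpu_ref_tryget_live(): additionally fails once the ref is killed. */
static inline bool model_ref_tryget_live(struct model_ref *ref)
{
	if (ref->atomic_mode && ref->dead)
		return false;
	return model_ref_tryget(ref);
}

/* Drop one reference; returns true when the count reaches zero. */
static inline bool model_ref_put(struct model_ref *ref)
{
	if (!ref->atomic_mode) {
		(*ref->percpu)--;
		return false;
	}
	assert(ref->count > 0);
	return --ref->count == 0;
}

/* Like percpu_ref_kill(): mark dead, fold per-cpu counts, drop initial ref. */
static inline void model_ref_kill(struct model_ref *ref)
{
	ref->dead = true;
	ref->atomic_mode = true;
	ref->count += *ref->percpu;
	*ref->percpu = 0;
	model_ref_put(ref);
}

/* Like percpu_ref_exit(): free the per-cpu storage. */
static inline void model_ref_exit(struct model_ref *ref)
{
	free(ref->percpu);
	ref->percpu = NULL;
	ref->atomic_mode = true;
	ref->dead = true;
}
```

In this model, with one reference held across kill, model_ref_tryget_live()
fails while model_ref_tryget() still succeeds. Once the last reference is put
and model_ref_exit() has run, model_ref_tryget() fails on the zero count
without looking at the freed storage.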

> 
>>
>>
>>>
>>>>
>>>> Freezing and draining the queue stops new attempts to run the queue, blk_sync_queue syncs and stops
>>>> the started ones, and then hctx->run_queue is cleaned up completely.
>>>>
>>>> IMO, it would be better to have a checkpoint after which there are no in-flight
>>>> asynchronous activities of the request_queue (hctx->run_work, q->requeue_work, q->timeout_work)
>>>> and any attempt to start them will fail.
>>>
>>> All of them are canceled in blk_cleanup_queue(), but that is not enough, given that the queue can
>>> be run in sync mode (such as via plug, direct issue, ...), or by the driver's
>>> requeue, such as SCSI's requeue. SCSI's requeue may run another LUN's queue
>>> just by holding the queue's kobject refcount.
>>
>> Yes, so we need a checkpoint here to ensure that the request_queue enters a well-defined state.
>> We provide a guarantee that all of the activities are stopped after this checkpoint.
>> That makes it convenient to do other things afterwards, for example releasing the request_queue's
>> resources.
> 
> We have such a checkpoint already:
> 
> 	blk_freeze_queue() together with blk_sync_queue()
> 
> Once the two are done, there shouldn't be any driver activities at all.
> 
> The current issue is related to the blk-mq internal implementation, in which
> it should have been safe to complete the run queue activity during queue
> cleanup as long as the request queue's kobject refcount isn't released.
> 
> However, 45a9c9d909b2 ("blk-mq: Fix a use-after-free") frees the hctx
> resources too early, and causes the kernel oops.
> 
> Also, isn't it the typical practice to release kobject related resources in
> its release handler?

I agree with this.
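
A rough userspace sketch of the release-handler point, with made-up names
(model_queue and friends are not kernel APIs, and the model is
single-threaded): as long as someone still holds a reference on the kobject,
the hctx resources stay around, and they only go away in the release handler.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model of a request_queue; NOT the kernel structure. */
struct model_queue {
	int kobj_refs;   /* models the request_queue kobject refcount */
	int *hctx_res;   /* stands in for the per-hctx resources */
	bool released;   /* set once the release handler has run */
};

static inline struct model_queue *model_queue_alloc(void)
{
	struct model_queue *q = calloc(1, sizeof(*q));

	assert(q != NULL);
	q->hctx_res = calloc(1, sizeof(*q->hctx_res));
	assert(q->hctx_res != NULL);
	q->kobj_refs = 1;   /* reference owned by the creator */
	return q;
}

static inline void model_queue_get(struct model_queue *q)
{
	assert(q->kobj_refs > 0);
	q->kobj_refs++;
}

/* Release handler: the only place where the hctx resources are freed. */
static inline void model_queue_release(struct model_queue *q)
{
	free(q->hctx_res);
	q->hctx_res = NULL;
	q->released = true;
}

/* Drop one reference; returns true if the release handler ran. */
static inline bool model_queue_put(struct model_queue *q)
{
	assert(q->kobj_refs > 0);
	if (--q->kobj_refs == 0) {
		model_queue_release(q);
		return true;
	}
	return false;
}

/* A queue run by a holder of a kobject reference: is hctx_res still valid? */
static inline bool model_run_queue(const struct model_queue *q)
{
	assert(q->kobj_refs > 0);
	return q->hctx_res != NULL;
}
```

In this model, a caller that took a reference with model_queue_get() can
still run the queue after the creator has dropped its own reference, and the
resources are freed only by the final model_queue_put().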

Thanks
Jianchao