On 8/24/17 16:10, Ming Lei wrote:
> On Thu, Aug 24, 2017 at 09:59:13AM +0900, Damien Le Moal wrote:
>>
>> On 8/23/17 05:43, Bart Van Assche wrote:
>>> On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
>>>> +static inline bool blk_mq_has_dispatch_rqs(struct blk_mq_hw_ctx *hctx)
>>>> +{
>>>> +	return !list_empty_careful(&hctx->dispatch);
>>>> +}
>>>> +
>>>> +static inline void blk_mq_add_rq_to_dispatch(struct blk_mq_hw_ctx *hctx,
>>>> +		struct request *rq)
>>>> +{
>>>> +	spin_lock(&hctx->lock);
>>>> +	list_add(&rq->queuelist, &hctx->dispatch);
>>>> +	blk_mq_hctx_set_dispatch_busy(hctx);
>>>> +	spin_unlock(&hctx->lock);
>>>> +}
>>>> +
>>>> +static inline void blk_mq_add_list_to_dispatch(struct blk_mq_hw_ctx *hctx,
>>>> +		struct list_head *list)
>>>> +{
>>>> +	spin_lock(&hctx->lock);
>>>> +	list_splice_init(list, &hctx->dispatch);
>>>> +	blk_mq_hctx_set_dispatch_busy(hctx);
>>>> +	spin_unlock(&hctx->lock);
>>>> +}
>>>> +
>>>> +static inline void blk_mq_add_list_to_dispatch_tail(struct blk_mq_hw_ctx *hctx,
>>>> +		struct list_head *list)
>>>> +{
>>>> +	spin_lock(&hctx->lock);
>>>> +	list_splice_tail_init(list, &hctx->dispatch);
>>>> +	blk_mq_hctx_set_dispatch_busy(hctx);
>>>> +	spin_unlock(&hctx->lock);
>>>> +}
>>>> +
>>>> +static inline void blk_mq_take_list_from_dispatch(struct blk_mq_hw_ctx *hctx,
>>>> +		struct list_head *list)
>>>> +{
>>>> +	spin_lock(&hctx->lock);
>>>> +	list_splice_init(&hctx->dispatch, list);
>>>> +	spin_unlock(&hctx->lock);
>>>> +}
>>>
>>> Same comment for this patch: these helper functions are so short that
>>> I'm not sure it is useful to introduce these helper functions.
>>>
>>> Bart.
>>
>> Personally, I like those very much as they give a place to hook up
>> different dispatch_list handling without having to change blk-mq.c and
>> blk-mq-sched.c all over the place.
>>
>> I am thinking of SMR (zoned block device) support here since we need
>> to sort insert write requests in blk_mq_add_rq_to_dispatch() and
>> blk_mq_add_list_to_dispatch_tail(). For this last one, the name would
>
> Could you explain a bit why you sort insert write rq in these two
> helpers? which are only triggered in case of dispatch busy.

For host-managed zoned block devices (SMR disks), writes have to be
sequential per zone of the device, otherwise the device returns an
"unaligned write error". This constraint is exposed to the user (FS,
device mapper, or applications doing raw I/Os to the block device). So a
well-behaved user (f2fs and dm-zoned in the kernel) will issue sequential
writes to zones (this includes write same and write zeroes).

However, the current blk-mq code does not guarantee that this issuing
order will be respected at dispatch time. There are multiple reasons:

1) On submit, requests first go into per-CPU context lists. So if the
issuer thread is moved from one CPU to another in the middle of a
sequential write request sequence, the sequence would be broken in two
parts into different lists. When a scheduler is used, this can get fixed
by a nice LBA ordering (mq-deadline would do that), but in the "none"
scheduler case, the requests in the CPU lists are merged into the hctx
dispatch queue in CPU number order, potentially reordering a write
sequence. Hence the sort insert to fix that.

2) Dispatch may happen in parallel, with multiple contexts getting
requests from the scheduler (or the CPU context lists), and again break
write sequences. I think your patch partly addresses this with the BUSY
flag, but only once requests are already in the dispatch queue. At
dispatch time, requests go directly from the scheduler (or CPU context
lists) into the dispatch caller's local list, bypassing the hctx dispatch
queue. Write ordering gets broken again here.
3) The SCSI layer locks a device zone when it sees a write, to prevent
subsequent writes to the same zone. This is to avoid reordering at the
HBA level (AHCI has a 100% chance of doing it). In effect, this is a
write QD=1 per zone implementation, and that causes requeues of write
requests. In that case, the requeue happens after the ->queue_rq() call,
so it uses the dispatch context local list. Combine that with (2) and
reordering can happen here too.

I have a series implementing a "dispatcher", which is basically a set of
operations that a device driver can specify to handle the dispatch
queue. The operations are basically "insert" and "get", so what you have
as helpers. The default dispatcher uses operations very similar to your
helpers, minus the busy bit. That is, the default dispatcher does not
change the current behavior. But a ZBC dispatcher can be installed by the
SCSI layer for a ZBC drive at device scan time and do the sort insert of
write requests to address all the potential reordering. That series goes
beyond what you did, and I was waiting for your series to stabilize to
rebase (or rework) what I did on top of your changes, which look like
they would simplify handling of ZBC drives.

Note: ZBC drives are single queue devices, so they always have a single
hctx.

>
>> become a little awkward though. This sort insert would be to avoid
>> breaking a sequential write request sequence sent by the disk user.
>> This is needed since this reordering breakage cannot be solved only
>> from the SCSI layer.
>
> Basically this patchset tries to flush hctx->dispatch first, then
> fetch requests from the scheduler queue after hctx->dispatch is
> flushed.
>
> But it isn't strictly in this way because the implementation is
> lockless.
>
> So it looks like requests from hctx->dispatch won't break incoming
> requests from the scheduler.

Yes, I understood that. The result would be requests staying longer in
the scheduler, increasing the chances of merging. I think this is a good
idea.
But unfortunately, I do not think that it will solve the ZBC sequential
write case on its own. In particular, it does not change much about what
happens to write request order without a scheduler, nor address the
potential reordering on requeue. But I may be completely wrong here. I
need to look at your code more carefully.

I have not yet tested your code on ZBC drives. I will do it as soon as I
can.

Thanks.

Best regards.

-- 
Damien Le Moal,
Western Digital Research