On 2020/09/15 0:04, Mike Snitzer wrote:
> On Sun, Sep 13 2020 at 8:46pm -0400,
> Damien Le Moal <Damien.LeMoal@xxxxxxx> wrote:
>
>> On 2020/09/12 6:53, Mike Snitzer wrote:
>>> blk_queue_get_max_sectors() has been trained for REQ_OP_WRITE_SAME and
>>> REQ_OP_WRITE_ZEROES yet blk_rq_get_max_sectors() didn't call it for
>>> those operations.
>>>
>>> Also, there is no need to avoid blk_max_size_offset() if
>>> 'chunk_sectors' isn't set because it falls back to 'max_sectors'.
>>>
>>> Signed-off-by: Mike Snitzer <snitzer@xxxxxxxxxx>
>>> ---
>>>  include/linux/blkdev.h | 19 +++++++++++++------
>>>  1 file changed, 13 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>>> index bb5636cc17b9..453a3d735d66 100644
>>> --- a/include/linux/blkdev.h
>>> +++ b/include/linux/blkdev.h
>>> @@ -1070,17 +1070,24 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
>>>  						  sector_t offset)
>>>  {
>>>  	struct request_queue *q = rq->q;
>>> +	int op;
>>> +	unsigned int max_sectors;
>>>
>>>  	if (blk_rq_is_passthrough(rq))
>>>  		return q->limits.max_hw_sectors;
>>>
>>> -	if (!q->limits.chunk_sectors ||
>>> -	    req_op(rq) == REQ_OP_DISCARD ||
>>> -	    req_op(rq) == REQ_OP_SECURE_ERASE)
>>> -		return blk_queue_get_max_sectors(q, req_op(rq));
>>> +	op = req_op(rq);
>>> +	max_sectors = blk_queue_get_max_sectors(q, op);
>>>
>>> -	return min(blk_max_size_offset(q, offset),
>>> -		   blk_queue_get_max_sectors(q, req_op(rq)));
>>> +	switch (op) {
>>> +	case REQ_OP_DISCARD:
>>> +	case REQ_OP_SECURE_ERASE:
>>> +	case REQ_OP_WRITE_SAME:
>>> +	case REQ_OP_WRITE_ZEROES:
>>> +		return max_sectors;
>>> +	}
>>
>> Doesn't this break md devices? (I think md does use chunk_sectors for
>> stride size, no?)
>>
>> As mentioned in my reply to Ming's email, this will allow these commands
>> to potentially cross over zone boundaries on zoned block devices, which
>> would be an immediate command failure.
>
> Depending on the implementation it is beneficial to get a large
> discard (one not constrained by chunk_sectors, e.g. dm-stripe.c's
> optimization for handling large discards and issuing N discards, one per
> stripe). Same could apply for other commands.
>
> Like all devices, zoned devices should impose command specific limits in
> the queue_limits (and not lean on chunk_sectors to do a
> one-size-fits-all).

Yes, understood. But I think that in the case of md, chunk_sectors is used
to indicate the boundary between drives of a raid volume. So it does
indeed make sense to limit the IO size on submission: otherwise, the md
driver itself would have to split that bio again anyway.

> But that aside, yes I agree I didn't pay close enough attention to the
> implications of deferring the splitting of these commands until they
> were issued to underlying storage. This chunk_sectors early splitting
> override is a bit of a mess... not quite following the logic given we
> were supposed to be waiting to split bios as late as possible.

My view is that multipage bvecs (BIOs almost as large as we want) and late
splitting are beneficial for getting larger effective BIOs sent to the
device, as having more pages on hand allows bigger segments in the bio
instead of always at most PAGE_SIZE per segment. The effect of this is
very visible with blktrace: a lot of requests end up being much larger
than the device max_segments * page_size.

However, if there is already a known limit on the BIO size when the BIO is
being built, it does not make much sense to try to grow a bio beyond that
limit since it will have to be split by the driver anyway. chunk_sectors
is one such limit, used by md (I think) to indicate boundaries between
drives of a raid volume. And we reuse it (abuse it?)
for zoned block devices to ensure that commands do not cross zone
boundaries, since that triggers errors: for writes within sequential
zones, and for reads/writes crossing between zones of different types
(a conventional->sequential zone boundary).

I may not have the entire picture correct here, but so far, this is my
understanding.

Cheers.

--
Damien Le Moal
Western Digital Research