Re: [PATCH] block: align max append sectors to physical block size

Damien Le Moal <Damien.LeMoal@xxxxxxx> · Fri, 17 Jul 2020 09:19:50 +0000

On 2020/07/17 18:12, Ming Lei wrote:
> On Fri, Jul 17, 2020 at 08:22:45AM +0000, Damien Le Moal wrote:
>> On 2020/07/17 16:50, Ming Lei wrote:
>>> On Fri, Jul 17, 2020 at 02:45:25AM +0000, Damien Le Moal wrote:
>>>> On 2020/07/16 23:35, Christoph Hellwig wrote:
>>>>> On Thu, Jul 16, 2020 at 07:09:33PM +0900, Johannes Thumshirn wrote:
>>>>>> Max append sectors needs to be aligned to physical block size, otherwise
>>>>>> we can end up in a situation where it's off by 1-3 sectors which would
>>>>>> cause short writes with asynchronous zone append submissions from an FS.
>>>>>
>>>>> Huh? The physical block size is purely a hint.
>>>>
>>>> For ZBC/ZAC SMR drives, all writes must be aligned to the physical sector size.
>>>
>>> Then the physical block size should be same with logical block size.
>>> The real workable limit for io request is aligned with logical block size.
>>
>> Yes, I know. This T10/T13 design is not the brightest thing they did... on 512e
>> SMR drives, addressing is LBA=512B unit, but all writes in sequential zones must
>> be 4K aligned (8 LBAs).
> 
> Then the issue isn't related with zone append command only. Just wondering how this
> special write block size alignment is enhanced in sequential zones. So far, write
> from FS or raw block size is only logical block size aligned.

This is not enforced in sd/sd_zbc.c. If the user issues a non 4K aligned
request, it will get back an "unaligned write" error from the drive. zonefs and
dm-zoned define a 4K block size to avoid that. For applications doing raw block
device accesses, they have to issue properly aligned writes.

> 
>>
>>>
>>>> However, sd/sd_zbc does not change max_hw_sectors_kb to ensure alignment to 4K
>>>> on 512e disks. There is also nullblk which uses the default max_hw_sectors_kb to
>>>> 255 x 512B sectors, which is not 4K aligned if the nullb device is created with
>>>> 4K block size.
>>>
>>> Actually the real limit is from max_sectors_kb which is <= max_hw_sectors_kb, and
>>> both should be aligned with logical block size, IMO.
>>
>> Yes, agreed, but for nullblk device created with block size = 4K it is not. So
> 
> That is because the default magic number of BLK_SAFE_MAX_SECTORS.
> 
>> one driver to patch for sure. However, I though having some forced alignment in
>> blk_queue_max_hw_sectors() for limit->max_hw_sectors and limit->max_sectors
>> would avoid tripping on weird values for weird drives...
> 
> Maybe we can update it once the logical block size is available, such
> as:
> 
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index 9a2c23cd9700..f9cbaadaa267 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -311,6 +311,14 @@ void blk_queue_max_segment_size(struct request_queue *q, unsigned int max_size)
>  }
>  EXPORT_SYMBOL(blk_queue_max_segment_size);
>  
> +static unsigned blk_queue_round_sectors(struct request_queue *q,
> +		unsigned sectors)
> +{
> +	u64 bytes = sectors << 9;
> +
> +	return (unsigned)round_down(bytes, queue_logical_block_size(q));
> +}
> +
>  /**
>   * blk_queue_logical_block_size - set logical block size for the queue
>   * @q:  the request queue for the device
> @@ -330,6 +338,9 @@ void blk_queue_logical_block_size(struct request_queue *q, unsigned int size)
>  
>  	if (q->limits.io_min < q->limits.physical_block_size)
>  		q->limits.io_min = q->limits.physical_block_size;
> +
> +	q->limits.max_sectors = blk_queue_round_sectors(q,
> +			q->limits.max_sectors)
>  }
>  EXPORT_SYMBOL(blk_queue_logical_block_size);

Yes, something like this was what I had in mind so that 4Kn drives get a
sensible value aligned to the 4K LBA, always. However, with the above, there is
no guarantee that max_sectors is already set when the logical block size is set.
I am not sure about the reverse either. So the rounding may need to be in both
blk_queue_logical_block_size() and blk_queue_max_hw_sectors().

-- 
Damien Le Moal
Western Digital Research