On 2020/07/17 19:02, Ming Lei wrote:
> On Fri, Jul 17, 2020 at 09:19:50AM +0000, Damien Le Moal wrote:
>> On 2020/07/17 18:12, Ming Lei wrote:
>>> On Fri, Jul 17, 2020 at 08:22:45AM +0000, Damien Le Moal wrote:
>>>> On 2020/07/17 16:50, Ming Lei wrote:
>>>>> On Fri, Jul 17, 2020 at 02:45:25AM +0000, Damien Le Moal wrote:
>>>>>> On 2020/07/16 23:35, Christoph Hellwig wrote:
>>>>>>> On Thu, Jul 16, 2020 at 07:09:33PM +0900, Johannes Thumshirn wrote:
>>>>>>>> Max append sectors needs to be aligned to the physical block size, otherwise
>>>>>>>> we can end up in a situation where it is off by 1-3 sectors, which would
>>>>>>>> cause short writes with asynchronous zone append submissions from an FS.
>>>>>>>
>>>>>>> Huh? The physical block size is purely a hint.
>>>>>>
>>>>>> For ZBC/ZAC SMR drives, all writes must be aligned to the physical sector size.
>>>>>
>>>>> Then the physical block size should be the same as the logical block size.
>>>>> The real workable limit for an IO request is aligned to the logical block size.
>>>>
>>>> Yes, I know. This T10/T13 design is not the brightest thing they did... on 512e
>>>> SMR drives, addressing is in LBA = 512B units, but all writes in sequential zones
>>>> must be 4K aligned (8 LBAs).
>>>
>>> Then the issue isn't related to the zone append command only. Just wondering how
>>> this special write block size alignment is enforced in sequential zones. So far,
>>> writes from an FS or a raw block device are only logical block size aligned.
>>
>> This is not enforced in sd/sd_zbc.c. If the user issues a non 4K aligned
>> request, it will get back an "unaligned write" error from the drive. zonefs and
>> dm-zoned define a 4K block size to avoid that. Applications doing raw block
>> device accesses have to issue properly aligned writes themselves.
>
> OK, then I guess either:
>
> 1) the same write alignment issue for zone append can be handled in the
> same way as a normal write on a seq zone

Yes, absolutely. I think this is fine. The point was that we should at least be
exposing a meaningful maximum command size for that, so that the user does not
have to align that maximum too.

>
> OR
>
> 2) add one new limit for writes on seq zones, such as: zone_write_block_size
>
> Then the two cases can be dealt with in the same way. The physical block
> size is usually just a hint, as Christoph mentioned, so it looks a bit weird to
> use it this way, or at least the story should be documented.

Yeah, but zone_write_block_size would end up always being equal to the physical
block size for SMR. For ZNS and nullblk, logical block size == physical block
size, always, so it would not be useful. I do not think such a change is necessary.

>
>>
>>>
>>>>
>>>>>
>>>>>> However, sd/sd_zbc does not change max_hw_sectors_kb to ensure alignment to 4K
>>>>>> on 512e disks. There is also nullblk, which uses the default max_hw_sectors_kb of
>>>>>> 255 x 512B sectors, which is not 4K aligned if the nullb device is created with
>>>>>> a 4K block size.
>>>>>
>>>>> Actually the real limit is max_sectors_kb, which is <= max_hw_sectors_kb, and
>>>>> both should be aligned to the logical block size, IMO.
>>>>
>>>> Yes, agreed, but for a nullblk device created with block size = 4K it is not. So
>>>
>>> That is because of the default magic number BLK_SAFE_MAX_SECTORS.
>>>
>>>> one driver to patch for sure. However, I thought having some forced alignment in
>>>> blk_queue_max_hw_sectors() for limit->max_hw_sectors and limit->max_sectors
>>>> would avoid tripping on weird values for weird drives...
>>>
>>> Maybe we can update it once the logical block size is available, such
>>> as:
>>>
>>> diff --git a/block/blk-settings.c b/block/blk-settings.c
>>> index 9a2c23cd9700..f9cbaadaa267 100644
>>> --- a/block/blk-settings.c
>>> +++ b/block/blk-settings.c
>>> @@ -311,6 +311,14 @@ void blk_queue_max_segment_size(struct request_queue *q, unsigned int max_size)
>>>  }
>>>  EXPORT_SYMBOL(blk_queue_max_segment_size);
>>>
>>> +static unsigned blk_queue_round_sectors(struct request_queue *q,
>>> +					unsigned sectors)
>>> +{
>>> +	u64 bytes = (u64)sectors << 9;
>>> +
>>> +	return (unsigned)(round_down(bytes, queue_logical_block_size(q)) >> 9);
>>> +}
>>> +
>>>  /**
>>>   * blk_queue_logical_block_size - set logical block size for the queue
>>>   * @q:  the request queue for the device
>>> @@ -330,6 +338,9 @@ void blk_queue_logical_block_size(struct request_queue *q, unsigned int size)
>>>
>>>  	if (q->limits.io_min < q->limits.physical_block_size)
>>>  		q->limits.io_min = q->limits.physical_block_size;
>>> +
>>> +	q->limits.max_sectors = blk_queue_round_sectors(q,
>>> +					q->limits.max_sectors);
>>>  }
>>>  EXPORT_SYMBOL(blk_queue_logical_block_size);
>>
>> Yes, something like this was what I had in mind, so that 4Kn drives always get a
>> sensible value aligned to the 4K LBA. However, with the above, there is no
>> guarantee that max_sectors is already set when the logical block size is set.
>> I am not sure about the reverse either. So the rounding may need to be in both
>> blk_queue_logical_block_size() and blk_queue_max_hw_sectors().
>
> OK, that looks better.

OK. Johannes, care to resend your patch?

>
> Thanks,
> Ming
>
>

-- 
Damien Le Moal
Western Digital Research
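
For illustration only, here is a minimal user-space sketch (not kernel code, and
not the actual patch) of the round-down that blk_queue_round_sectors() in the
diff above would perform, applied to the nullblk case discussed in this thread:
the default BLK_SAFE_MAX_SECTORS (255) with a 4K logical block size. The helper
name round_sectors_to_lbs() and the constants are only for this example.

/*
 * Standalone demo: round a max_sectors value (512B sectors) down to the
 * logical block size, as the proposed blk_queue_round_sectors() would do.
 */
#include <stdio.h>

#define SECTOR_SHIFT		9
#define BLK_SAFE_MAX_SECTORS	255

static unsigned int round_sectors_to_lbs(unsigned int sectors,
					  unsigned int logical_block_size)
{
	unsigned long long bytes = (unsigned long long)sectors << SECTOR_SHIFT;

	/* open-coded equivalent of the kernel's round_down() */
	bytes -= bytes % logical_block_size;

	return bytes >> SECTOR_SHIFT;
}

int main(void)
{
	unsigned int lbs = 4096;		/* nullb device created with bs=4096 */
	unsigned int max_sectors = BLK_SAFE_MAX_SECTORS;

	printf("%u sectors (%u bytes) -> %u sectors after rounding down to %u B\n",
	       max_sectors, max_sectors << SECTOR_SHIFT,
	       round_sectors_to_lbs(max_sectors, lbs), lbs);
	/* prints: 255 sectors (130560 bytes) -> 248 sectors after rounding down to 4096 B */
	return 0;
}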
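
Similarly, for the point earlier in the thread that applications doing raw block
device accesses to a 512e SMR drive must align their own writes to the 4K
physical block size, a rough user-space sketch of such an aligned write follows.
The device path and zone start offset are placeholders, not values from the
thread.

/*
 * Hypothetical example: a physical-block-size aligned write to a sequential
 * zone of a 512e SMR drive via the raw block device.
 */
#define _GNU_SOURCE		/* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define PHYS_BLOCK_SIZE	4096

int main(void)
{
	const char *dev = "/dev/sdX";		/* placeholder device name */
	off_t zone_start = 256ULL << 20;	/* placeholder zone start, 4K aligned */
	void *buf;
	int fd, ret = 1;

	fd = open(dev, O_WRONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* O_DIRECT also requires a suitably aligned buffer */
	if (posix_memalign(&buf, PHYS_BLOCK_SIZE, PHYS_BLOCK_SIZE)) {
		close(fd);
		return 1;
	}
	memset(buf, 0, PHYS_BLOCK_SIZE);

	/*
	 * 4K write at a 4K aligned offset: accepted by the drive. A non 4K
	 * aligned write to a sequential zone would pass the block layer
	 * (512B logical blocks) but come back from the drive as an
	 * "unaligned write" error, as noted in the thread above.
	 */
	if (pwrite(fd, buf, PHYS_BLOCK_SIZE, zone_start) == PHYS_BLOCK_SIZE)
		ret = 0;
	else
		perror("pwrite");

	free(buf);
	close(fd);
	return ret;
}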