On 2020/07/17 19:02, Ming Lei wrote:
> On Fri, Jul 17, 2020 at 09:19:50AM +0000, Damien Le Moal wrote:
>> On 2020/07/17 18:12, Ming Lei wrote:
>>> On Fri, Jul 17, 2020 at 08:22:45AM +0000, Damien Le Moal wrote:
>>>> On 2020/07/17 16:50, Ming Lei wrote:
>>>>> On Fri, Jul 17, 2020 at 02:45:25AM +0000, Damien Le Moal wrote:
>>>>>> On 2020/07/16 23:35, Christoph Hellwig wrote:
>>>>>>> On Thu, Jul 16, 2020 at 07:09:33PM +0900, Johannes Thumshirn wrote:
>>>>>>>> Max append sectors needs to be aligned to the physical block size, otherwise
>>>>>>>> we can end up in a situation where it is off by 1-3 sectors, which would
>>>>>>>> cause short writes with asynchronous zone append submissions from an FS.
>>>>>>>
>>>>>>> Huh? The physical block size is purely a hint.
>>>>>>
>>>>>> For ZBC/ZAC SMR drives, all writes must be aligned to the physical sector size.
>>>>>
>>>>> Then the physical block size should be the same as the logical block size.
>>>>> The real workable limit for an IO request is aligned to the logical block size.
>>>>
>>>> Yes, I know. This T10/T13 design is not the brightest thing they did... on 512e
>>>> SMR drives, addressing is in LBA = 512B units, but all writes in sequential zones
>>>> must be 4K aligned (8 LBAs).
>>>
>>> Then the issue isn't related to the zone append command only. Just wondering how
>>> this special write block size alignment is enforced in sequential zones. So far,
>>> writes from an FS or a raw block device are only logical block size aligned.
>>
>> This is not enforced in sd/sd_zbc.c. If the user issues a non 4K aligned
>> request, it will get back an "unaligned write" error from the drive. zonefs and
>> dm-zoned define a 4K block size to avoid that. Applications doing raw block
>> device accesses have to issue properly aligned writes themselves.
>
> OK, then I guess either:
>
> 1) the same write alignment issue for zone append can be handled in the
> same way as a normal write on a seq zone

Yes, absolutely. I think this is fine. The point was that we should at least be
exposing a meaningful maximum command size for that, so that the user does not
have to align that maximum too.

>
> OR
>
> 2) add one new limit for writes on seq zones, such as: zone_write_block_size
>
> Then the two cases can be dealt with in the same way. The physical block
> size is usually just a hint, as Christoph mentioned, so it looks a bit weird to
> use it this way, or at least the story should be documented.

Yeah, but zone_write_block_size would end up always being equal to the physical
block size for SMR. For ZNS and nullblk, logical block size == physical block
size, always, so it would not be useful. I do not think such a change is necessary.

>
>>
>>>
>>>>
>>>>>
>>>>>> However, sd/sd_zbc does not change max_hw_sectors_kb to ensure alignment to 4K
>>>>>> on 512e disks. There is also nullblk, which uses the default max_hw_sectors_kb of
>>>>>> 255 x 512B sectors, which is not 4K aligned if the nullb device is created with
>>>>>> a 4K block size.
>>>>>
>>>>> Actually the real limit is max_sectors_kb, which is <= max_hw_sectors_kb, and
>>>>> both should be aligned to the logical block size, IMO.
>>>>
>>>> Yes, agreed, but for a nullblk device created with block size = 4K it is not. So
>>>
>>> That is because of the default magic number BLK_SAFE_MAX_SECTORS.
>>>
>>>> one driver to patch for sure. However, I thought having some forced alignment in
>>>> blk_queue_max_hw_sectors() for limit->max_hw_sectors and limit->max_sectors
>>>> would avoid tripping on weird values for weird drives...
>>>
>>> Maybe we can update it once the logical block size is available, such
>>> as:
>>>
>>> diff --git a/block/blk-settings.c b/block/blk-settings.c
>>> index 9a2c23cd9700..f9cbaadaa267 100644
>>> --- a/block/blk-settings.c
>>> +++ b/block/blk-settings.c
>>> @@ -311,6 +311,14 @@ void blk_queue_max_segment_size(struct request_queue *q, unsigned int max_size)
>>>  }
>>>  EXPORT_SYMBOL(blk_queue_max_segment_size);
>>>
>>> +static unsigned blk_queue_round_sectors(struct request_queue *q,
>>> +					unsigned sectors)
>>> +{
>>> +	u64 bytes = (u64)sectors << 9;
>>> +
>>> +	return (unsigned)(round_down(bytes, queue_logical_block_size(q)) >> 9);
>>> +}
>>> +
>>>  /**
>>>   * blk_queue_logical_block_size - set logical block size for the queue
>>>   * @q:  the request queue for the device
>>> @@ -330,6 +338,9 @@ void blk_queue_logical_block_size(struct request_queue *q, unsigned int size)
>>>
>>>  	if (q->limits.io_min < q->limits.physical_block_size)
>>>  		q->limits.io_min = q->limits.physical_block_size;
>>> +
>>> +	q->limits.max_sectors = blk_queue_round_sectors(q,
>>> +					q->limits.max_sectors);
>>>  }
>>>  EXPORT_SYMBOL(blk_queue_logical_block_size);
>>
>> Yes, something like this was what I had in mind, so that 4Kn drives always get a
>> sensible value aligned to the 4K LBA. However, with the above, there is no
>> guarantee that max_sectors is already set when the logical block size is set.
>> I am not sure about the reverse either. So the rounding may need to be in both
>> blk_queue_logical_block_size() and blk_queue_max_hw_sectors().
>
> OK, that looks better.

OK. Johannes, care to resend your patch?

>
> Thanks,
> Ming
>
>

-- 
Damien Le Moal
Western Digital Research
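
For illustration only, here is a minimal user-space sketch (not kernel code, and
not the actual patch) of the round-down that blk_queue_round_sectors() in the
diff above would perform, applied to the nullblk case discussed in this thread:
the default BLK_SAFE_MAX_SECTORS (255) with a 4K logical block size. The helper
name round_sectors_to_lbs() and the constants are only for this example.

/*
 * Standalone demo: round a max_sectors value (512B sectors) down to the
 * logical block size, as the proposed blk_queue_round_sectors() would do.
 */
#include <stdio.h>

#define SECTOR_SHIFT		9
#define BLK_SAFE_MAX_SECTORS	255

static unsigned int round_sectors_to_lbs(unsigned int sectors,
					  unsigned int logical_block_size)
{
	unsigned long long bytes = (unsigned long long)sectors << SECTOR_SHIFT;

	/* open-coded equivalent of the kernel's round_down() */
	bytes -= bytes % logical_block_size;

	return bytes >> SECTOR_SHIFT;
}

int main(void)
{
	unsigned int lbs = 4096;		/* nullb device created with bs=4096 */
	unsigned int max_sectors = BLK_SAFE_MAX_SECTORS;

	printf("%u sectors (%u bytes) -> %u sectors after rounding down to %u B\n",
	       max_sectors, max_sectors << SECTOR_SHIFT,
	       round_sectors_to_lbs(max_sectors, lbs), lbs);
	/* prints: 255 sectors (130560 bytes) -> 248 sectors after rounding down to 4096 B */
	return 0;
}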
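
Similarly, for the point earlier in the thread that applications doing raw block
device accesses to a 512e SMR drive must align their own writes to the 4K
physical block size, a rough user-space sketch of such an aligned write follows.
The device path and zone start offset are placeholders, not values from the
thread.

/*
 * Hypothetical example: a physical-block-size aligned write to a sequential
 * zone of a 512e SMR drive via the raw block device.
 */
#define _GNU_SOURCE		/* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define PHYS_BLOCK_SIZE	4096

int main(void)
{
	const char *dev = "/dev/sdX";		/* placeholder device name */
	off_t zone_start = 256ULL << 20;	/* placeholder zone start, 4K aligned */
	void *buf;
	int fd, ret = 1;

	fd = open(dev, O_WRONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* O_DIRECT also requires a suitably aligned buffer */
	if (posix_memalign(&buf, PHYS_BLOCK_SIZE, PHYS_BLOCK_SIZE)) {
		close(fd);
		return 1;
	}
	memset(buf, 0, PHYS_BLOCK_SIZE);

	/*
	 * 4K write at a 4K aligned offset: accepted by the drive. A non 4K
	 * aligned write to a sequential zone would pass the block layer
	 * (512B logical blocks) but come back from the drive as an
	 * "unaligned write" error, as noted in the thread above.
	 */
	if (pwrite(fd, buf, PHYS_BLOCK_SIZE, zone_start) == PHYS_BLOCK_SIZE)
		ret = 0;
	else
		perror("pwrite");

	free(buf);
	close(fd);
	return ret;
}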