On 2020/04/20 9:49, Bart Van Assche wrote:
> On 4/19/20 5:30 PM, Damien Le Moal wrote:
>> On 2020/04/19 1:46, Bart Van Assche wrote:
>>> On 2020-04-17 05:15, Johannes Thumshirn wrote:
>>>> From: Keith Busch <kbusch@xxxxxxxxxx>
>>>>
>>>> +/*
>>>> + * Check write append to a zoned block device.
>>>> + */
>>>> +static inline blk_status_t blk_check_zone_append(struct request_queue *q,
>>>> +						 struct bio *bio)
>>>> +{
>>>> +	sector_t pos = bio->bi_iter.bi_sector;
>>>> +	int nr_sectors = bio_sectors(bio);
>>>> +
>>>> +	/* Only applicable to zoned block devices */
>>>> +	if (!blk_queue_is_zoned(q))
>>>> +		return BLK_STS_NOTSUPP;
>>>> +
>>>> +	/* The bio sector must point to the start of a sequential zone */
>>>> +	if (pos & (blk_queue_zone_sectors(q) - 1) ||
>>>> +	    !blk_queue_zone_is_seq(q, pos))
>>>> +		return BLK_STS_IOERR;
>>>> +
>>>> +	/*
>>>> +	 * Not allowed to cross zone boundaries. Otherwise, the BIO will be
>>>> +	 * split and could result in non-contiguous sectors being written in
>>>> +	 * different zones.
>>>> +	 */
>>>> +	if (blk_queue_zone_no(q, pos) != blk_queue_zone_no(q, pos + nr_sectors))
>>>> +		return BLK_STS_IOERR;
>>>
>>> Can the above statement be simplified into the following?
>>>
>>> 	if (nr_sectors > q->limits.chunk_sectors)
>>> 		return BLK_STS_IOERR;
>>
>> That would be equivalent only if the zone is empty. If the zone is not
>> empty, we need to check that the zone append request does not cross over
>> to the next zone, which would result in the BIO being split by the block
>> layer.
>
> At the start of the blk_check_zone_append() function there is a check
> that 'pos' is aligned with a zone boundary. How can 'pos' at the same
> time represent a zone boundary and the exact offset at which the write
> will happen? I do not understand this.

pos indicates the zone to be written, not the actual position the data
will be written to. The actual position will be determined by the drive
when it receives the zone append command.
That position will be, of course, the zone write pointer, which the
drive manages. The actual position the data was written to is returned
by the drive and passed along in the IO stack using
bio->bi_iter.bi_sector.

Example: A thread sends a 4KB zone append write to a zone starting at
sector Z with a write pointer value of W. On submission,
bio->bi_iter.bi_sector must be equal to Z. On completion,
bio->bi_iter.bi_sector will be W. If another zone append is sent, again
bio->bi_iter.bi_sector must be equal to Z and on completion,
bio->bi_iter.bi_sector will be equal to W+8 (since 4KB have been written
with the previous zone append).

Zone append allows an application to send writes to any zone without
knowing the zone write pointer. The only thing the user needs to know is
whether the zone has enough free space or not. For a file system, this
changes the current classic model of (a) allocate blocks, write
allocated blocks, commit metadata, to something like (b) pick a target
zone with free space, write blocks, mark used blocks as allocated,
commit metadata.

>
>>>> +	/* Make sure the BIO is small enough and will not get split */
>>>> +	if (nr_sectors > q->limits.max_zone_append_sectors)
>>>> +		return BLK_STS_IOERR;
>>>
>>> Do we really need a new request queue limit parameter? In which cases
>>> will max_zone_append_sectors differ from the zone size?
>>
>> Yes, it can differ from the zone size. On real hardware,
>> max_zone_append_sectors will most of the time be equal to
>> max_hw_sectors_kb. But it could be smaller than that too. For the host,
>> since a zone append is a write, it is also subject to the
>> max_sectors_kb limit and cannot exceed that value.
>
> Would it be possible to use min(max_hw_sectors, zone_size) instead of
> introducing a new user-visible parameter?

For scsi emulation, that is what is being used to set
max_zone_append_sectors.
But that value may not be appropriate for a ZNS NVMe device, as that
device may have different constraints for regular writes and zone append
writes.

> Thanks,
>
> Bart.

--
Damien Le Moal
Western Digital Research