On 2020/04/20 9:49, Bart Van Assche wrote:
> On 4/19/20 5:30 PM, Damien Le Moal wrote:
>> On 2020/04/19 1:46, Bart Van Assche wrote:
>>> On 2020-04-17 05:15, Johannes Thumshirn wrote:
>>>> From: Keith Busch <kbusch@xxxxxxxxxx>
>>>>
>>>> +/*
>>>> + * Check write append to a zoned block device.
>>>> + */
>>>> +static inline blk_status_t blk_check_zone_append(struct request_queue *q,
>>>> +						 struct bio *bio)
>>>> +{
>>>> +	sector_t pos = bio->bi_iter.bi_sector;
>>>> +	int nr_sectors = bio_sectors(bio);
>>>> +
>>>> +	/* Only applicable to zoned block devices */
>>>> +	if (!blk_queue_is_zoned(q))
>>>> +		return BLK_STS_NOTSUPP;
>>>> +
>>>> +	/* The bio sector must point to the start of a sequential zone */
>>>> +	if (pos & (blk_queue_zone_sectors(q) - 1) ||
>>>> +	    !blk_queue_zone_is_seq(q, pos))
>>>> +		return BLK_STS_IOERR;
>>>> +
>>>> +	/*
>>>> +	 * Not allowed to cross zone boundaries. Otherwise, the BIO will be
>>>> +	 * split and could result in non-contiguous sectors being written in
>>>> +	 * different zones.
>>>> +	 */
>>>> +	if (blk_queue_zone_no(q, pos) != blk_queue_zone_no(q, pos + nr_sectors))
>>>> +		return BLK_STS_IOERR;
>>>
>>> Can the above statement be simplified into the following?
>>>
>>> 	if (nr_sectors > q->limits.chunk_sectors)
>>> 		return BLK_STS_IOERR;
>>
>> That would be equivalent only if the zone is empty. If the zone is not
>> empty, we need to check that the zone append request does not cross over
>> to the next zone, which would result in the BIO being split by the block
>> layer.
>
> At the start of the blk_check_zone_append() function there is a check
> that 'pos' is aligned with a zone boundary. How can 'pos' at the same
> time represent a zone boundary and the exact offset at which the write
> will happen? I do not understand this.

pos indicates the zone to be written, not the actual position the data
will be written to. The actual position will be determined by the drive
when it receives the zone append command.
That position will be, of course, the zone write pointer, which the
drive manages. The actual position the data was written to is returned
by the drive and passed along in the IO stack using
bio->bi_iter.bi_sector.

Example: A thread sends a 4KB zone append write to a zone starting at
sector Z with a write pointer value of W. On submission,
bio->bi_iter.bi_sector must be equal to Z. On completion,
bio->bi_iter.bi_sector will be W. If another zone append is sent, again
bio->bi_iter.bi_sector must be equal to Z and on completion,
bio->bi_iter.bi_sector will be equal to W+8 (since 4KB have been written
with the previous zone append).

Zone append allows an application to send writes to any zone without
knowing the zone write pointer. The only thing the user needs to know is
whether the zone has enough free space or not. For a file system, this
changes the current classic model of (a) allocate blocks, write
allocated blocks, commit metadata, to something like (b) pick a target
zone with free space, write blocks, mark used blocks as allocated,
commit metadata.

>
>>>> +	/* Make sure the BIO is small enough and will not get split */
>>>> +	if (nr_sectors > q->limits.max_zone_append_sectors)
>>>> +		return BLK_STS_IOERR;
>>>
>>> Do we really need a new request queue limit parameter? In which cases
>>> will max_zone_append_sectors differ from the zone size?
>>
>> Yes, it can differ from the zone size. On real hardware,
>> max_zone_append_sectors will most of the time be equal to
>> max_hw_sectors_kb. But it could be smaller than that too. For the host,
>> since a zone append is a write, it is also subject to the
>> max_sectors_kb limit and cannot exceed that value.
>
> Would it be possible to use min(max_hw_sectors, zone_size) instead of
> introducing a new user-visible parameter?

For scsi emulation, that is what is being used to set
max_zone_append_sectors.
But that value may not be appropriate for a ZNS NVMe device, as that
device may have different constraints for regular writes and zone append
writes.

> Thanks,
>
> Bart.

--
Damien Le Moal
Western Digital Research