Re: [PATCH v11 00/10] Introduce Zone Append for writing to zoned block devices

On 5/12/20 2:55 AM, Johannes Thumshirn wrote:
> The upcoming NVMe ZNS Specification will define a new type of write
> command for zoned block devices, zone append.
> 
> When writing to a zoned block device using zone append, the start
> sector of the write points at the start LBA of the zone to write to.
> Upon completion the block device responds with the position at which
> the data has been placed in the zone. From a high-level perspective
> this is comparable to a file system's block allocator, where the user
> writes to a file and the file system takes care of the data placement
> on the device.
> 
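To illustrate the contract described above: the submitter always targets the
zone start, and the device reports back the sector at which the data actually
landed. Below is a minimal userspace model of that behaviour; it is purely
illustrative, the names are made up and this is not the in-kernel API.

/*
 * Toy model of the zone append contract: the caller passes the zone's
 * start sector, the "device" places the data at its internal write
 * pointer and returns the sector it actually wrote to.
 */
#include <stdint.h>
#include <stdio.h>

#define ZONE_SECTORS	(256 * 2048)	/* 256 MiB zone, 512 B sectors */

struct zone {
	uint64_t start;		/* first sector of the zone */
	uint64_t wp;		/* current write pointer (absolute sector) */
};

static int64_t zone_append(struct zone *z, uint64_t start, uint64_t nr_sects)
{
	uint64_t written;

	if (start != z->start)		/* appends must target the zone start */
		return -1;
	if (z->wp + nr_sects > z->start + ZONE_SECTORS)
		return -1;		/* the zone would overflow */

	written = z->wp;		/* data lands at the write pointer */
	z->wp += nr_sects;		/* device advances the write pointer */
	return written;
}

int main(void)
{
	struct zone z = { .start = 0, .wp = 0 };

	/* Two appends to the same zone start land at consecutive positions. */
	printf("first append written at sector %lld\n",
	       (long long)zone_append(&z, 0, 8));
	printf("second append written at sector %lld\n",
	       (long long)zone_append(&z, 0, 8));
	return 0;
}

The point is that back-to-back appends against the same zone start land at
consecutive positions without the submitter ever tracking the write pointer
itself.
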
> In order to fully exploit the new zone append command in file systems and
> other interfaces above the block layer, we chose to emulate zone append
> in SCSI and null_blk. This way we can have a single write path for both
> file systems and other interfaces above the block layer, like io_uring on
> zoned block devices, without having to care too much about the underlying
> characteristics of the device itself.
> 
> The emulation works by caching each zone's write pointer, so that a zone
> append issued to the disk can be translated into a regular write at the
> current write pointer position. The zone append's start LBA (the zone
> start) is used to derive the zone number for the lookup in the zone write
> pointer offset cache, and the cached offset is then added to that LBA to
> get the actual position to write the data at. In SCSI we then turn the
> REQ_OP_ZONE_APPEND request into a WRITE(16) command. Upon successful
> completion of the WRITE(16), the cache is updated to the new write
> pointer location and the written sector is noted in the request. On error
> the cache entry is marked invalid, and an update of the write pointer is
> scheduled before the next write to that zone is issued.
> 
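The translation and the completion handling can be sketched in userspace
roughly as follows, assuming (as described above) one 32-bit write pointer
offset per zone indexed by zone number; the names and helpers are hypothetical
and not the actual sd driver code.

#include <stdint.h>

#define WP_OFST_INVALID	UINT32_MAX	/* entry needs a refresh via report zones */

struct emu_disk {
	uint64_t zone_sectors;	/* zone size in 512 B sectors */
	uint32_t *wp_ofst;	/* per-zone write pointer offset cache */
};

/* Translate an append targeting 'zone_start' into an absolute write sector. */
int emu_append_to_write(struct emu_disk *d, uint64_t zone_start,
			uint64_t *write_sector)
{
	uint32_t zno = zone_start / d->zone_sectors;

	if (d->wp_ofst[zno] == WP_OFST_INVALID)
		return -1;	/* refresh the write pointer before writing */

	*write_sector = zone_start + d->wp_ofst[zno];
	return 0;
}

/* On completion, advance the cached offset or invalidate it on error. */
void emu_append_complete(struct emu_disk *d, uint64_t zone_start,
			 uint32_t nr_sectors, int error)
{
	uint32_t zno = zone_start / d->zone_sectors;

	if (error)
		d->wp_ofst[zno] = WP_OFST_INVALID;
	else
		d->wp_ofst[zno] += nr_sectors;
}
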
> In order to reduce memory consumption, the only cached item is the offset
> of the write pointer from the start of the zone; everything else can be
> calculated. On an example drive with 52156 zones, the additional memory
> consumption of the cache is thus 52156 * 4 = 208624 bytes, or 51 4 KiB
> pages. The performance impact is negligible for a spinning drive.
> 
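For reference, the arithmetic behind that figure as a trivial standalone
check; the zone count is just the example value quoted above.

/* Cache footprint for the example drive: one 4-byte offset per zone. */
#include <stdio.h>

int main(void)
{
	unsigned long nr_zones = 52156;
	unsigned long bytes = nr_zones * 4;		/* 4 bytes per zone */
	unsigned long pages = (bytes + 4095) / 4096;	/* round up to 4 KiB pages */

	printf("%lu bytes, %lu pages\n", bytes, pages);	/* 208624 bytes, 51 pages */
	return 0;
}
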
> For null_blk the emulation is way simpler, as null_blk's zoned block
> device emulation support already caches the write pointer position, so we
> only need to report the position back to the upper layers. Additional
> caching is not needed here.
> 
> Furthermore we have converted zonefs to use REQ_OP_ZONE_APPEND for
> synchronous direct I/O. Asynchronous I/O still uses the normal path via
> iomap.
> 
> Performance testing with zonefs sync writes on a 14 TB SMR drive and nullblk
> shows good results. On the SMR drive we're not regressing (the performance
> improvement is within noise); on nullblk we could drastically improve
> specific workloads:
> 
> * nullblk:
> 
> Single Thread Multiple Zones
> 				kIOPS	MiB/s	MB/s	% delta
> mq-deadline REQ_OP_WRITE	10.1	631	662
> mq-deadline REQ_OP_ZONE_APPEND	13.2	828	868	+31.12
> none REQ_OP_ZONE_APPEND		15.6	978	1026	+54.98
> 
> 
> Multiple Threads Multiple Zones
> 				kIOPS	MiB/s	MB/s	% delta
> mq-deadline REQ_OP_WRITE	10.2	640	671
> mq-deadline REQ_OP_ZONE_APPEND	10.4	650	681	+1.49
> none REQ_OP_ZONE_APPEND		16.9	1058	1109	+65.28
> 
> * 14 TB SMR drive
> 
> Single Thread Multiple Zones
> 				IOPS	MiB/s	MB/s	% delta
> mq-deadline REQ_OP_WRITE	797	49.9	52.3
> mq-deadline REQ_OP_ZONE_APPEND	806	50.4	52.9	+1.15
> 
> Multiple Threads Multiple Zones
> 				IOPS	MiB/s	MB/s	% delta
> mq-deadline REQ_OP_WRITE	745	46.6	48.9
> mq-deadline REQ_OP_ZONE_APPEND	768	48	50.3	+2.86
> 
> The %-delta is against the baseline of REQ_OP_WRITE with mq-deadline as the
> I/O scheduler.
> 
> The series is based on Jens' for-5.8/block branch with HEAD:
> ae979182ebb3 ("bdi: fix up for "remove the name field in struct backing_dev_info"")

Applied for 5.8, thanks.

-- 
Jens Axboe



