Re: [LSF/MM/BPF TOPIC] Improving Zoned Storage Support

On 1/17/24 03:20, Bart Van Assche wrote:
> The advantages of zoned storage are well known [1]:
> * Higher sequential read and random read performance.
> * Lower write amplification.
> * Lower tail latency.
> * Higher usable capacity because of less overprovisioning.
> 
> For many SSDs the L2P (logical to physical translation) table does not
> fit entirely in the memory of the storage device. Zoned storage reduces
> the size of the L2P table significantly and hence makes it much more
> likely that the L2P table fits in the memory of the storage device. If
> zoned storage eliminates L2P table paging, random read performance is
> improved significantly.
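
As a rough illustration of the numbers involved (assuming the common but not
universal case of a 4 KiB mapping granularity with 4-byte L2P entries, and,
for the zoned case, 1 GiB zones with on the order of 64 bytes of per-zone
state such as the write pointer):

  conventional: 1 TiB / 4 KiB * 4 B  = 1 GiB of L2P state per TiB of capacity
  zoned:        1 TiB / 1 GiB * 64 B = 64 KiB of per-zone state per TiB

which is why the mapping state of a zoned device can realistically stay
resident in device memory.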
> 
> A zoned storage SSD does not have to perform garbage collection. Hence,
> write amplification and tail latency are reduced.
> 
> Zoned storage gives file systems control over how files are laid out on
> the storage device. With zoned storage it is possible to allocate a
> contiguous range of storage on the storage medium for a file. This
> improves sequential read performance.
> 
> Log-structured file systems are a good match for zoned storage. Such
> filesystems typically submit large bios to the block layer and have
> multiple bios outstanding concurrently. The block layer splits bios if
> their size exceeds the max_sectors limit (512 KiB for UFS; 128 KiB for a
> popular NVMe controller). This increases the number of concurrently
> outstanding bios further.
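
For concreteness (picking a 2 MiB filesystem bio as an arbitrary example):

  2 MiB / 512 KiB =  4 bios (UFS, max_sectors = 512 KiB)
  2 MiB / 128 KiB = 16 bios (that NVMe controller, max_sectors = 128 KiB)

so even a handful of large sequential writes quickly turns into dozens of
concurrently outstanding bios.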
> 
> While the NVMe standard supports two different commands for writing to
> zoned storage (Write and Zone Append), the SCSI standard only supports a
> single command for writing to zoned storage (WRITE). A Zone Append
> emulation for SCSI exists in drivers/scsi/sd_zbc.c.
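
For readers who have not looked at that code, the idea behind the emulation
can be sketched as follows (a deliberately simplified, standalone userspace C
sketch with made-up types and names, not the actual sd_zbc.c implementation):
a Zone Append is turned into a regular WRITE at the zone's current write
pointer, only one emulated append per zone is allowed at a time, and the LBA
that was used is reported back as a native Zone Append completion would.

#include <stdint.h>
#include <pthread.h>

/* Hypothetical, simplified state; not the actual sd_zbc.c structures. */
struct zone {
        uint64_t wp;            /* current write pointer */
        pthread_mutex_t lock;   /* one emulated append per zone at a time */
};

/* Stand-in for issuing a regular WRITE at a fixed LBA. */
static int submit_write(uint64_t lba, const void *buf, uint32_t nr_blocks)
{
        (void)lba; (void)buf; (void)nr_blocks;
        return 0;               /* pretend the write succeeded */
}

/*
 * Emulated Zone Append: take the LBA from the zone's write pointer, issue
 * a plain WRITE there, advance the write pointer on success and return the
 * LBA the data was written to. Serializing appends per zone is what keeps
 * the write pointer and the issued WRITEs consistent.
 */
static int64_t emulated_zone_append(struct zone *z, const void *buf,
                                    uint32_t nr_blocks)
{
        int64_t lba;
        int ret;

        pthread_mutex_lock(&z->lock);
        lba = (int64_t)z->wp;
        ret = submit_write(z->wp, buf, nr_blocks);
        if (ret == 0)
                z->wp += nr_blocks;
        pthread_mutex_unlock(&z->lock);

        return ret ? -1 : lba;
}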
> 
> File system implementers have to decide whether to use Write or Zone
> Append. While the Zone Append command tolerates reordering, it does not
> let the filesystem control the order in which data is written on the
> medium unless the queue depth is restricted to one. Additionally, write
> operations have lower latency than zone append operations. From [2], a
> paper with performance results for one
> ZNS SSD model: "we observe that the latency of write operations is lower
> than that of append operations, even if the request size is the same".

What is the queue depth for this claim?

> The mq-deadline I/O scheduler serializes zoned writes even if they were
> reordered by the block layer. However, the mq-deadline I/O scheduler,
> just like any other single-queue I/O scheduler, is a performance
> bottleneck for SSDs that support more than 200 K IOPS. Current NVMe and
> UFS 4.0 block devices support more than 200 K IOPS.

FYI, I am about to post a series of 20-something patches that completely
removes zone write locking and replaces it with "zone write plugging". This is
done above the I/O scheduler and also provides zone append emulation for
drives that ask for it (a rough sketch of the idea follows the list below).

With this change:
 - Zone append emulation is moved to the block layer, as a generic
implementation. sd and dm zone append emulation code is removed.
 - Any scheduler can be used, including "none". The special mq-deadline
support for zoned block devices is removed.
 - Overall, a lot less code (the series removes more code than it adds).
 - Reordering problems, such as those caused by I/O priority, are resolved as
well.
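
To give an idea of what "zone write plugging" means without the full series,
here is a deliberately simplified, standalone userspace C sketch (made-up
types, no locking, not the actual patches): each zone gets a small plug that
lets only one write be in flight and holds the others in submission order
until the previous one completes.

#include <stdbool.h>

#define PLUG_DEPTH 64

/* Hypothetical request type; not the kernel's struct request. */
struct request;

struct zone_write_plug {
        struct request *pending[PLUG_DEPTH];    /* plugged writes, FIFO */
        unsigned int head, tail;
        bool write_in_flight;                   /* at most one per zone */
};

/* Stand-in for handing a request to the lower layers / the device. */
static void dispatch(struct request *rq)
{
        (void)rq;
}

/* Submission: dispatch immediately if the zone is idle, otherwise plug. */
static void zone_write_submit(struct zone_write_plug *p, struct request *rq)
{
        if (!p->write_in_flight) {
                p->write_in_flight = true;
                dispatch(rq);
        } else {
                /* No overflow handling in this sketch. */
                p->pending[p->tail++ % PLUG_DEPTH] = rq;
        }
}

/* Completion: unplug the next write for this zone, preserving order. */
static void zone_write_complete(struct zone_write_plug *p)
{
        if (p->head != p->tail)
                dispatch(p->pending[p->head++ % PLUG_DEPTH]);
        else
                p->write_in_flight = false;
}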

This will need a lot of testing, which we are working on. But your help with
testing on UFS devices will be appreciated as well.

> 
> Supporting more than 200 K IOPS and giving the filesystem control over
> the data layout is only possible by supporting multiple outstanding
> writes and by preserving the order of these writes. Hence the proposal
> to discuss this topic during the 2024 edition of the LSF/MM/BPF summit.
> Potential approaches to preserve the order of zoned writes are as follows:
> * Track (e.g. in a hash table) for which zones there are pending zoned
>    writes and submit all zoned writes for a given zone to the same
>    hardware queue (a sketch of this approach follows the list).
> * For SCSI, if a device responds to a zoned write with a unit attention
>    and the block device reports an unaligned write error, activate the
>    error handler, then sort the pending zoned writes by LBA and resubmit
>    them from inside the error handler.
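
To make the first approach above more concrete, here is a minimal standalone
userspace C sketch of the queue selection (invented names and types, and it
ignores the tracking of pending writes itself): hash the zone number so that
every write targeting a given zone is sent to the same hardware queue.

#include <stdint.h>

#define NR_HW_QUEUES 8U

/* Hypothetical write descriptor; not the kernel's struct request. */
struct zoned_write {
        uint64_t lba;           /* first LBA targeted by the write */
        uint32_t nr_blocks;
};

/*
 * Pick a hardware queue for a zoned write so that all writes targeting the
 * same zone end up on the same queue; as long as each queue dispatches in
 * FIFO order, the per-zone write order is preserved. Assumes equally sized
 * zones of 'zone_blocks' blocks.
 */
static unsigned int zoned_write_hw_queue(const struct zoned_write *wr,
                                         uint64_t zone_blocks)
{
        uint64_t zone_no = wr->lba / zone_blocks;

        /* Any stable mapping works; plain modulo is enough for a sketch. */
        return (unsigned int)(zone_no % NR_HW_QUEUES);
}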
> 
> In other words, this proposal is about supporting both the Write and
> Zone Append commands as first-class operations and letting filesystem
> implementers decide which command(s) to use.
> 
> [1] Stavrinos, Theano, Daniel S. Berger, Ethan Katz-Bassett, and Wyatt
> Lloyd. "Don't be a blockhead: zoned namespaces make work on conventional
> SSDs obsolete." In Proceedings of the Workshop on Hot Topics in
> Operating Systems, pp. 144-151. 2021.
> 
> [2] K. Doekemeijer, N. Tehrany, B. Chandrasekaran, M. Bjørling and A.
> Trivedi, "Performance Characterization of NVMe Flash Devices with Zoned
> Namespaces (ZNS)," 2023 IEEE International Conference on Cluster
> Computing (CLUSTER), Santa Fe, NM, USA, 2023, pp. 118-131, doi:
> 10.1109/CLUSTER52292.2023.00018.
> (https://ieeexplore.ieee.org/abstract/document/10319951).
> 

-- 
Damien Le Moal
Western Digital Research




