[LSF/MM/BPF TOPIC] Improving Zoned Storage Support

Bart Van Assche <bvanassche@xxxxxxx> · Tue, 16 Jan 2024 10:20:44 -0800

The advantages of zoned storage are well known [1]:
* Higher sequential read and random read performance.
* Lower write amplification.
* Lower tail latency.
* Higher usable capacity because of less overprovisioning.

For many SSDs the L2P (logical to physical translation) table does not
fit entirely in the memory of the storage device. Zoned storage reduces
the size of the L2P table significantly and hence makes it much more
likely that the L2P table fits in the memory of the storage device. If
zoned storage eliminates L2P table paging, random read performance is
improved significantly.
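
As a rough illustration of the size difference, the sketch below compares
a flat page-level L2P table with a zone-level one. The 4-byte entry size,
the 4 KiB mapping unit, the 1 GiB zone size and the 1 TiB capacity are
assumptions chosen for the example, not figures from any particular
device, and real zoned devices may keep additional per-zone state.

```python
# Back-of-the-envelope L2P table sizing. All parameters are assumptions
# chosen for illustration; real devices differ.
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB
TIB = 1024 * GIB


def l2p_table_bytes(capacity, mapping_unit, entry_size=4):
    """Size of a flat L2P table with one entry per mapping unit."""
    entries = capacity // mapping_unit
    return entries * entry_size


capacity = 1 * TIB
page_mapped = l2p_table_bytes(capacity, mapping_unit=4 * KIB)  # conventional SSD
zone_mapped = l2p_table_bytes(capacity, mapping_unit=1 * GIB)  # one entry per zone

print(f"page-level mapping: {page_mapped // MIB} MiB")
print(f"zone-level mapping: {zone_mapped // KIB} KiB")
```

Under these assumptions the page-level table needs 1 GiB of device memory
while the zone-level table needs a few KiB.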

A zoned storage SSD does not have to perform garbage collection
internally. Hence, write amplification and tail latency are reduced.

Zoned storage gives file systems control over how files are laid out on
the storage device. With zoned storage it is possible to allocate a
contiguous range of storage on the storage medium for a file. This
improves sequential read performance.

Log-structured file systems are a good match for zoned storage. Such
filesystems typically submit large bios to the block layer and have
multiple bios outstanding concurrently. The block layer splits bios if
their size exceeds the max_sectors limit (512 KiB for UFS; 128 KiB for a
popular NVMe controller). This increases the number of concurrently
outstanding bios further.
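
To give a feel for the numbers, here is a minimal sketch that counts the
fragments produced when a bio is split purely by size. The 2 MiB bio size
is an assumption for the example, and the real block layer also takes
other queue limits into account when splitting.

```python
# Count fragments when a bio is split at a size limit (size-only model).
KIB = 1024
MIB = 1024 * KIB


def split_count(bio_bytes, max_bytes):
    """Number of fragments after splitting, using ceiling division."""
    return -(-bio_bytes // max_bytes)


# Limits taken from the text above; the 2 MiB bio is a made-up example.
for name, limit in (("UFS", 512 * KIB), ("NVMe", 128 * KIB)):
    print(f"{name}: 2 MiB bio -> {split_count(2 * MIB, limit)} fragments")
```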

While the NVMe standard supports two different commands for writing to
zoned storage (Write and Zone Append), the SCSI standard only supports a
single command for writing to zoned storage (WRITE). A Zone Append
emulation for SCSI exists in drivers/scsi/sd_zbc.c.

File system implementers have to decide whether to use Write or Zone
Append. While the Zone Append command tolerates reordering, with this
command the filesystem cannot control the order in which the data is
written on the medium without restricting the queue depth to one.
Additionally, the latency of Write operations is lower than that of Zone
Append operations. From [2], a paper with performance results for one
ZNS SSD model: "we observe that the latency of write operations is lower
than that of append operations, even if the request size is the same".
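
The ordering trade-off can be illustrated with a toy model of a single
sequential-write-required zone. This is only a sketch: the Zone class,
the block counts and the tags are hypothetical and do not correspond to
any kernel or device interface.

```python
class Zone:
    """Toy model of one sequential-write-required zone (units: blocks)."""

    def __init__(self, start, size):
        self.start = start
        self.size = size
        self.wp = start      # write pointer
        self.layout = []     # tags in on-media order

    def write(self, lba, nblocks, tag):
        """Regular write: must start exactly at the write pointer."""
        if lba != self.wp or self.wp + nblocks > self.start + self.size:
            raise ValueError(f"unaligned write: lba={lba}, wp={self.wp}")
        self.wp += nblocks
        self.layout.append(tag)

    def append(self, nblocks, tag):
        """Zone append: the device picks the LBA and reports it back."""
        if self.wp + nblocks > self.start + self.size:
            raise ValueError("zone full")
        lba = self.wp
        self.wp += nblocks
        self.layout.append(tag)
        return lba


# The file system wants A, B, C at LBAs 0, 8, 16, but the requests
# arrive at the device in the order B, A, C.
arrival = [(8, "B"), (0, "A"), (16, "C")]

z_write = Zone(start=0, size=64)
failed = []
for lba, tag in arrival:
    try:
        z_write.write(lba, 8, tag)
    except ValueError:
        failed.append(tag)

z_append = Zone(start=0, size=64)
placed = {tag: z_append.append(8, tag) for _, tag in arrival}

print("regular writes that failed:", failed)
print("on-media order with append:", z_append.layout)
```

In this model the reordered regular writes fail, while the appends all
succeed but land on the medium in arrival order rather than in the order
the file system intended.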

The mq-deadline I/O scheduler serializes zoned writes even if the block
layer has reordered them. However, the mq-deadline I/O scheduler,
just like any other single-queue I/O scheduler, is a performance
bottleneck for SSDs that support more than 200 K IOPS. Current NVMe and
UFS 4.0 block devices support more than 200 K IOPS.
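
To check which I/O scheduler a block device currently uses, the sysfs
attribute /sys/block/<dev>/queue/scheduler lists the available schedulers
with the active one in square brackets. A small sketch for parsing it
follows; the helper names and the sample string are illustrative.

```python
import re
from pathlib import Path


def active_scheduler(text):
    """Return the bracketed (active) scheduler name, or None if absent."""
    match = re.search(r"\[([^\]]+)\]", text)
    return match.group(1) if match else None


def read_active_scheduler(dev, sysfs_root="/sys/block"):
    """Read the active scheduler for a block device such as 'sda'."""
    path = Path(sysfs_root) / dev / "queue" / "scheduler"
    return active_scheduler(path.read_text())


# Sample attribute contents; on a real system call read_active_scheduler().
print(active_scheduler("[mq-deadline] kyber bfq none"))
```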

Supporting more than 200 K IOPS and giving the filesystem control over
the data layout is only possible by supporting multiple outstanding
writes and by preserving the order of these writes. Hence the proposal
to discuss this topic during the 2024 edition of the LSF/MM/BPF summit.
Potential approaches to preserve the order of zoned writes are as follows:
* Track (e.g. in a hash table) for which zones there are pending zoned
  writes and submit all zoned writes per zone to the same hardware
  queue.
* For SCSI, if a SCSI device responds to a zoned write with a unit
  attention and the block device reports an unaligned write error,
  activate the error handler, then sort the zoned writes by LBA and
  resubmit them from inside the error handler.
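
The two approaches above can be sketched in user-space terms as follows.
This is a model under stated assumptions, not proposed kernel code: the
ZoneQueueTracker class, its methods and resubmit_order() are hypothetical
names, a Python dict stands in for the hash table, and locking is ignored.

```python
class ZoneQueueTracker:
    """Pins all in-flight writes for a zone to one hardware queue."""

    def __init__(self, zone_size, nr_hw_queues):
        self.zone_size = zone_size
        self.nr_hw_queues = nr_hw_queues
        self.pending = {}  # zone number -> [hw queue index, in-flight count]

    def submit(self, lba, preferred_queue):
        """Return the hardware queue to use for a write starting at lba."""
        zone = lba // self.zone_size
        entry = self.pending.get(zone)
        if entry is None:
            # No pending writes for this zone: any queue may be used.
            entry = [preferred_queue % self.nr_hw_queues, 0]
            self.pending[zone] = entry
        entry[1] += 1
        return entry[0]

    def complete(self, lba):
        """Account for a completed write; forget the zone once it is idle."""
        zone = lba // self.zone_size
        entry = self.pending[zone]
        entry[1] -= 1
        if entry[1] == 0:
            del self.pending[zone]


def resubmit_order(failed_writes):
    """Sort failed zoned writes, given as (lba, nblocks), by LBA."""
    return sorted(failed_writes, key=lambda w: w[0])


tracker = ZoneQueueTracker(zone_size=1024, nr_hw_queues=4)
q_first = tracker.submit(0, preferred_queue=2)
q_same_zone = tracker.submit(8, preferred_queue=3)      # follows zone 0's queue
q_other_zone = tracker.submit(1024, preferred_queue=3)  # free to use queue 3
print(q_first, q_same_zone, q_other_zone)
print(resubmit_order([(16, 8), (0, 8), (8, 8)]))
```

Once all writes for a zone have completed, the entry is dropped so that
the next write to that zone can be steered to any hardware queue.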

In other words, this proposal is about supporting both the Write and
Zone Append commands as first-class operations and letting filesystem
implementers decide which command(s) to use.

[1] Stavrinos, Theano, Daniel S. Berger, Ethan Katz-Bassett, and Wyatt
Lloyd. "Don't be a blockhead: zoned namespaces make work on conventional
SSDs obsolete." In Proceedings of the Workshop on Hot Topics in
Operating Systems, pp. 144-151. 2021.

[2] K. Doekemeijer, N. Tehrany, B. Chandrasekaran, M. Bjørling and A.
Trivedi, "Performance Characterization of NVMe Flash Devices with Zoned
Namespaces (ZNS)," 2023 IEEE International Conference on Cluster
Computing (CLUSTER), Santa Fe, NM, USA, 2023, pp. 118-131, doi:
10.1109/CLUSTER52292.2023.00018.
(https://ieeexplore.ieee.org/abstract/document/10319951).