The advantages of zoned storage are well known [1]:

* Higher sequential read and random read performance.
* Lower write amplification.
* Lower tail latency.
* Higher usable capacity because less overprovisioning is needed.

For many SSDs the L2P (logical-to-physical translation) table does not fit entirely in the memory of the storage device. Zoned storage reduces the size of the L2P table significantly and hence makes it much more likely that the table fits in device memory. (As an illustration, with 4-byte entries at 4 KiB granularity, the full L2P table of a 1 TiB conventional SSD is on the order of 1 GiB.) If zoned storage eliminates L2P table paging, random read performance improves significantly.

A zoned storage SSD does not have to perform garbage collection. Hence, write amplification and tail latency are reduced.

Zoned storage gives file systems control over how files are laid out on the storage device. With zoned storage it is possible to allocate a contiguous range of the storage medium for a file, which improves sequential read performance.

Log-structured file systems are a good match for zoned storage. Such file systems typically submit large bios to the block layer and keep multiple bios outstanding concurrently. The block layer splits bios whose size exceeds the max_sectors limit (512 KiB for UFS; 128 KiB for a popular NVMe controller), which increases the number of concurrently outstanding bios further. With a 512 KiB limit, for example, a single 2 MiB bio becomes four outstanding bios.

While the NVMe standard supports two commands for writing to zoned storage (Write and Zone Append), the SCSI standard supports only a single command for writing to zoned storage (WRITE). A Zone Append emulation for SCSI exists in drivers/scsi/sd_zbc.c. File system implementers have to decide whether to use Write or Zone Append. While the Zone Append command tolerates reordering, with this command the file system cannot control the order in which data is written on the medium unless it restricts the queue depth to one. Additionally, the latency of write operations is lower than that of zone append operations. From [2], a paper with performance results for one ZNS SSD model: "we observe that the latency of write operations is lower than that of append operations, even if the request size is the same".

The mq-deadline I/O scheduler serializes zoned writes even if they have been reordered by the block layer. However, mq-deadline, just like any other single-queue I/O scheduler, is a performance bottleneck for SSDs that support more than 200 K IOPS, and current NVMe and UFS 4.0 block devices exceed that rate. Supporting more than 200 K IOPS while giving the file system control over the data layout is only possible by supporting multiple outstanding writes and by preserving their order. Hence this proposal to discuss the topic at the 2024 edition of the LSF/MM/BPF summit.

Potential approaches for preserving the order of zoned writes are:

* Track (e.g. in a hash table) which zones have pending zoned writes and submit all zoned writes for a zone to the same hardware queue (see the sketch below).
* For SCSI, if a device responds to a zoned write with a unit attention and the block device subsequently reports an unaligned write error, activate the error handler; from inside the error handler, sort the pending zoned writes by LBA and resubmit them.

In other words, this proposal is about supporting both the Write and Zone Append commands as first-class operations and about letting file system implementers decide which command(s) to use.
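To make the difference between the two write paths concrete, here is a minimal sketch of how a file system could submit a Zone Append bio through the in-kernel block API (using the bio_alloc() signature of recent kernels). The function example_zone_append() and its error handling are illustrative only and are not taken from an existing file system; the key point is that the caller passes the zone start sector and learns the actual write location from bio->bi_iter.bi_sector upon completion.

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/printk.h>

/* Illustrative only: write one page to a zone via Zone Append. */
static int example_zone_append(struct block_device *bdev, struct page *page,
			       unsigned int len, sector_t zone_start_sector)
{
	struct bio *bio;
	int ret;

	/* REQ_OP_ZONE_APPEND: the device chooses where in the zone to write. */
	bio = bio_alloc(bdev, 1, REQ_OP_ZONE_APPEND, GFP_NOIO);
	bio->bi_iter.bi_sector = zone_start_sector; /* zone start, not the write pointer */
	__bio_add_page(bio, page, len, 0);

	ret = submit_bio_wait(bio);
	if (ret == 0)
		pr_info("data written at sector %llu\n",
			(unsigned long long)bio->bi_iter.bi_sector);
	bio_put(bio);
	return ret;
}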
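As a rough illustration of the first approach listed above (keeping all writes for a zone on one hardware queue), the sketch below derives a zone number from the start sector of a zoned write and maps it onto a hardware queue index. The function name is hypothetical, the zone size is assumed to be a power of two (which the block layer requires), and a real implementation would also have to track zones with writes already in flight, e.g. in the hash table mentioned above.

#include <linux/blkdev.h>
#include <linux/log2.h>

/*
 * Hypothetical helper: pick a hardware queue for a zoned write such that all
 * writes targeting the same zone end up on the same queue and hence cannot be
 * reordered relative to each other by queue selection.
 */
static unsigned int example_zoned_write_hctx(struct request_queue *q,
					     sector_t sector)
{
	/* For zoned block devices, chunk_sectors holds the zone size. */
	unsigned int zone_no = sector >> ilog2(q->limits.chunk_sectors);

	return zone_no % q->nr_hw_queues;
}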
"Don't be a blockhead: zoned namespaces make work on conventional SSDs obsolete." In Proceedings of the Workshop on Hot Topics in Operating Systems, pp. 144-151. 2021. [2] K. Doekemeijer, N. Tehrany, B. Chandrasekaran, M. Bjørling and A. Trivedi, "Performance Characterization of NVMe Flash Devices with Zoned Namespaces (ZNS)," 2023 IEEE International Conference on Cluster Computing (CLUSTER), Santa Fe, NM, USA, 2023, pp. 118-131, doi: 10.1109/CLUSTER52292.2023.00018. (https://ieeexplore.ieee.org/abstract/document/10319951).