#Motivation:
There are ZNS drives in production and deployment whose zone size is not a
power of 2 (PO2). The NVMe ZNS specification does not require PO2 zone
sizes, but the Linux block layer currently requires zoned devices to have
them. As a result, in-kernel users such as F2FS and BTRFS, as well as
userspace applications, are designed on the assumption that zone sizes are
PO2.

This patchset adds support for non-power_of_2 (NPO2) zoned devices by adding
an emulation layer for NVMe ZNS devices, without affecting existing
applications and without regressing the current upstream implementation.

#Implementation:
A new callback is added to the block device operations (fops). It is called
when the driver needs special handling on discovering a non-power_of_2
zoned device. This patchset adds support only to NVMe ZNS and to the null
block driver (to measure performance). The SCSI ZAC/ZBC implementation is
untouched.

Emulation is done by a static remapping of the zones in the host only:
whenever a request is sent to the device via the block layer, its sector is
transformed to the actual device sector.

#Testing:
Two things need to be tested: that there is no regression on the upstream
implementation for PO2 zone sizes, and that the emulation itself works.

For an apples-to-apples comparison, the following device specs were chosen
for testing (both on null_blk and QEMU):
PO2 device:  zone.size=128M zone.cap=96M
NPO2 device: zone.size=96M zone.cap=96M

##Regression:
These tests are done on a **PO2 device**.
PO2 device used:
	zone.size=128M zone.cap=96M

###blktests:
Blktests were executed with the following config:

TEST_DEVS=(/dev/nvme0n2)
TIMEOUT=100
RUN_ZONED_TESTS=1

The block and zbd tests were run and no regressions were found.

###Performance:
Performance tests were performed on a null blk device.
The following fio command was used to measure the performance:

fio --name=zbc --filename=/dev/nullb0 --direct=1 --zonemode=zbd --size=23G
--io_size=<iosize> --ioengine=io_uring --iodepth=<iod> --rw=<mode> --bs=4k
--loops=4

No regressions were found with the patches on a **PO2 device** compared
to the existing upstream implementation.

The following results are an average of 4 runs on an AMD Ryzen 5 5600X with
32GB of RAM:

Sequential write:
x-----------------x---------------------------------x---------------------------------x
|     IOdepth     |            1                    |            4                    |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| Without patches |  155     |  604     |   6.00    |  426     |  1663    |   8.77    |
x-----------------x---------------------------------x---------------------------------x
| With patches    |  157     |  613     |   5.92    |  425     |  1741    |   8.79    |
x-----------------x---------------------------------x---------------------------------x

x-----------------x---------------------------------x---------------------------------x
|     IOdepth     |            8                    |            16                   |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| Without patches |  607     |  2370    |   12.06   |  622     |  2431    |   23.61   |
x-----------------x---------------------------------x---------------------------------x
| With patches    |  621     |  2425    |   11.80   |  633     |  2472    |   23.24   |
x-----------------x---------------------------------x---------------------------------x

Sequential read:
x-----------------x---------------------------------x---------------------------------x
|     IOdepth     |            1                    |            4                    |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| Without patches |  165     |  643     |   5.72    |  485     |  1896    |   8.03    |
x-----------------x---------------------------------x---------------------------------x
| With patches    |  167     |  654     |   5.62    |  483     |  1888    |   8.06    |
x-----------------x---------------------------------x---------------------------------x

x-----------------x---------------------------------x---------------------------------x
|     IOdepth     |            8                    |            16                   |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| Without patches |  696     |  2718    |   11.29   |  692     |  2701    |   22.92   |
x-----------------x---------------------------------x---------------------------------x
| With patches    |  696     |  2718    |   11.29   |  730     |  2835    |   21.70   |
x-----------------x---------------------------------x---------------------------------x

Random read:
x-----------------x---------------------------------x---------------------------------x
|     IOdepth     |            1                    |            4                    |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| Without patches |  159     |  623     |   5.86    |  451     |  1760    |   8.58    |
x-----------------x---------------------------------x---------------------------------x
| With patches    |  163     |  635     |   5.75    |  462     |  1806    |   8.36    |
x-----------------x---------------------------------x---------------------------------x

x-----------------x---------------------------------x---------------------------------x
|     IOdepth     |            8                    |            16                   |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| Without patches |  544     |  2124    |   14.44   |  553     |  2162    |   28.64   |
x-----------------x---------------------------------x---------------------------------x
| With patches    |  554     |  2165    |   14.15   |  556     |  2171    |   28.52   |
x-----------------x---------------------------------x---------------------------------x

##Emulated device:
NPO2 device:
	zone.size=96M zone.cap=96M

###blktests:
Blktests were executed with the following config:

TEST_DEVS=(/dev/nvme0n2)
TIMEOUT=100
RUN_ZONED_TESTS=1

The block and zbd tests were run and they pass.

###Performance:
Performance tests were performed on a null blk device. The following fio
command was used to measure the performance:

fio --name=zbc --filename=/dev/nullb0 --direct=1 --zonemode=zbd --size=23G
--io_size=<iosize> --ioengine=io_uring --iodepth=<iod> --rw=<mode> --bs=4k
--loops=4

On average, the NPO2 devices showed a performance degradation of less than
1% compared to the PO2 devices.
The following results are an average of 4 runs on an AMD Ryzen 5 5600X with
32GB of RAM:

Sequential write:
x-----------------x---------------------------------x---------------------------------x
|     IOdepth     |            1                    |            4                    |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| With patches    |  155     |  606     |   5.99    |  424     |  1655    |   8.83    |
x-----------------x---------------------------------x---------------------------------x

x-----------------x---------------------------------x---------------------------------x
|     IOdepth     |            8                    |            16                   |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| With patches    |  609     |  2378    |   12.04   |  620     |  2421    |   23.75   |
x-----------------x---------------------------------x---------------------------------x

Sequential read:
x-----------------x---------------------------------x---------------------------------x
|     IOdepth     |            1                    |            4                    |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| With patches    |  160     |  623     |   5.91    |  481     |  1878    |   8.11    |
x-----------------x---------------------------------x---------------------------------x

x-----------------x---------------------------------x---------------------------------x
|     IOdepth     |            8                    |            16                   |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| With patches    |  696     |  2720    |   11.28   |  722     |  2819    |   21.96   |
x-----------------x---------------------------------x---------------------------------x

Random read:
x-----------------x---------------------------------x---------------------------------x
|     IOdepth     |            1                    |            4                    |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| With patches    |  155     |  607     |   6.03    |  465     |  1817    |   8.31    |
x-----------------x---------------------------------x---------------------------------x

x-----------------x---------------------------------x---------------------------------x
|     IOdepth     |            8                    |            16                   |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| With patches    |  552     |  2158    |   14.21   |  561     |  2190    |   28.27   |
x-----------------x---------------------------------x---------------------------------x

#TODO:
- The current implementation only works for the NVMe PCI transport, to limit
  the scope and impact. Support for NVMe target will follow soon.
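For readers who want a concrete picture of the static remapping described
under #Implementation, here is a minimal userspace sketch. It is not code
from the patches. It assumes that the host sees each zone at a stride equal
to the device zone size rounded up to the next power of 2, and that device
zones are laid out back to back. The names npo2_map, host_to_dev_sector and
dev_to_host_sector are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical remapping state; the names are illustrative only. */
struct npo2_map {
	uint64_t dev_zone_sectors;	/* real (possibly NPO2) zone size */
	uint64_t emu_zone_sectors;	/* PO2 zone size seen by the host */
};

/* Round v up to the next power of 2 (v must be non-zero). */
static inline uint64_t roundup_pow2(uint64_t v)
{
	uint64_t p = 1;

	while (p < v)
		p <<= 1;
	return p;
}

/* Assumption: the emulated zone size is the next PO2 >= device zone size. */
static inline struct npo2_map npo2_map_init(uint64_t dev_zone_sectors)
{
	struct npo2_map m = {
		.dev_zone_sectors = dev_zone_sectors,
		.emu_zone_sectors = roundup_pow2(dev_zone_sectors),
	};

	return m;
}

/* Translate a host (emulated) sector to the actual device sector. */
static inline uint64_t host_to_dev_sector(const struct npo2_map *m,
					  uint64_t host_sector)
{
	uint64_t zone = host_sector / m->emu_zone_sectors;
	uint64_t offset = host_sector % m->emu_zone_sectors;

	/* Sectors in the emulated gap past the real zone end are invalid. */
	assert(offset < m->dev_zone_sectors);
	return zone * m->dev_zone_sectors + offset;
}

/* Translate a device sector (e.g. from a zone report) back to the host. */
static inline uint64_t dev_to_host_sector(const struct npo2_map *m,
					  uint64_t dev_sector)
{
	uint64_t zone = dev_sector / m->dev_zone_sectors;
	uint64_t offset = dev_sector % m->dev_zone_sectors;

	return zone * m->emu_zone_sectors + offset;
}
```

With the NPO2 test device above (zone.size=96M, i.e. 196608 sectors of 512
bytes), this sketch would expose 128M zones (262144 sectors) to the host,
and host sector 524296 (zone 2, offset 8) would map to device sector 393224.
For a device whose zone size is already PO2, both translations are the
identity.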

Pankaj Raghav (6):
  nvme: zns: Allow ZNS drives that have non-power_of_2 zone size
  block: Add npo2_zone_setup callback to block device fops
  block: add a bool member to request_queue for power_of_2 emulation
  nvme: zns: Add support for power_of_2 emulation to NVMe ZNS devices
  null_blk: forward the sector value from null_handle_memory_backend
  null_blk: Add support for power_of_2 emulation to the null blk device

 block/blk-zoned.c                 |   3 +
 drivers/block/null_blk/main.c     |  18 +--
 drivers/block/null_blk/null_blk.h |  12 ++
 drivers/block/null_blk/zoned.c    | 203 ++++++++++++++++++++++++++----
 drivers/nvme/host/core.c          |  28 +++--
 drivers/nvme/host/nvme.h          | 100 ++++++++++++++-
 drivers/nvme/host/pci.c           |   4 +
 drivers/nvme/host/zns.c           |  86 +++++++++++--
 include/linux/blk-mq.h            |   2 +
 include/linux/blkdev.h            |  25 ++++
 10 files changed, 428 insertions(+), 53 deletions(-)

-- 
2.25.1