On Mon, Feb 26, 2024 at 07:25:23AM -0800, Luis Chamberlain wrote:
> The only thing left worth discussing I think is if we want to let users
> opt-in to 4k sector size on a drive which allows 16k atomics and
> prefers 16k for instance...

Thinking about this again, I get a sense things are OK as-is, but let's
review.

I'd also like to clarify that these drives keep a 4k LBA format. The
only things that change are an increased IU and a large atomic. The
nvme driver today sets the physical block size to the min of both. It
is similar to a drive today with a logical block size of 512 but a
physical block size of 4k: that allows you to specify a larger sector
size when creating the filesystem.

After reviewing again our current language for the sysfs parameters in
Documentation/ABI/stable/sysfs-block for the logical and physical block
size, and how nvme_update_disk_info() deals with NPWG (the IU used),
NAWUPF (namespace atomic) and NOWS (optimal write size), we seem to be
using them appropriately already (the driver logic is sketched further
below). However, I think the language used in the original commit
c72758f33784e ("block: Export I/O topology for block devices and
partitions") from May 22, 2009, which clearly outlined the implications
of a read-modify-write, made it even clearer. A later commit
7e5f5fb09e6fc ("block: Update topology documentation") three months
later updated the documentation to remove the read-modify-write
language in favor of "atomic".

Even though we'd like to believe that userspace is doing the right
thing today, I think it would be good to review this now and ensure
we're doing the right thing to keep things transparent.

We have two types of large-capacity NVMe drives to consider. As far as
I can tell, all drives will keep supporting a 512 or 4k LBA format,
which controls the logical block size, so they always remain backward
compatible; it is up to users to format the drives to 512 or 4k. One
type of drive comes without the large NAWUPF (atomic), the other with
it. Both will have a large NPWG (the IU).

The NPWG is used to set minimum_io_size, and so, using the original
commit's language for minimum_io_size, direct IO would do best to rely
on that as its minimum. At least Red Hat's documentation [0] about this
suggests that minimum_io_size will be read by userspace, but for direct
IO it says that IO must be aligned to *multiples of the logical block
size*. That does not clarify for me whether the minimum IO used by
userspace today for direct IO actually relies on minimum_io_size. If it
does, then things already work optimally for these drives.

When a large atomic (NAWUPF) is supported, the physical block size is
lifted, and users can use that to create a filesystem with a sector
size larger than 4k. That certainly helps ensure that at least the
filesystem aligns all metadata and data to the large IU.

After Jan Kara's patches, which prevent writes to the block device once
a filesystem is mounted, userspace is no longer allowed to muck around
with the block device, so userspace IO to the raw block device with,
say, a smaller logical sector size would not be allowed. Since in these
cases the sector size is set to a larger value for the filesystem,
direct IO on the filesystem should respect that preferred larger
sector size.

If userspace is indeed already relying on minimum_io_size correctly, I
am not sure we need to change anything. Drives with a large NPWG would
already get minimum_io_size set for them, and a large atomic would just
lift the physical block size.
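To make that concrete, this is roughly what nvme_update_disk_info()
does with these fields; a simplified paraphrase of
drivers/nvme/host/core.c (integrity, discard and zeroes setup dropped),
so treat it as a sketch rather than verbatim kernel code:

  u32 bs = 1U << ns->lba_shift;	/* from the 512 / 4k LBA format */
  u32 atomic_bs = bs, phys_bs = bs, io_opt = 0;

  /* NAWUPF: namespace atomic write unit, power fail */
  if ((id->nsfeat & NVME_NS_FEAT_ATOMICS) && id->nawupf)
          atomic_bs = (1 + le16_to_cpu(id->nawupf)) * bs;
  else
          atomic_bs = (1 + ns->ctrl->subsys->awupf) * bs;

  if (id->nsfeat & NVME_NS_FEAT_IO_OPT) {
          phys_bs = bs * (1 + le16_to_cpu(id->npwg));	/* NPWG, the IU */
          io_opt = bs * (1 + le16_to_cpu(id->nows));	/* NOWS */
  }

  blk_queue_logical_block_size(disk->queue, bs);
  /* a single physical block is assumed to be power-fail atomic,
   * hence the min() of the IU and the atomic */
  blk_queue_physical_block_size(disk->queue, min(phys_bs, atomic_bs));
  blk_queue_io_min(disk->queue, phys_bs);	/* minimum_io_size */
  blk_queue_io_opt(disk->queue, io_opt);	/* optimal_io_size */

So a drive with a 16k NPWG but a small atomic keeps its small physical
block size and only advertises the IU through minimum_io_size, while a
drive that also has a 16k NAWUPF gets the physical block size lifted
to 16k.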
So I don't think we need to force the logical block size to be 16k if
both NPWG and NAWUPF are 16k. *Iff*, however, we feel we may want to
help userspace further, I wonder whether having the option to lift the
logical block size to the NPWG is desirable.

I did some testing with fio against a virtio drive with a 4k physical
and a 512-byte logical block size, with a 4k block size XFS filesystem
created with a 4k sector size. fio seems to chug along happily if you
issue writes with -bs=512 and even -blockalign=512. Using Daniel
Gomez's ./tools/blkalgn.py tool I still see 512-byte IO commands
issued, and I don't think they failed. But this was against a virtio
drive for convenience; QEMU's NVMe emulation today doesn't let you set
a logical block size different from the physical one, so you'd need
some odd hacks to emulate a large atomic and test something similar.

root@frag ~/bcc (git::blkalgn)# cat /sys/block/vdh/queue/physical_block_size
4096
root@frag ~/bcc (git::blkalgn)# cat /sys/block/vdh/queue/logical_block_size
512

mkfs.xfs -f -b size=4k -s size=4k /dev/vdh

fio -name=ten-1g-per-thread --nrfiles=10 -bs=512 -ioengine=io_uring \
    -direct=1 \
    -blockalign=512 \
    --group_reporting=1 --alloc-size=1048576 --filesize=8KiB \
    --readwrite=write --fallocate=none --numjobs=$(nproc) --create_on_open=1 \
    --directory=/mnt

root@frag ~/bcc (git::blkalgn)# ./tools/blkalgn.py -d vdh

     Block size          : count     distribution
         0 -> 1          : 4        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 79       |****                                    |
      1024 -> 2047       : 320      |********************                    |
      2048 -> 4095       : 638      |****************************************|
      4096 -> 8191       : 161      |**********                              |
      8192 -> 16383      : 0        |                                        |
     16384 -> 32767      : 1        |                                        |

     Algn size           : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 1196     |****************************************|

Userspace can still do silly things, but I expected in the above that
512-byte IOs would not be issued.
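For what it's worth, a well-behaved direct IO application can discover
all of these limits before picking its IO size. A minimal sketch using
only the standard block ioctls (they mirror the sysfs queue/ attributes
used above; the file name is made up):

  /* blklimits.c: print the queue limits a direct IO application
   * should consult before choosing its block size and alignment */
  #include <stdio.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <sys/ioctl.h>
  #include <linux/fs.h>

  int main(int argc, char **argv)
  {
          unsigned int lbs = 0, pbs = 0, io_min = 0, io_opt = 0;
          int fd;

          if (argc != 2)
                  return 1;
          fd = open(argv[1], O_RDONLY);
          if (fd < 0)
                  return 1;

          ioctl(fd, BLKSSZGET, &lbs);    /* logical_block_size: the hard floor */
          ioctl(fd, BLKPBSZGET, &pbs);   /* physical_block_size: RMW/atomic boundary */
          ioctl(fd, BLKIOMIN, &io_min);  /* minimum_io_size: NPWG, the IU */
          ioctl(fd, BLKIOOPT, &io_opt);  /* optimal_io_size: NOWS */

          /* direct IO only has to align to lbs, but to avoid RMW on a
           * large-IU drive it should use io_min as its granularity */
          printf("lbs=%u pbs=%u io_min=%u io_opt=%u\n",
                 lbs, pbs, io_min, io_opt);
          close(fd);
          return 0;
  }

The fio run above is exactly the -bs=512 case: perfectly legal per the
logical block size, just not what the larger hints suggest.

[0] https://access.redhat.com/articles/3911611

  Luis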