Christoph added CONFIG_BUFFER_HEAD in v6.1, enabling a world where we can
live without buffer-heads. When we opt into that world we end up also using
the block device cache's address space operations based on iomap. Since
iomap supports higher order folios, block devices which use the iomap aops
can have a logical block size or physical block size greater than
PAGE_SIZE. We refer to these as LBS devices. This in turn allows
filesystems which support bs > 4k to be enabled in a 4k PAGE_SIZE world on
LBS block devices, letting LBS devices take advantage of the recently
posted work to enable LBS support for filesystems [0].

However, disabling CONFIG_BUFFER_HEAD is in practice not viable for many
Linux distributions, since it also means disabling support for most
filesystems other than btrfs and XFS. So we either support larger order
folios on buffer-heads, or we draw up a solution to enable co-existence.
Since at LSFMM 2023 it was decided we would not support larger order folios
on buffer-heads, this series takes the approach of supporting co-existence
between buffer-head based filesystems and the iomap aops, used dynamically
on block devices when needed. Folks who want to enhance buffer-heads to
support larger order folios can do so in parallel to this effort. This just
allows iomap based filesystems to support LBS block devices today, while
also allowing buffer-head based filesystems to be used on them, as long as
the LBS device also supports a logical block size of PAGE_SIZE.

LBS devices can come in a few flavors. Adopting a larger LBA format enables
a logical block size > 4k. Other block device mechanisms can increase a
block device's physical block size beyond 4k while still retaining backward
compatibility with a logical block size of 4k. An example is NVMe drives
with an LBA format of 4k and an npwg and awupf | nawupf of 16k.
Such a drive reports a logical block size of 4k and a physical block size
of 16k. Drives which do this may become a reality as QLC drives with larger
Indirection Units (IUs) adopt IUs larger than 4k to better support larger
capacities. At least OCP 2.0 does require you to expose the IU through
npwg. Whether or not a device supports the same value for awupf | nawupf (a
large atomic) is device specific. Devices which *do* sport an LBA format of
4k and npwg & awupf | nawupf >= 16k can essentially benefit from LBS
support, such as with XFS [1].

While LBS devices come to market, folks can also experiment today with what
some of this might be like on NVMe, by ignoring power failure and faking
the larger atomic up to the NVMe awun. When the NVMe awun is 0xffff it
means awun is MDTS, so for those drives you can in practice experiment
today with LBS up to MDTS on real drives. This is documented in the kdevops
documentation for LBS, using something like
nvme_core.debug_large_atomics=16384 [2]. To experiment with larger LBA
formats you can also use kdevops and enable
CONFIG_QEMU_ENABLE_EXTRA_DRIVE_LARGEIO. That enables a ton of drives with
logical and physical block sizes >= 4k, up to a desirable max target for
experimentation. Since filesystems today only support up to 32k sector
sizes, in practice you may only want to experiment with up to 32k physical
/ logical block sizes. Support for 64k sector sizes requires an XFS format
change, which Daniel Gomez has experimental patches for, in case folks are
interested in messing with them.

Patch 6 could probably be squashed with patch 5, but I wanted to be
explicit about this, as this should be decided with the community. There
might be a better way to deal with the switching of the aops dynamically
than this, ideas welcomed! The NVMe debug patches are posted for pure
experimentation purposes, but the AWUN patch might be welcome upstream.
Unless folks really want it, the goal here is *not* to upstream support for
the debug module parameter nvme_core.debug_large_atomics=16384. In a v3 RFC
I can drop those patches and just aim to simplify the logic to support LBA
formats > ps.

Changes since v1 RFC:

o Modified the approach to accept that we can have different aops per
  super block on the block device cache, as suggested by Christoph
o Try to allow dynamically switching between iomap aops and buffer-head
  aops
o Use the iomap aops if the target block size order or min order > ps
o Loosen restrictions on set_blocksize() so as to just check for the iomap
  aops for now, making it easy to allow buffer-heads to later add support
  for bs > ps as well
o Use the buffer-head aops first always unless the min order is > ps, so
  as to always support buffer-head based filesystems
o Adopt a flag for buffer-heads so we can check for support; buffer-head
  based filesystems cannot be used while an LBS filesystem is being used,
  for instance
o Allow iomap aops to be used when the physical block size is > ps as well
o NVMe changes to experiment with LBS devices without power failure
o NVMe changes to support LBS devices

[0] https://lkml.kernel.org/r/20230608032404.1887046-1-mcgrof@xxxxxxxxxx
[1] https://lkml.kernel.org/r/20230915183848.1018717-1-kernel@xxxxxxxxxxxxxxxx
[2] https://github.com/linux-kdevops/kdevops/blob/master/docs/lbs.md

Luis Chamberlain (10):
  bdev: rename iomap aops
  bdev: dynamically set aops to enable LBS support
  bdev: increase bdev max blocksize depending on the aops used
  filesystems: add filesystem buffer-head flag
  bdev: allow to switch between bdev aops
  bdev: simplify coexistence
  nvme: enhance max supported LBA format check
  nvme: add awun / nawun sanity check
  nvme: add nvme_core.debug_large_atomics to force high awun as phys_bs
  nvme: enable LBS support

 block/bdev.c             | 73 ++++++++++++++++++++++++++++++++++++++--
 block/blk.h              |  1 +
 block/fops.c             | 14 ++++----
 drivers/nvme/host/core.c | 70 +++++++++++++++++++++++++++++++++++---
 drivers/nvme/host/nvme.h |  1 +
 fs/adfs/super.c          |  2 +-
 fs/affs/super.c          |  2 +-
 fs/befs/linuxvfs.c       |  2 +-
 fs/bfs/inode.c           |  2 +-
 fs/efs/super.c           |  2 +-
 fs/exfat/super.c         |  2 +-
 fs/ext2/super.c          |  2 +-
 fs/ext4/super.c          |  7 ++--
 fs/f2fs/super.c          |  2 +-
 fs/fat/namei_msdos.c     |  2 +-
 fs/fat/namei_vfat.c      |  2 +-
 fs/freevxfs/vxfs_super.c |  2 +-
 fs/gfs2/ops_fstype.c     |  4 +--
 fs/hfs/super.c           |  2 +-
 fs/hfsplus/super.c       |  2 +-
 fs/isofs/inode.c         |  2 +-
 fs/jfs/super.c           |  2 +-
 fs/minix/inode.c         |  2 +-
 fs/nilfs2/super.c        |  2 +-
 fs/ntfs/super.c          |  2 +-
 fs/ntfs3/super.c         |  2 +-
 fs/ocfs2/super.c         |  2 +-
 fs/omfs/inode.c          |  2 +-
 fs/qnx4/inode.c          |  2 +-
 fs/qnx6/inode.c          |  2 +-
 fs/reiserfs/super.c      |  2 +-
 fs/super.c               |  3 +-
 fs/sysv/super.c          |  4 +--
 fs/udf/super.c           |  2 +-
 fs/ufs/super.c           |  2 +-
 include/linux/blkdev.h   |  7 ++++
 include/linux/fs.h       |  1 +
 37 files changed, 189 insertions(+), 48 deletions(-)

-- 
2.39.2