This series introduces a proposal for implementing atomic writes in the
kernel.

This series takes the approach of adding a new "atomic" flag to each of
pwritev2() and iocb->ki_flags - RWF_ATOMIC and IOCB_ATOMIC, respectively.
When set, these indicate that we want the write issued "atomically". I
have seen a similar flag for pwritev2() touted on the lists previously.

Only direct IO is supported, and only for block devices and xfs.

The atomic writes feature requires dedicated HW support, such as the SCSI
WRITE_ATOMIC_16 command.

The goal here is to provide an interface that allows applications to use
application-specific block sizes larger than the logical block size
reported by the storage device or larger than the filesystem block size
as reported by stat().

With this new interface, application blocks will never be torn or
fractured. On a power failure, for each individual application block,
either all or none of the data is written. A racing atomic write and
read will mean that the read sees all the old data or all the new data,
but never a mix of old and new.

Two new fields are added to struct statx - atomic_write_unit_min and
atomic_write_unit_max. These values are always a power-of-two and
indicate the inclusive min and max block size which the userspace
application may use. The application block size must itself be a
power-of-two. For each individual atomic write, the total length of the
write must be a multiple of this application block size, and the write
must start at a file offset which is naturally aligned on that block
size. Otherwise, the kernel cannot know the application block size and
what sort of splitting into BIOs is permissible.

The kernel guarantees to write at least each individual application
block atomically. However, there is no guarantee that all the data for a
write spanning multiple blocks is written atomically. As an example of
usage, for a 32KB application block size, userspace may request a 64KB
write at 96KB offset, which the kernel will submit to HW as 2x 32KB
individual atomic write operations.
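The length and alignment rules above can be sketched as a small
userspace check. This is only an illustration and not code from this
series: atomic_write_ok() and atomic_write_units() are hypothetical
helper names, and unit_min/unit_max stand in for the values an
application would read from the new statx fields. How the kernel itself
derives the split is not shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if v is a non-zero power of two. */
static bool is_pow2(uint64_t v)
{
	return v != 0 && (v & (v - 1)) == 0;
}

/*
 * Hypothetical helper (not part of this series): check one atomic
 * write request against the rules in the cover letter.
 * unit_min/unit_max are assumed to come from statx
 * atomic_write_unit_min/atomic_write_unit_max.
 */
static bool atomic_write_ok(uint64_t offset, uint64_t len,
			    uint64_t app_block,
			    uint64_t unit_min, uint64_t unit_max)
{
	/* Application block size must be a power-of-two. */
	if (!is_pow2(app_block))
		return false;
	/* It must lie within the inclusive [min, max] range. */
	if (app_block < unit_min || app_block > unit_max)
		return false;
	/* Total length must be a non-zero multiple of the block size. */
	if (len == 0 || len % app_block)
		return false;
	/* Offset must be naturally aligned on the block size. */
	if (offset % app_block)
		return false;
	return true;
}

/*
 * Number of application blocks in a valid request. Each block is
 * written atomically; the request as a whole is not.
 */
static uint64_t atomic_write_units(uint64_t offset, uint64_t len,
				   uint64_t app_block,
				   uint64_t unit_min, uint64_t unit_max)
{
	assert(atomic_write_ok(offset, len, app_block, unit_min, unit_max));
	return len / app_block;
}
```

For the example in the text, and assuming a device that reports a 4KB
minimum and 64KB maximum, atomic_write_units(96 * 1024, 64 * 1024,
32 * 1024, 4096, 65536) gives 2, matching the 2x 32KB atomic writes
described above. The same request at a 16KB offset would be rejected,
since 16KB is not naturally aligned on 32KB.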
Since xfs uses iomap, and extents there may be discontiguous, we must
ensure that extents have specific alignments to support atomic writes.
For this, we add a new experimental variant of fallocate for xfs,
fallocate2, which takes an alignment arg and should align any extents on
that value. In practice, the alignment must be the same value as
atomic_write_unit_max for the backing block device. This allows the user
to submit atomic writes which may span multiple discontiguous extents.
This does not fully work yet, as extents may later change and any new
extents will not know about this initial alignment requirement. Another
option is to use XFS realtime volumes, which do allow alignment to be
specified via the extsize arg. In both cases, we should ensure extents
are in written state prior to any atomic writes.

SCSI sd.c, scsi_debug and NVMe kernel support is added. We also have
QEMU NVMe support, which we hope to share in the coming days.

We are sending this as an RFC so we can share the code prior to LSFMM.

This series is based on v6.3.

Alan Adamson (1):
  nvme: Support atomic writes

Allison Henderson (1):
  xfs: Add support for fallocate2

Himanshu Madhani (2):
  block: Add atomic write operations to request_queue limits
  block: Add REQ_ATOMIC flag

John Garry (10):
  xfs: Support atomic write for statx
  block: Limit atomic writes according to bio and queue limits
  block: Add bdev_find_max_atomic_write_alignment()
  block: Add support for atomic_write_unit
  block: Add blk_validate_atomic_write_op()
  block: Add fops atomic write support
  fs: iomap: Atomic write support
  scsi: sd: Support reading atomic properties from block limits VPD
  scsi: sd: Add WRITE_ATOMIC_16 support
  scsi: scsi_debug: Atomic write support

Prasad Singamsetty (2):
  fs/bdev: Add atomic write support info to statx
  fs: Add RWF_ATOMIC and IOCB_ATOMIC flags for atomic write support

 Documentation/ABI/stable/sysfs-block |  42 ++
 block/bdev.c                         |  60 +++
 block/bio.c                          |   7 +-
 block/blk-core.c                     |  28 ++
 block/blk-merge.c                    |  84 +++-
 block/blk-settings.c                 |  73 ++++
 block/blk-sysfs.c                    |  33 ++
 block/fops.c                         |  56 ++-
 drivers/nvme/host/core.c             |  33 ++
 drivers/scsi/scsi_debug.c            | 593 +++++++++++++++++++++------
 drivers/scsi/scsi_trace.c            |  22 +
 drivers/scsi/sd.c                    |  54 ++-
 drivers/scsi/sd.h                    |   7 +
 fs/iomap/direct-io.c                 |  72 +++-
 fs/stat.c                            |  10 +
 fs/xfs/Makefile                      |   1 +
 fs/xfs/libxfs/xfs_attr_remote.c      |   2 +-
 fs/xfs/libxfs/xfs_bmap.c             |   9 +-
 fs/xfs/libxfs/xfs_bmap.h             |   4 +-
 fs/xfs/libxfs/xfs_da_btree.c         |   4 +-
 fs/xfs/libxfs/xfs_fs.h               |   1 +
 fs/xfs/xfs_bmap_util.c               |   7 +-
 fs/xfs/xfs_bmap_util.h               |   2 +-
 fs/xfs/xfs_dquot.c                   |   2 +-
 fs/xfs/xfs_file.c                    |  19 +-
 fs/xfs/xfs_fs_staging.c              |  99 +++++
 fs/xfs/xfs_fs_staging.h              |  21 +
 fs/xfs/xfs_ioctl.c                   |   4 +
 fs/xfs/xfs_iomap.c                   |   4 +-
 fs/xfs/xfs_iops.c                    |  10 +
 fs/xfs/xfs_reflink.c                 |   4 +-
 fs/xfs/xfs_rtalloc.c                 |   2 +-
 fs/xfs/xfs_symlink.c                 |   2 +-
 include/linux/blk_types.h            |   4 +
 include/linux/blkdev.h               |  36 ++
 include/linux/fs.h                   |   1 +
 include/linux/stat.h                 |   2 +
 include/scsi/scsi_proto.h            |   1 +
 include/uapi/linux/fs.h              |   5 +-
 include/uapi/linux/stat.h            |   7 +-
 security/security.c                  |   1 +
 tools/include/uapi/linux/fs.h        |   5 +-
 42 files changed, 1257 insertions(+), 176 deletions(-)
 create mode 100644 fs/xfs/xfs_fs_staging.c
 create mode 100644 fs/xfs/xfs_fs_staging.h

-- 
2.31.1