This series introduces a proposal for implementing atomic writes in the
kernel for torn-write protection.

This series takes the approach of adding a new "atomic" flag to each of
pwritev2() and iocb->ki_flags - RWF_ATOMIC and IOCB_ATOMIC, respectively.
When set, these indicate that we want the write issued "atomically".

Only direct IO is supported, and only for block devices here. For this,
atomic write HW is required, like SCSI ATOMIC WRITE (16).

XFS FS support has previously been posted at:
https://lore.kernel.org/linux-xfs/20240304130428.13026-1-john.g.garry@xxxxxxxxxx/

I am working on a new version of that series, which I hope to post soon.

Updated man pages have been posted at:
https://lore.kernel.org/lkml/20240124112731.28579-1-john.g.garry@xxxxxxxxxx/T/#m520dca97a9748de352b5a723d3155a4bb1e46456

The goal here is to provide an interface that allows applications to use
application-specific block sizes larger than the logical block size
reported by the storage device or larger than the filesystem block size
as reported by stat().

With this new interface, application blocks will never be torn or
fractured when written. On a power failure, for each individual
application block, either all or none of the data is written. A racing
atomic write and read will mean that the read sees all the old data or
all the new data, but never a mix of old and new.

Three new fields are added to struct statx - atomic_write_unit_min,
atomic_write_unit_max, and atomic_write_segments_max. For each individual
atomic write, the total length of the write must be between
atomic_write_unit_min and atomic_write_unit_max, inclusive, and a
power-of-2. The write must also be at a natural offset in the file wrt
the write length, i.e. the file offset must be a multiple of the write
length. For pwritev2, iovcnt is limited by atomic_write_segments_max.
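To make the rules above concrete, here is a minimal userspace sketch of
the checks an application might do before issuing an atomic write. It is
illustrative only and is not code from this series: struct atomic_limits
and atomic_write_ok() are hypothetical names, the limits are assumed to
have been read from the three new statx fields beforehand, and treating
a zero unit_min/unit_max as "atomic writes not supported" is an
assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the three statx fields described above. */
struct atomic_limits {
	uint32_t unit_min;	/* atomic_write_unit_min, in bytes */
	uint32_t unit_max;	/* atomic_write_unit_max, in bytes */
	uint32_t segments_max;	/* atomic_write_segments_max */
};

static bool is_power_of_2(uint64_t v)
{
	return v != 0 && (v & (v - 1)) == 0;
}

/*
 * Return true if a write of 'len' bytes at file offset 'offset', split
 * over 'iovcnt' iovecs, satisfies the rules in the cover letter.
 */
static bool atomic_write_ok(const struct atomic_limits *lim,
			    uint64_t offset, uint64_t len, int iovcnt)
{
	/* Assumption: zero limits mean atomic writes are unsupported. */
	if (lim->unit_min == 0 || lim->unit_max == 0)
		return false;

	/* Total length must lie in [unit_min, unit_max], inclusive. */
	if (len < lim->unit_min || len > lim->unit_max)
		return false;

	/* Total length must be a power-of-2. */
	if (!is_power_of_2(len))
		return false;

	/* Natural alignment: offset must be a multiple of the length. */
	if (offset & (len - 1))
		return false;

	/* For pwritev2, iovcnt is limited by segments_max. */
	if (iovcnt < 1 || (uint64_t)iovcnt > lim->segments_max)
		return false;

	return true;
}
```

A caller would run this check first and only then issue the write with
pwritev2() and RWF_ATOMIC on a file opened for direct IO. For example,
with unit_min = 4096 and unit_max = 65536, an 8192-byte write at offset
4096 is rejected because the offset is not a multiple of the length.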

There has been some discussion on supporting buffered IO and whether the
API is suitable, like:
https://lore.kernel.org/linux-nvme/ZeembVG-ygFal6Eb@xxxxxxxxxxxxxxxxxxxx/

Specifically, the concern is that supporting a range of atomic IO sizes
in the pagecache is complex. For this, my idea is that FSes can fix
atomic_write_unit_min and atomic_write_unit_max at the same size, the
extent alignment size, which should be easier to support. We may need to
implement O_ATOMIC to avoid mixing atomic and non-atomic IOs for this. I
have no proposed solution for atomic write buffered IO for bdev file
operations, but I know of no requirement for this.

SCSI sd.c and scsi_debug and NVMe kernel support is added.

This series is based on v6.9-rc1

Patches can be found at:
https://github.com/johnpgarry/linux/commits/atomic-writes-v6.9-v6

Changes since v5:
- Rebase and update NVMe support for new request_queue limits API
  - Keith, please check since I still have your RB tag
- Change request_queue limits to byte-based sizes to suit new queue
  limits API
- Pass rw_type to io_uring io_rw_init_file() (Jens)
- Add BLK_STS_INVAL
- Don't check size in generic_atomic_write_valid()

Alan Adamson (1):
  nvme: Atomic write support

John Garry (6):
  block: Pass blk_queue_get_max_sectors() a request pointer
  block: Call blkdev_dio_unaligned() from blkdev_direct_IO()
  block: Add core atomic write support
  block: Add fops atomic write support
  scsi: sd: Atomic write support
  scsi: scsi_debug: Atomic write support

Prasad Singamsetty (3):
  fs: Initial atomic write support
  fs: Add initial atomic write support info to statx
  block: Add atomic write support for statx

 Documentation/ABI/stable/sysfs-block |  52 +++
 block/bdev.c                         |  36 +-
 block/blk-core.c                     |  19 +
 block/blk-merge.c                    |  98 ++++-
 block/blk-mq.c                       |   2 +-
 block/blk-settings.c                 | 109 +++++
 block/blk-sysfs.c                    |  33 ++
 block/blk.h                          |   9 +-
 block/fops.c                         |  47 ++-
 drivers/nvme/host/core.c             |  49 +++
 drivers/scsi/scsi_debug.c            | 588 +++++++++++++++++++++------
 drivers/scsi/scsi_trace.c            |  22 +
 drivers/scsi/sd.c                    |  93 ++++-
 drivers/scsi/sd.h                    |   8 +
 fs/aio.c                             |   8 +-
 fs/btrfs/ioctl.c                     |   2 +-
 fs/read_write.c                      |   2 +-
 fs/stat.c                            |  50 ++-
 include/linux/blk_types.h            |   8 +-
 include/linux/blkdev.h               |  67 ++-
 include/linux/fs.h                   |  36 +-
 include/linux/stat.h                 |   3 +
 include/scsi/scsi_proto.h            |   1 +
 include/trace/events/scsi.h          |   1 +
 include/uapi/linux/fs.h              |   5 +-
 include/uapi/linux/stat.h            |   9 +-
 io_uring/rw.c                        |   8 +-
 27 files changed, 1173 insertions(+), 192 deletions(-)

-- 
2.31.1