Re: [PATCH 2/4] readv.2: Document RWF_ATOMIC flag

John Garry <john.g.garry@xxxxxxxxxx> · Tue, 24 Oct 2023 13:35:33 +0100

On 09/10/2023 22:05, Darrick J. Wong wrote:
If the file range is a sparse hole, the directio setup will allocate
space and create an unwritten mapping before issuing the write bio.  The
rest of the process works the same as preallocations and has the same
behaviors.

If the file range is allocated and was previously written, the write is
issued and that's all that's needed from the fs.  After a crash, reads
of the storage device produce the old contents or the new contents.
This is exactly what I explained when reviewing the code that
rejected RWF_ATOMIC without O_DSYNC on metadata dirty inodes.
I'm glad we agree. 😄

John, when you're back from vacation, can we get rid of this language
and all those checks under _is_dsync() in the iomap patch?

(That code is 100% the result of me handwaving and bellyaching 6 months
ago when the team was trying to get all the atomic writes bits working
prior to LSF and I was too burned out to think the xfs part through.
As a result, I decided that we'd only support strict overwrites for the
first iteration.)

So this following additive code in iomap_dio_bio_iter() should be dropped:

----8<-----

--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -275,10 +275,11 @@ static inline blk_opf_t 
iomap_dio_bio_opflags(struct iomap_dio *dio,
 static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 		struct iomap_dio *dio)
 {

...

@@ -292,6 +293,13 @@ static loff_t iomap_dio_bio_iter(const struct 
iomap_iter *iter,
 	    !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
 		return -EINVAL;

+	if (atomic_write && !iocb_is_dsync(dio->iocb)) {
+		if (iomap->flags & IOMAP_F_DIRTY)
+			return -EIO;
+		if (iomap->type != IOMAP_MAPPED)
+			return -EIO;
+	}
+

---->8-----

ok?


Summarizing:

An (ATOMIC|SYNC) request provides the strongest guarantees (data
will not be torn, and all file metadata updates are persisted before
the write is returned to userspace.  Programs see either the old data or
the new data, even if there's a crash.

(ATOMIC|DSYNC) is less strong -- data will not be torn, and any file
updates for just that region are persisted before the write is returned.

(ATOMIC) is the least strong -- data will not be torn.  Neither the
filesystem nor the device make guarantees that anything ended up on
stable storage, but if it does, programs see either the old data or the
new data.
Yup, that makes sense to me.
Perhaps this ^^ is what we should be documenting here.

Maybe we should rename the whole UAPI s/atomic/untorn/...
Perhaps, though "torn writes" is nomenclature that nobody outside
storage and filesystem developers really knows about. All I ever
hear from userspace developers is "we want atomic/all-or-nothing
data writes"...
Fair 'enuf.


Thanks,
John