On Mon, Oct 09, 2023 at 10:44:38AM -0700, Darrick J. Wong wrote: > On Fri, Sep 29, 2023 at 09:37:15AM +0000, John Garry wrote: > > From: Himanshu Madhani <himanshu.madhani@xxxxxxxxxx> > > > > Add RWF_ATOMIC flag description for pwritev2(). > > > > Signed-off-by: Himanshu Madhani <himanshu.madhani@xxxxxxxxxx> > > #jpg: complete rewrite > > Signed-off-by: John Garry <john.g.garry@xxxxxxxxxx> > > --- > > man2/readv.2 | 45 +++++++++++++++++++++++++++++++++++++++++++++ > > 1 file changed, 45 insertions(+) .... > > +For when regular files are opened with > > +.BR open (2) > > +but without > > +.B O_SYNC > > +or > > +.B O_DSYNC > > +and the > > +.BR pwritev2() > > +call is made without > > +.B RWF_SYNC > > +or > > +.BR RWF_DSYNC > > +set, the range metadata must already be flushed to storage and the data range > > +must not be in unwritten state, shared, a preallocation, or a hole. > > I think that we can drop all of these flags requirements, since the > contiguous small space allocation requirement means that the fs can > provide all-or-nothing writes even if metadata updates are needed: > > If the file range is allocated and marked unwritten (i.e. a > preallocation), the ioend will clear the unwritten bit from the file > mapping atomically. After a crash, the application sees either zeroes > or all the data that was written. > > If the file range is shared, the ioend will map the COW staging extent > into the file atomically. After a crash, the application sees either > the old contents from the old blocks, or the new contents from the new > blocks. > > If the file range is a sparse hole, the directio setup will allocate > space and create an unwritten mapping before issuing the write bio. The > rest of the process works the same as preallocations and has the same > behaviors. > > If the file range is allocated and was previously written, the write is > issued and that's all that's needed from the fs. After a crash, reads > of the storage device produce the old contents or the new contents. This is exactly what I explained when reviewing the code that rejected RWF_ATOMIC without O_DSYNC on metadata dirty inodes. > Summarizing: > > An (ATOMIC|SYNC) request provides the strongest guarantees (data > will not be torn, and all file metadata updates are persisted before > the write is returned to userspace. Programs see either the old data or > the new data, even if there's a crash. > > (ATOMIC|DSYNC) is less strong -- data will not be torn, and any file > updates for just that region are persisted before the write is returned. > > (ATOMIC) is the least strong -- data will not be torn. Neither the > filesystem nor the device make guarantees that anything ended up on > stable storage, but if it does, programs see either the old data or the > new data. Yup, that makes sense to me. > Maybe we should rename the whole UAPI s/atomic/untorn/... Perhaps, though "torn writes" is nomenclature that nobody outside storage and filesystem developers really knows about. All I ever hear from userspace developers is "we want atomic/all-or-nothing data writes"... -Dave. -- Dave Chinner david@xxxxxxxxxxxxx