Re: [PATCHSET v29.4 03/13] xfs: atomic file content exchanges

"Darrick J. Wong" <djwong@xxxxxxxxxx> · Thu, 29 Feb 2024 12:18:40 -0800

On Tue, Feb 27, 2024 at 08:50:20PM -0500, Colin Walters wrote:
> 
> 
> On Mon, Feb 26, 2024, at 9:18 PM, Darrick J. Wong wrote:
> > Hi all,
> >
> > This series creates a new FIEXCHANGE_RANGE system call to exchange
> > ranges of bytes between two files atomically.  This new functionality
> > enables data storage programs to stage and commit file updates such that
> > reader programs will see either the old contents or the new contents in
> > their entirety, with no chance of torn writes.  A successful call
> > completion guarantees that the new contents will be seen even if the
> > system fails.
> >
> > The ability to exchange file fork mappings between files in this manner
> > is critical to supporting online filesystem repair, which is built upon
> > the strategy of constructing a clean copy of a damaged structure and
> > committing the new structure into the metadata file atomically.
> >
> > User programs will be able to update files atomically by opening an
> > O_TMPFILE, reflinking the source file to it, making whatever updates
> > they want to make, and exchange the relevant ranges of the temp file
> > with the original file. 
> 
> It's probably worth noting that the "reflinking the source file" here
> is optional, right?  IOW one can just:
> 
> - open(O_TMPFILE)
> - write()
> - ioctl(FIEXCHANGE_RANGE)

If the write() rewrites the entire file, then yes, that'll also work.

> I suspect the "simpler" non-database cases (think e.g. editors
> operating on plain text files) are going to be operating on an
> in-memory copy; in theory of course we could identify common ranges
> and reflink, but it's not clear to me it's really worth it at the
> tiny scale most source files are.

Correct, there's no built-in dedupe.  For small files you'll probably
end up with a single allocation anyway, which is ideal in terms of
ondisk metadata overhead.

One advantage that EXCHANGE_RANGE has over the rename dance is that the
calling application doesn't have to copy all the file attributes and
xattrs to the temporary file before the switch.

> > The intent behind this new userspace functionality is to enable atomic
> > rewrites of arbitrary parts of individual files.  For years, application
> > programmers wanting to ensure the atomicity of a file update had to
> > write the changes to a new file in the same directory
> 
> More sophisticated tools already are using O_TMPFILE I would say,
> just with a final last step of materializing it with a name,
> and then rename() into place.  So if this also
> obviates the need for
> https://lore.kernel.org/linux-fsdevel/364531.1579265357@xxxxxxxxxxxxxxxxxxxxxx/
> that seems good.

It would, though I would bet that extending linkat (or rename, or
whatever) is going to be the only workable solution for old / simple
filesystems (e.g. fat32).

> >        Exchanges  are  atomic  with  regards to concurrent file opera‐
> >        tions, so no userspace-level locks need to be taken  to  obtain
> >        consistent  results.  Implementations must guarantee that read‐
> >        ers see either the old contents or the new  contents  in  their
> >        entirety, even if the system fails.
> 
> But given that we're reusing the same inode, I don't think that can
> *really* be true...at least, not without higher level serialization.

Higher level coordination is required, yes.  It doesn't have to be
serialization, though.  The committing thread could signal all the other
readers that they should invalidate and restart whatever they're working
on if that work depends on the file that was COMMIT_RANGE'd.  The
readers could detect unexpected data and resample mtime of the files
they've read and restart if it's changed.

> A classic case today is dconf in GNOME is a basic memory-mapped
> database file that is atomically replaced by the "create new file,
> rename into place" model.  Clients with mmap() view just see the old
> data until they reload explicitly.  But with this, clients with mmap'd
> view *will* immediately see the new contents (because it's the same
> inode, right?)

Correct, they'll start seeing the new contents as soon as they access
the affected pages.

How /does/ dconf handle those changes?  Does it rename the file and
signal all the other dconf threads to reopen the file?  And then those
threads get the new file contents?

>                and that's just going to lead to possibly split reads
> and undefined behavior - without extra userspace serialization or
> locking (that more proper databases) are going to be doing.

Huurrrh hurrrh.  That's right, I don't see how exchange can mesh well
with mmap without actual flock()ing. :(

fsnotify will send a message out to userspace after the exchange
finishes, which means that userspace could watch for the notifications
via fanotify.  However, that's still a bit racy... :/

> Arguably of course, dconf is too simple and more sophisticated tools
> like sqlite or LMDB could make use of this.  (There's some special
> atomic write that got added to f2fs for sqlite last I saw...I'm
> curious if this could replace it)

I think so:

F2FS_IOC_START_ATOMIC_WRITE -> XFS_IOC_START_COMMIT,
F2FS_IOC_COMMIT_ATOMIC_WRITE -> XFS_IOC_COMMIT_RANGE, and
F2FS_IOC_ABORT_VOLATILE_WRITE merely turns into close(temp_fd);

> But still...it seems to me like there's going to be quite a lot of the
> "potentially concurrent reader, atomic replace desired" pattern and
> since this can't replace that, we should call that out explicitly in
> the man page.  And also if so, then there's still a need for the
> linkat(AT_REPLACE) etc.

Hmm, I think I'll shrink that paragraph of the manpage:

"Exchanges are atomic with regards to concurrent file operations.
Implementations must guarantee that readers see either the old contents
or the new contents in their entirety, even if the system fails."

> 
> >            XFS_EXCHRANGE_TO_EOF
> 
> I kept reading this as some sort of typo...would it really be too
> onerous to spell it out as XFS_EXCHANGE_RANGE_TO_EOF e.g.?  Echoes of
> unix "creat" here =)

Yeah, I've expanded that to XFS_EXCHANGE_RANGE_TO_EOF for v29.5.

--D