Re: [PATCHSET v29.4 03/13] xfs: atomic file content exchanges

Jeff Layton <jlayton@xxxxxxxxxx> · Fri, 01 Mar 2024 08:16:44 -0500

On Tue, 2024-02-27 at 08:06 -0800, Darrick J. Wong wrote:
> On Tue, Feb 27, 2024 at 05:53:46AM -0500, Jeff Layton wrote:
> > On Tue, 2024-02-27 at 11:23 +0200, Amir Goldstein wrote:
> > > On Tue, Feb 27, 2024 at 4:18 AM Darrick J. Wong <djwong@xxxxxxxxxx> wrote:
> > > > 
> > > > Hi all,
> > > > 
> > > > This series creates a new FIEXCHANGE_RANGE system call to exchange
> > > > ranges of bytes between two files atomically.  This new functionality
> > > > enables data storage programs to stage and commit file updates such that
> > > > reader programs will see either the old contents or the new contents in
> > > > their entirety, with no chance of torn writes.  A successful call
> > > > completion guarantees that the new contents will be seen even if the
> > > > system fails.
> > > > 
> > > > The ability to exchange file fork mappings between files in this manner
> > > > is critical to supporting online filesystem repair, which is built upon
> > > > the strategy of constructing a clean copy of a damaged structure and
> > > > committing the new structure into the metadata file atomically.
> > > > 
> > > > User programs will be able to update files atomically by opening an
> > > > O_TMPFILE, reflinking the source file to it, making whatever updates
> > > > they want to make, and exchanging the relevant ranges of the temp file
> > > > with the original file.  If the updates are aligned with the file block
> > > > size, a new (since v2) flag provides for exchanging only the written
> > > > areas.  Callers can arrange for the update to be rejected if the
> > > > original file has been changed.
> > > > 
> > > > The intent behind this new userspace functionality is to enable atomic
> > > > rewrites of arbitrary parts of individual files.  For years, application
> > > > programmers wanting to ensure the atomicity of a file update had to
> > > > write the changes to a new file in the same directory, fsync the new
> > > > file, rename the new file on top of the old filename, and then fsync the
> > > > directory.  People get it wrong all the time, and $fs hacks abound.
> > > > Here are the proposed manual pages:
> > > > 
> > 
> > This is a cool idea!  I've had some handwavy ideas about making a gated
> > write() syscall (i.e. only write if the change cookie hasn't changed),
> > but something like this may be a simpler lift initially.
> 
> How /does/ userspace get at the change cookie nowadays?
> 

Today, it doesn't. That would need to be exposed before we could make
that work.

> > > > IOCTL-XFS-EXCHANGE-RANGE(2) System Calls Manual IOCTL-XFS-EXCHANGE-RANGE(2)
> > > > 
> > > > NAME
> > > >        ioctl_xfs_exchange_range  -  exchange  the contents of parts of
> > > >        two files
> > > > 
> > > > SYNOPSIS
> > > >        #include <sys/ioctl.h>
> > > >        #include <xfs/xfs_fs_staging.h>
> > > > 
> > > >        int   ioctl(int   file2_fd,   XFS_IOC_EXCHANGE_RANGE,    struct
> > > >        xfs_exch_range *arg);
> > > > 
> > > > DESCRIPTION
> > > >        Given  a  range  of bytes in a first file file1_fd and a second
> > > >        range of bytes in a second file  file2_fd,  this  ioctl(2)  ex‐
> > > >        changes the contents of the two ranges.
> > > > 
> > > >        Exchanges  are  atomic  with  regards to concurrent file opera‐
> > > >        tions, so no userspace-level locks need to be taken  to  obtain
> > > >        consistent  results.  Implementations must guarantee that read‐
> > > >        ers see either the old contents or the new  contents  in  their
> > > >        entirety, even if the system fails.
> > > > 
> > > >        The  system  call  parameters are conveyed in structures of the
> > > >        following form:
> > > > 
> > > >            struct xfs_exch_range {
> > > >                __s64    file1_fd;
> > > >                __s64    file1_offset;
> > > >                __s64    file2_offset;
> > > >                __s64    length;
> > > >                __u64    flags;
> > > > 
> > > >                __u64    pad;
> > > >            };
> > > > 
> > > >        The field pad must be zero.
> > > > 
> > > >        The fields file1_fd, file1_offset, and length define the  first
> > > >        range of bytes to be exchanged.
> > > > 
> > > >        The fields file2_fd, file2_offset, and length define the second
> > > >        range of bytes to be exchanged.
> > > > 
> > > >        Both files must be from the same filesystem mount.  If the  two
> > > >        file  descriptors represent the same file, the byte ranges must
> > > >        not overlap.  Most  disk-based  filesystems  require  that  the
> > > >        starts  of  both ranges must be aligned to the file block size.
> > > >        If this is the case, the ends of the ranges  must  also  be  so
> > > >        aligned unless the XFS_EXCHRANGE_TO_EOF flag is set.
> > > > 
> > > >        The field flags controls the behavior of the exchange operation.
> > > > 
> > > >            XFS_EXCHRANGE_TO_EOF
> > > >                   Ignore  the length parameter.  All bytes in file1_fd
> > > >                   from file1_offset to EOF are moved to file2_fd,  and
> > > >                   file2's  size is set to (file2_offset+(file1_length-
> > > >                   file1_offset)).  Meanwhile, all bytes in file2  from
> > > >                   file2_offset  to  EOF are moved to file1 and file1's
> > > >                   size   is   set   to    (file1_offset+(file2_length-
> > > >                   file2_offset)).
> > > > 
> > > >            XFS_EXCHRANGE_DSYNC
> > > >                   Ensure  that  all modified in-core data in both file
> > > >                   ranges and all metadata updates  pertaining  to  the
> > > >                   exchange operation are flushed to persistent storage
> > > >                   before the call returns.  Opening  either  file  de‐
> > > >                   scriptor  with  O_SYNC or O_DSYNC will have the same
> > > >                   effect.
> > > > 
> > > >            XFS_EXCHRANGE_FILE1_WRITTEN
> > > >                   Only exchange sub-ranges of file1_fd that are  known
> > > >                   to  contain  data  written  by application software.
> > > >                   Each sub-range may be  expanded  (both  upwards  and
> > > >                   downwards)  to  align with the file allocation unit.
> > > >                   For files on the data device, this is one filesystem
> > > >                   block.   For  files  on the realtime device, this is
> > > >                   the realtime extent size.  This facility can be used
> > > >                   to  implement  fast  atomic scatter-gather writes of
> > > >                   any complexity for software-defined storage  targets
> > > >                   if  all  writes  are  aligned to the file allocation
> > > >                   unit.
> > > > 
> > > >            XFS_EXCHRANGE_DRY_RUN
> > > >                   Check the parameters and the feasibility of the  op‐
> > > >                   eration, but do not change anything.
> > > > 
> > > > RETURN VALUE
> > > >        On  error, -1 is returned, and errno is set to indicate the er‐
> > > >        ror.
> > > > 
> > > > ERRORS
> > > >        Error codes can be one of, but are not limited to, the  follow‐
> > > >        ing:
> > > > 
> > > >        EBADF  file1_fd  is not open for reading and writing or is open
> > > >               for append-only writes; or  file2_fd  is  not  open  for
> > > >               reading and writing or is open for append-only writes.
> > > > 
> > > >        EINVAL The  parameters  are  not correct for these files.  This
> > > >               error can also appear if either file  descriptor  repre‐
> > > >               sents  a device, FIFO, or socket.  Disk filesystems gen‐
> > > >               erally require the offset and  length  arguments  to  be
> > > >               aligned to the fundamental block sizes of both files.
> > > > 
> > > >        EIO    An I/O error occurred.
> > > > 
> > > >        EISDIR One of the files is a directory.
> > > > 
> > > >        ENOMEM The  kernel  was unable to allocate sufficient memory to
> > > >               perform the operation.
> > > > 
> > > >        ENOSPC There is not enough free space in the filesystem to
> > > >               exchange the contents safely.
> > > > 
> > > >        EOPNOTSUPP
> > > >               The filesystem does not support exchanging bytes between
> > > >               the two files.
> > > > 
> > > >        EPERM  file1_fd or file2_fd are immutable.
> > > > 
> > > >        ETXTBSY
> > > >               One of the files is a swap file.
> > > > 
> > > >        EUCLEAN
> > > >               The filesystem is corrupt.
> > > > 
> > > >        EXDEV  file1_fd and  file2_fd  are  not  on  the  same  mounted
> > > >               filesystem.
> > > > 
> > > > CONFORMING TO
> > > >        This API is XFS-specific.
> > > > 
> > > > USE CASES
> > > >        Several  use  cases  are imagined for this system call.  In all
> > > >        cases, application software must coordinate updates to the file
> > > >        because the exchange is performed unconditionally.
> > > > 
> > > >        The  first  is a data storage program that wants to commit non-
> > > >        contiguous updates to a file atomically and  coordinates  write
> > > >        access  to that file.  This can be done by creating a temporary
> > > >        file, calling FICLONE(2) to share the contents, and staging the
> > > >        updates into the temporary file.  The XFS_EXCHRANGE_TO_EOF flag
> > > >        is recommended for this purpose.  The temporary file can be
> > > >        deleted or punched out afterwards.
> > > > 
> > > >        An example program might look like this:
> > > > 
> > > >            int fd = open("/some/file", O_RDWR);
> > > >            int temp_fd = open("/some", O_TMPFILE | O_RDWR);
> > > > 
> > > >            ioctl(temp_fd, FICLONE, fd);
> > > > 
> > > >            /* append 1MB of records */
> > > >            lseek(temp_fd, 0, SEEK_END);
> > > >            write(temp_fd, data1, 1000000);
> > > > 
> > > >            /* update record index */
> > > >            pwrite(temp_fd, data1, 600, 98765);
> > > >            pwrite(temp_fd, data2, 320, 54321);
> > > >            pwrite(temp_fd, data2, 15, 0);
> > > > 
> > > >            /* commit the entire update */
> > > >            struct xfs_exch_range args = {
> > > >                .file1_fd = temp_fd,
> > > >                .flags = XFS_EXCHRANGE_TO_EOF,
> > > >            };
> > > > 
> > > >            ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);
> > > > 
> > > >        The  second  is  a  software-defined  storage host (e.g. a disk
> > > >        jukebox) which implements an atomic scatter-gather  write  com‐
> > > >        mand.   Provided the exported disk's logical block size matches
> > > >        the file's allocation unit size, this can be done by creating a
> > > >        temporary file and writing the data at the appropriate offsets.
> > > >        It is recommended that the temporary file be truncated  to  the
> > > >        size  of  the  regular file before any writes are staged to the
> > > >        temporary file to avoid issues with zeroing during  EOF  exten‐
> > > >        sion.   Use  this  call with the FILE1_WRITTEN flag to exchange
> > > >        only the file allocation units involved  in  the  emulated  de‐
> > > >        vice's  write  command.  The temporary file should be truncated
> > > >        or punched out completely before being reused to stage  another
> > > >        write.
> > > > 
> > > >        An example program might look like this:
> > > > 
> > > >            int fd = open("/some/file", O_RDWR);
> > > >            int temp_fd = open("/some", O_TMPFILE | O_RDWR);
> > > >            struct stat sb;
> > > >            int blksz;
> > > > 
> > > >            fstat(fd, &sb);
> > > >            blksz = sb.st_blksize;
> > > > 
> > > >            /* land scatter gather writes between 100fsb and 500fsb */
> > > >            pwrite(temp_fd, data1, blksz * 2, blksz * 100);
> > > >            pwrite(temp_fd, data2, blksz * 20, blksz * 480);
> > > >            pwrite(temp_fd, data3, blksz * 7, blksz * 257);
> > > > 
> > > >            /* commit the entire update */
> > > >            struct xfs_exch_range args = {
> > > >                .file1_fd = temp_fd,
> > > >                .file1_offset = blksz * 100,
> > > >                .file2_offset = blksz * 100,
> > > >                .length       = blksz * 400,
> > > >                .flags        = XFS_EXCHRANGE_FILE1_WRITTEN |
> > > >                                XFS_EXCHRANGE_DSYNC,
> > > >            };
> > > > 
> > > >            ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);
> > > > 
> > > > NOTES
> > > >        Some  filesystems may limit the amount of data or the number of
> > > >        extents that can be exchanged in a single call.
> > > > 
> > > > SEE ALSO
> > > >        ioctl(2)
> > > > 
> > > > XFS                           2024-02-10   IOCTL-XFS-EXCHANGE-RANGE(2)
> > > > IOCTL-XFS-COMMIT-RANGE(2) System Calls Manual IOCTL-XFS-COMMIT-RANGE(2)
> > > > 
> > > > NAME
> > > >        ioctl_xfs_commit_range - conditionally exchange the contents of
> > > >        parts of two files
> > > > 
> > > > SYNOPSIS
> > > >        #include <sys/ioctl.h>
> > > >        #include <xfs/xfs_fs_staging.h>
> > > > 
> > > >        int ioctl(int file2_fd, XFS_IOC_COMMIT_RANGE,  struct  xfs_com‐
> > > >        mit_range *arg);
> > > > 
> > > > DESCRIPTION
> > > >        Given  a  range  of bytes in a first file file1_fd and a second
> > > >        range of bytes in a second file  file2_fd,  this  ioctl(2)  ex‐
> > > >        changes  the contents of the two ranges if file2_fd passes cer‐
> > > >        tain freshness criteria.
> > > > 
> > > >        After locking both files but before  exchanging  the  contents,
> > > >        the  supplied  file2_ino field must match file2_fd's inode num‐
> > > >        ber,   and   the   supplied   file2_mtime,    file2_mtime_nsec,
> > > >        file2_ctime, and file2_ctime_nsec fields must match the modifi‐
> > > >        cation time and change time of file2.  If they  do  not  match,
> > > >        EBUSY will be returned.
> > > > 
> > > 
> > > Maybe a stupid question, but under which circumstances would mtime
> > > change and ctime not change? Why are both needed?
> > > 
> > 
> > ctime should always change if the mtime does. An mtime update means that
> > the metadata was updated, so you also need to update the ctime. 
> 
> Exactly. :)
> 
> > > And for a new API, wouldn't it be better to use change_cookie (a.k.a i_version)?
> > > Even if this API is designed to be hoisted out of XFS at some future time,
> > > is there a real need to support it on filesystems that do not support
> > > i_version(?)
> > > 
> > > Not to mention the fact that POSIX does not explicitly define how ctime should
> > > behave with changes to fiemap (uninitialized extent and all), so who knows
> > > how other filesystems may update ctime in those cases.
> > > 
> > > I realize that STATX_CHANGE_COOKIE is currently kernel internal, but
> > > it seems that XFS_IOC_EXCHANGE_RANGE is a case where userspace
> > > really explicitly requests a bump of i_version on the next change.
> > > 
> > 
> > 
> > I agree. Using an opaque change cookie would be a lot nicer from an API
> > standpoint, and shouldn't be subject to timestamp granularity issues.
> 
> TLDR: No.
> 
> > That said, XFS's change cookie is currently broken. Dave C. said he had
> > some patches in progress to fix that however.
> 
> Dave says that about a lot of things.  I'm not willing to delay the
> online fsck project _even further_ for a bunch of vaporware that's not
> even out on linux-xfs for review.
> 
> The difference in opinion between xfs and the rest of the kernel about
> i_version is 50% of why I didn't include it here.  The other 50% is the
> part where userspace can't access it, because I do not want to saddle my
> mostly internal project with YET ANOTHER ASK FROM RH PEOPLE FOR CORE
> CHANGES.

Ouch, point taken.

I just have grave concerns about using something as coarse-grained as
timestamps to gate changes to a file. With modern machines, a single
timestamp can represent a large number of different states of the file's
contents.

Is that not a problem here?
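
To make the concern concrete, here is a rough userspace sketch of a
freshness check in the style the quoted man page describes (inode number
plus mtime and ctime must match). It is not the kernel code; the names
fresh_snapshot, fresh_sample, fresh_matches and
same_tick_change_goes_unnoticed are made up for illustration, and the
stat buffers are synthetic rather than read from real files. The point is
only that a second change stamped with the same timestamp value is
invisible to such a comparison:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>

/* Hypothetical snapshot of the fields a caller would sample from file2. */
struct fresh_snapshot {
    ino_t           ino;
    struct timespec mtim;
    struct timespec ctim;
};

static bool ts_equal(const struct timespec *a, const struct timespec *b)
{
    return a->tv_sec == b->tv_sec && a->tv_nsec == b->tv_nsec;
}

/* Record inode number, mtime and ctime from a stat buffer. */
static void fresh_sample(const struct stat *sb, struct fresh_snapshot *snap)
{
    snap->ino  = sb->st_ino;
    snap->mtim = sb->st_mtim;
    snap->ctim = sb->st_ctim;
}

/* True if inode number, mtime and ctime all still match the snapshot. */
static bool fresh_matches(const struct fresh_snapshot *snap,
                          const struct stat *sb)
{
    return snap->ino == sb->st_ino &&
           ts_equal(&snap->mtim, &sb->st_mtim) &&
           ts_equal(&snap->ctim, &sb->st_ctim);
}

/*
 * Synthetic demonstration, no real files are touched.  Returns true when
 * a modification that did not move the timestamps passes the check.
 */
static bool same_tick_change_goes_unnoticed(void)
{
    struct stat before = {0}, after;
    struct fresh_snapshot snap;

    before.st_ino = 42;
    before.st_size = 4096;
    before.st_mtim.tv_sec = 1709298000;
    before.st_mtim.tv_nsec = 4000000;
    before.st_ctim = before.st_mtim;
    fresh_sample(&before, &snap);
    assert(fresh_matches(&snap, &before));

    /* Assumed scenario: another write is stamped with the same time. */
    after = before;
    after.st_size = 8192;

    return fresh_matches(&snap, &after);
}
```

An opaque counter that is bumped on every change would not have this
blind spot, which is why I keep coming back to the change cookie.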
-- 
Jeff Layton <jlayton@xxxxxxxxxx>