Re: [PATCH v1 9/8] copy_file_range.2: New page documenting copy_file_range()

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Sep 09, 2015 at 01:31:24PM -0400, Anna Schumaker wrote:
> On 09/09/2015 01:17 PM, Darrick J. Wong wrote:
> > On Wed, Sep 09, 2015 at 07:38:14AM -0400, Austin S Hemmelgarn wrote:
> >> On 2015-09-08 16:39, Darrick J. Wong wrote:
> >>> On Tue, Sep 08, 2015 at 11:04:03AM -0400, Anna Schumaker wrote:
> >>>> On 09/04/2015 05:38 PM, Darrick J. Wong wrote:
> >>>>> On Fri, Sep 04, 2015 at 04:17:03PM -0400, Anna Schumaker wrote:
> >>>>>> copy_file_range() is a new system call for copying ranges of data
> >>>>>> completely in the kernel.  This gives filesystems an opportunity to
> >>>>>> implement some kind of "copy acceleration", such as reflinks or
> >>>>>> server-side-copy (in the case of NFS).
> >>>>>>
> >>>>>> Signed-off-by: Anna Schumaker <Anna.Schumaker@xxxxxxxxxx>
> >>>>>> ---
> >>>>>>  man2/copy_file_range.2 | 168 +++++++++++++++++++++++++++++++++++++++++++++++++
> >>>>>>  1 file changed, 168 insertions(+)
> >>>>>>  create mode 100644 man2/copy_file_range.2
> >>>>>>
> >>>>>> diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2
> >>>>>> new file mode 100644
> >>>>>> index 0000000..4a4cb73
> >>>>>> --- /dev/null
> >>>>>> +++ b/man2/copy_file_range.2
> >>>>>> @@ -0,0 +1,168 @@
> >>>>>> +.\"This manpage is Copyright (C) 2015 Anna Schumaker <Anna.Schumaker@xxxxxxxxxx>
> >>>>>> +.TH COPY 2 2015-8-31 "Linux" "Linux Programmer's Manual"
> >>>>>> +.SH NAME
> >>>>>> +copy_file_range \- Copy a range of data from one file to another
> >>>>>> +.SH SYNOPSIS
> >>>>>> +.nf
> >>>>>> +.B #include <linux/copy.h>
> >>>>>> +.B #include <sys/syscall.h>
> >>>>>> +.B #include <unistd.h>
> >>>>>> +
> >>>>>> +.BI "ssize_t syscall(__NR_copy_file_range, int " fd_in ", loff_t * " off_in ",
> >>>>>> +.BI "                int " fd_out ", loff_t * " off_out ", size_t " len ",
> >>>>>> +.BI "                unsigned int " flags );
> >>>>>> +.fi
> >>>>>> +.SH DESCRIPTION
> >>>>>> +The
> >>>>>> +.BR copy_file_range ()
> >>>>>> +system call performs an in-kernel copy between two file descriptors
> >>>>>> +without all that tedious mucking about in userspace.
> >>>>>
> >>>>> ;)
> >>>>>
> >>>>>> +It copies up to
> >>>>>> +.I len
> >>>>>> +bytes of data from file descriptor
> >>>>>> +.I fd_in
> >>>>>> +to file descriptor
> >>>>>> +.I fd_out
> >>>>>> +at
> >>>>>> +.IR off_out .
> >>>>>> +The file descriptors must not refer to the same file.
> >>>>>
> >>>>> Why?  btrfs (and XFS) reflink can handle the case of a file sharing blocks
> >>>>> with itself.
> >>>>
> >>>> I've never really thought about it... Zach had that in his initial
> >>>> submission, so mentioned it in the man page.  Should I remove that bit?
> >>>
> >>> Yes, please!
> >>>
> >>> I could be wrong, but I think btrfs only started supporting files that share
> >>> blocks with themselves relatively recently(?)
> >>>
> >>> I'm not sure why zab added this; was hoping he'd speak up. ;)
> >>>
> >>>>
> >>>>>
> >>>>>> +
> >>>>>> +The following semantics apply for
> >>>>>> +.IR fd_in ,
> >>>>>> +and similar statements apply to
> >>>>>> +.IR off_out :
> >>>>>> +.IP * 3
> >>>>>> +If
> >>>>>> +.I off_in
> >>>>>> +is NULL, then bytes are read from
> >>>>>> +.I fd_in
> >>>>>> +starting from the current file offset and the current
> >>>>>> +file offset is adjusted appropriately.
> >>>>>> +.IP *
> >>>>>> +If
> >>>>>> +.I off_in
> >>>>>> +is not NULL, then
> >>>>>> +.I off_in
> >>>>>> +must point to a buffer that specifies the starting
> >>>>>> +offset where bytes from
> >>>>>> +.I fd_in
> >>>>>> +will be read.  The current file offset of
> >>>>>> +.I fd_in
> >>>>>> +is not changed, but
> >>>>>> +.I off_in
> >>>>>> +is adjusted appropriately.
> >>>>>> +.PP
> >>>>>> +The default behavior of
> >>>>>> +.BR copy_file_range ()
> >>>>>> +is filesystem specific, and might result in creating a
> >>>>>> +copy-on-write reflink.
> >>>>>> +In the event that a given filesystem does not implement
> >>>>>> +any form of copy acceleration, the kernel will perform
> >>>>>> +a deep copy of the requested range by reading bytes from
> >>>>>
> >>>>> I wonder if it's wise to allow deep copies -- what happens if len == 1T?
> >>>>> Will this syscall just block for a really long time?
> >>>>
> >>>> We use rw_verify_area(), (similar to read and write) so we won't allow a
> >>>> value of len that long.  I can mention this in an updated version of this man
> >>>> page!
> >>>
> >>> Ok.  I guess MAX_RW_COUNT limits us to about 4G at once, which for a splice
> > 
> > Heh, INT_MAX, so 2GB at once.
> > 
> >>> copy is probably reasonable.
> >>>
> >>> The reason why I asked about len == 1T specifically is that I can (with
> >>> somewhat long delays) reflink about 260 million extents at a time on XFS,
> >>> which is about 1TB.  Given that locks get held for the duration, it's probably
> >>> not a bad thing to limit userspace to 4G at a time.
> >>
> >> I'd personally love to see that be tunable by a sysctl (kind of like
> >> how you can control the maximum number of AIO requests in flight),
> >> and for that matter we might want to be able to limit the number of
> >> in-progress copies going on.
> > 
> > Now that I think about it, btrfs' reflink ioctl doesn't seem to have any
> > particular limit on how much you can reflink in a single call.  XFS doesn't
> > have a limit either.  Given that reflink should create a tiny amount of IO
> > compared to the number of bytes being manipulated, should we allow a higher
> > limit when ssize_t is large enough?
> > 
> > Copy-through-the-pagecache should stick to MAX_RW_COUNT.
> 
> Should I keep rejecting pagecache copies if len > MAX_RW_COUNT?  Or would it
> be okay to change the value of len to MAX_RW_COUNT in this case?

OH.  Heh.

rw_verify_area returns either an error code or a len that's been clamped to
MAX_RW_COUNT.  However, the syscall code only checks for errors, and otherwise
ignores the clamp.  So I guess the length has never been clamped.

Since the syscall returns ssize_t, I think it's fine to keep around the return
value from rw_verify_area and use it to clamp len if we have to fall back on
pagecache copy.  Otherwise we'll let each FS' copy routine decide its maximum.

--D

> 
> Anna
> 
> > 
> > I noticed that btrfs won't dedupe more than 16M per call.  Any thoughts?
> > 
> > --D
> > 
> >>>
> >>> (But hey, it's fun to stress-test once in a while. :))
> >>>
> >>> --D
> >>>
> >>>>
> >>>>
> >>>>>
> >>>>>> +.I fd_in
> >>>>>> +and writing them to
> >>>>>> +.IR fd_out .
> >>>>>
> >>>>> "...if COPY_REFLINK is not set in flags."
> >>>>
> >>>> Sure.
> >>>>
> >>>>>
> >>>>>> +
> >>>>>> +Currently, Linux only supports the following flag:
> >>>>>> +.TP 1.9i
> >>>>>> +.B COPY_REFLINK
> >>>>>> +Only perform the copy if the filesystem can do it as a reflink.
> >>>>>> +Do not fall back on performing a deep copy.
> >>>>>> +.SH RETURN VALUE
> >>>>>> +Upon successful completion,
> >>>>>> +.BR copy_file_range ()
> >>>>>> +will return the number of bytes copied between files.
> >>>>>> +This could be less than the length originally requested.
> >>>>>> +
> >>>>>> +On error,
> >>>>>> +.BR copy_file_range ()
> >>>>>> +returns \-1 and
> >>>>>> +.I errno
> >>>>>> +is set to indicate the error.
> >>>>>> +.SH ERRORS
> >>>>>> +.TP
> >>>>>> +.B EBADF
> >>>>>> +One or more file descriptors are not valid,
> >>>>>> +or do not have proper read-write mode.
> >>>>>
> >>>>> "or fd_out is not opened for writing"?
> >>>>
> >>>> I'll add that.
> >>>>
> >>>>>
> >>>>>> +.TP
> >>>>>> +.B EINVAL
> >>>>>> +Requested range extends beyond the end of the file;
> >>>>>> +.I flags
> >>>>>> +argument is set to an invalid value.
> >>>>>> +.TP
> >>>>>> +.B EOPNOTSUPP
> >>>>>> +.B COPY_REFLINK
> >>>>>> +was specified in
> >>>>>> +.IR flags ,
> >>>>>> +but the target filesystem does not support reflinks.
> >>>>>> +.TP
> >>>>>> +.B EXDEV
> >>>>>> +Target filesystem doesn't support cross-filesystem copies.
> >>>>>> +.SH VERSIONS
> >>>>>
> >>>>> Perhaps this ought to list a few more errors (EIO, ENOSPC, ENOSYS, EPERM...)
> >>>>> that can be returned?  (I was looking at the fallocate manpage.)
> >>>>
> >>>> Okay.  I'll poke around for what else could be returned!
> >>>>
> >>>> Thanks,
> >>>> Anna
> >>>>
> >>>>>
> >>>>> --D
> >>>>>
> >>>>>> +The
> >>>>>> +.BR copy_file_range ()
> >>>>>> +system call first appeared in Linux 4.3.
> >>>>>> +.SH CONFORMING TO
> >>>>>> +The
> >>>>>> +.BR copy_file_range ()
> >>>>>> +system call is a nonstandard Linux extension.
> >>>>>> +.SH EXAMPLE
> >>>>>> +.nf
> >>>>>> +
> >>>>>> +#define _GNU_SOURCE
> >>>>>> +#include <fcntl.h>
> >>>>>> +#include <linux/copy.h>
> >>>>>> +#include <stdio.h>
> >>>>>> +#include <stdlib.h>
> >>>>>> +#include <sys/stat.h>
> >>>>>> +#include <sys/syscall.h>
> >>>>>> +#include <unistd.h>
> >>>>>> +
> >>>>>> +
> >>>>>> +int main(int argc, char **argv)
> >>>>>> +{
> >>>>>> +    int fd_in, fd_out;
> >>>>>> +    struct stat stat;
> >>>>>> +    loff_t len, ret;
> >>>>>> +
> >>>>>> +    if (argc != 3) {
> >>>>>> +        fprintf(stderr, "Usage: %s <pathname> <pathname>\n", argv[0]);
> >>>>>> +        exit(EXIT_FAILURE);
> >>>>>> +    }
> >>>>>> +
> >>>>>> +    fd_in = open(argv[1], O_RDONLY);
> >>>>>> +    if (fd_in == -1) {
> >>>>>> +        perror("open (argv[1])");
> >>>>>> +        exit(EXIT_FAILURE);
> >>>>>> +    }
> >>>>>> +
> >>>>>> +    if (fstat(fd_in, &stat) == -1) {
> >>>>>> +        perror("fstat");
> >>>>>> +        exit(EXIT_FAILURE);
> >>>>>> +    }
> >>>>>> +    len = stat.st_size;
> >>>>>> +
> >>>>>> +    fd_out = open(argv[2], O_WRONLY | O_CREAT, 0644);
> >>>>>> +    if (fd_out == -1) {
> >>>>>> +        perror("open (argv[2])");
> >>>>>> +        exit(EXIT_FAILURE);
> >>>>>> +    }
> >>>>>> +
> >>>>>> +    do {
> >>>>>> +        ret = syscall(__NR_copy_file_range, fd_in, NULL,
> >>>>>> +                      fd_out, NULL, len, 0);
> >>>>>> +        if (ret == -1) {
> >>>>>> +            perror("copy_file_range");
> >>>>>> +            exit(EXIT_FAILURE);
> >>>>>> +        }
> >>>>>> +
> >>>>>> +        len -= ret;
> >>>>>> +    } while (len > 0);
> >>>>>> +
> >>>>>> +    close(fd_in);
> >>>>>> +    close(fd_out);
> >>>>>> +    exit(EXIT_SUCCESS);
> >>>>>> +}
> >>>>>> +.fi
> >>>>>> +.SH SEE ALSO
> >>>>>> +.BR splice (2)
> >>>>>> --
> >>>>>> 2.5.1
> >>>>>>
> >>>>>> --
> >>>>>> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> >>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
> >>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>>>
> >>> --
> >>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> >>> the body of a message to majordomo@xxxxxxxxxxxxxxx
> >>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>>
> >>
> >>
> > 
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [Samba]     [Device Mapper]     [CEPH Development]
  Powered by Linux