Re: [PATCH v1 9/8] copy_file_range.2: New page documenting copy_file_range()

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 09/09/2015 01:17 PM, Darrick J. Wong wrote:
> On Wed, Sep 09, 2015 at 07:38:14AM -0400, Austin S Hemmelgarn wrote:
>> On 2015-09-08 16:39, Darrick J. Wong wrote:
>>> On Tue, Sep 08, 2015 at 11:04:03AM -0400, Anna Schumaker wrote:
>>>> On 09/04/2015 05:38 PM, Darrick J. Wong wrote:
>>>>> On Fri, Sep 04, 2015 at 04:17:03PM -0400, Anna Schumaker wrote:
>>>>>> copy_file_range() is a new system call for copying ranges of data
>>>>>> completely in the kernel.  This gives filesystems an opportunity to
>>>>>> implement some kind of "copy acceleration", such as reflinks or
>>>>>> server-side-copy (in the case of NFS).
>>>>>>
>>>>>> Signed-off-by: Anna Schumaker <Anna.Schumaker@xxxxxxxxxx>
>>>>>> ---
>>>>>>  man2/copy_file_range.2 | 168 +++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>  1 file changed, 168 insertions(+)
>>>>>>  create mode 100644 man2/copy_file_range.2
>>>>>>
>>>>>> diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2
>>>>>> new file mode 100644
>>>>>> index 0000000..4a4cb73
>>>>>> --- /dev/null
>>>>>> +++ b/man2/copy_file_range.2
>>>>>> @@ -0,0 +1,168 @@
>>>>>> +.\"This manpage is Copyright (C) 2015 Anna Schumaker <Anna.Schumaker@xxxxxxxxxx>
>>>>>> +.TH COPY 2 2015-8-31 "Linux" "Linux Programmer's Manual"
>>>>>> +.SH NAME
>>>>>> +copy_file_range \- Copy a range of data from one file to another
>>>>>> +.SH SYNOPSIS
>>>>>> +.nf
>>>>>> +.B #include <linux/copy.h>
>>>>>> +.B #include <sys/syscall.h>
>>>>>> +.B #include <unistd.h>
>>>>>> +
>>>>>> +.BI "ssize_t syscall(__NR_copy_file_range, int " fd_in ", loff_t * " off_in ",
>>>>>> +.BI "                int " fd_out ", loff_t * " off_out ", size_t " len ",
>>>>>> +.BI "                unsigned int " flags );
>>>>>> +.fi
>>>>>> +.SH DESCRIPTION
>>>>>> +The
>>>>>> +.BR copy_file_range ()
>>>>>> +system call performs an in-kernel copy between two file descriptors
>>>>>> +without all that tedious mucking about in userspace.
>>>>>
>>>>> ;)
>>>>>
>>>>>> +It copies up to
>>>>>> +.I len
>>>>>> +bytes of data from file descriptor
>>>>>> +.I fd_in
>>>>>> +to file descriptor
>>>>>> +.I fd_out
>>>>>> +at
>>>>>> +.IR off_out .
>>>>>> +The file descriptors must not refer to the same file.
>>>>>
>>>>> Why?  btrfs (and XFS) reflink can handle the case of a file sharing blocks
>>>>> with itself.
>>>>
>>>> I've never really thought about it... Zach had that in his initial
>>>> submission, so mentioned it in the man page.  Should I remove that bit?
>>>
>>> Yes, please!
>>>
>>> I could be wrong, but I think btrfs only started supporting files that share
>>> blocks with themselves relatively recently(?)
>>>
>>> I'm not sure why zab added this; was hoping he'd speak up. ;)
>>>
>>>>
>>>>>
>>>>>> +
>>>>>> +The following semantics apply for
>>>>>> +.IR fd_in ,
>>>>>> +and similar statements apply to
>>>>>> +.IR off_out :
>>>>>> +.IP * 3
>>>>>> +If
>>>>>> +.I off_in
>>>>>> +is NULL, then bytes are read from
>>>>>> +.I fd_in
>>>>>> +starting from the current file offset and the current
>>>>>> +file offset is adjusted appropriately.
>>>>>> +.IP *
>>>>>> +If
>>>>>> +.I off_in
>>>>>> +is not NULL, then
>>>>>> +.I off_in
>>>>>> +must point to a buffer that specifies the starting
>>>>>> +offset where bytes from
>>>>>> +.I fd_in
>>>>>> +will be read.  The current file offset of
>>>>>> +.I fd_in
>>>>>> +is not changed, but
>>>>>> +.I off_in
>>>>>> +is adjusted appropriately.
>>>>>> +.PP
>>>>>> +The default behavior of
>>>>>> +.BR copy_file_range ()
>>>>>> +is filesystem specific, and might result in creating a
>>>>>> +copy-on-write reflink.
>>>>>> +In the event that a given filesystem does not implement
>>>>>> +any form of copy acceleration, the kernel will perform
>>>>>> +a deep copy of the requested range by reading bytes from
>>>>>
>>>>> I wonder if it's wise to allow deep copies -- what happens if len == 1T?
>>>>> Will this syscall just block for a really long time?
>>>>
>>>> We use rw_verify_area(), (similar to read and write) so we won't allow a
>>>> value of len that long.  I can mention this in an updated version of this man
>>>> page!
>>>
>>> Ok.  I guess MAX_RW_COUNT limits us to about 4G at once, which for a splice
> 
> Heh, INT_MAX, so 2GB at once.
> 
>>> copy is probably reasonable.
>>>
>>> The reason why I asked about len == 1T specifically is that I can (with
>>> somewhat long delays) reflink about 260 million extents at a time on XFS,
>>> which is about 1TB.  Given that locks get held for the duration, it's probably
>>> not a bad thing to limit userspace to 4G at a time.
>>
>> I'd personally love to see that be tunable by a sysctl (kind of like
>> how you can control the maximum number of AIO requests in flight),
>> and for that matter we might want to be able to limit the number of
>> in-progress copies going on.
> 
> Now that I think about it, btrfs' reflink ioctl doesn't seem to have any
> particular limit on how much you can reflink in a single call.  XFS doesn't
> have a limit either.  Given that reflink should create a tiny amount of IO
> compared to the number of bytes being manipulated, should we allow a higher
> limit when ssize_t is large enough?
> 
> Copy-through-the-pagecache should stick to MAX_RW_COUNT.

Should I keep rejecting pagecache copies if len > MAX_RW_COUNT?  Or would it be okay to change the value of len to MAX_RW_COUNT in this case?

Anna

> 
> I noticed that btrfs won't dedupe more than 16M per call.  Any thoughts?
> 
> --D
> 
>>>
>>> (But hey, it's fun to stress-test once in a while. :))
>>>
>>> --D
>>>
>>>>
>>>>
>>>>>
>>>>>> +.I fd_in
>>>>>> +and writing them to
>>>>>> +.IR fd_out .
>>>>>
>>>>> "...if COPY_REFLINK is not set in flags."
>>>>
>>>> Sure.
>>>>
>>>>>
>>>>>> +
>>>>>> +Currently, Linux only supports the following flag:
>>>>>> +.TP 1.9i
>>>>>> +.B COPY_REFLINK
>>>>>> +Only perform the copy if the filesystem can do it as a reflink.
>>>>>> +Do not fall back on performing a deep copy.
>>>>>> +.SH RETURN VALUE
>>>>>> +Upon successful completion,
>>>>>> +.BR copy_file_range ()
>>>>>> +will return the number of bytes copied between files.
>>>>>> +This could be less than the length originally requested.
>>>>>> +
>>>>>> +On error,
>>>>>> +.BR copy_file_range ()
>>>>>> +returns \-1 and
>>>>>> +.I errno
>>>>>> +is set to indicate the error.
>>>>>> +.SH ERRORS
>>>>>> +.TP
>>>>>> +.B EBADF
>>>>>> +One or more file descriptors are not valid,
>>>>>> +or do not have proper read-write mode.
>>>>>
>>>>> "or fd_out is not opened for writing"?
>>>>
>>>> I'll add that.
>>>>
>>>>>
>>>>>> +.TP
>>>>>> +.B EINVAL
>>>>>> +Requested range extends beyond the end of the file;
>>>>>> +.I flags
>>>>>> +argument is set to an invalid value.
>>>>>> +.TP
>>>>>> +.B EOPNOTSUPP
>>>>>> +.B COPY_REFLINK
>>>>>> +was specified in
>>>>>> +.IR flags ,
>>>>>> +but the target filesystem does not support reflinks.
>>>>>> +.TP
>>>>>> +.B EXDEV
>>>>>> +Target filesystem doesn't support cross-filesystem copies.
>>>>>> +.SH VERSIONS
>>>>>
>>>>> Perhaps this ought to list a few more errors (EIO, ENOSPC, ENOSYS, EPERM...)
>>>>> that can be returned?  (I was looking at the fallocate manpage.)
>>>>
>>>> Okay.  I'll poke around for what else could be returned!
>>>>
>>>> Thanks,
>>>> Anna
>>>>
>>>>>
>>>>> --D
>>>>>
>>>>>> +The
>>>>>> +.BR copy_file_range ()
>>>>>> +system call first appeared in Linux 4.3.
>>>>>> +.SH CONFORMING TO
>>>>>> +The
>>>>>> +.BR copy_file_range ()
>>>>>> +system call is a nonstandard Linux extension.
>>>>>> +.SH EXAMPLE
>>>>>> +.nf
>>>>>> +
>>>>>> +#define _GNU_SOURCE
>>>>>> +#include <fcntl.h>
>>>>>> +#include <linux/copy.h>
>>>>>> +#include <stdio.h>
>>>>>> +#include <stdlib.h>
>>>>>> +#include <sys/stat.h>
>>>>>> +#include <sys/syscall.h>
>>>>>> +#include <unistd.h>
>>>>>> +
>>>>>> +
>>>>>> +int main(int argc, char **argv)
>>>>>> +{
>>>>>> +    int fd_in, fd_out;
>>>>>> +    struct stat stat;
>>>>>> +    loff_t len, ret;
>>>>>> +
>>>>>> +    if (argc != 3) {
>>>>>> +        fprintf(stderr, "Usage: %s <pathname> <pathname>\n", argv[0]);
>>>>>> +        exit(EXIT_FAILURE);
>>>>>> +    }
>>>>>> +
>>>>>> +    fd_in = open(argv[1], O_RDONLY);
>>>>>> +    if (fd_in == -1) {
>>>>>> +        perror("open (argv[1])");
>>>>>> +        exit(EXIT_FAILURE);
>>>>>> +    }
>>>>>> +
>>>>>> +    if (fstat(fd_in, &stat) == -1) {
>>>>>> +        perror("fstat");
>>>>>> +        exit(EXIT_FAILURE);
>>>>>> +    }
>>>>>> +    len = stat.st_size;
>>>>>> +
>>>>>> +    fd_out = open(argv[2], O_WRONLY | O_CREAT, 0644);
>>>>>> +    if (fd_out == -1) {
>>>>>> +        perror("open (argv[2])");
>>>>>> +        exit(EXIT_FAILURE);
>>>>>> +    }
>>>>>> +
>>>>>> +    do {
>>>>>> +        ret = syscall(__NR_copy_file_range, fd_in, NULL,
>>>>>> +                      fd_out, NULL, len, 0);
>>>>>> +        if (ret == -1) {
>>>>>> +            perror("copy_file_range");
>>>>>> +            exit(EXIT_FAILURE);
>>>>>> +        }
>>>>>> +
>>>>>> +        len -= ret;
>>>>>> +    } while (len > 0);
>>>>>> +
>>>>>> +    close(fd_in);
>>>>>> +    close(fd_out);
>>>>>> +    exit(EXIT_SUCCESS);
>>>>>> +}
>>>>>> +.fi
>>>>>> +.SH SEE ALSO
>>>>>> +.BR splice (2)
>>>>>> --
>>>>>> 2.5.1
>>>>>>
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
>>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>
>>
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [Linux Filesystem Development]     [Linux USB Development]     [Linux Media Development]     [Video for Linux]     [Linux NILFS]     [Linux Audio Users]     [Yosemite Info]     [Linux SCSI]

  Powered by Linux