On 09/04/2015 05:38 PM, Darrick J. Wong wrote:
> On Fri, Sep 04, 2015 at 04:17:03PM -0400, Anna Schumaker wrote:
>> copy_file_range() is a new system call for copying ranges of data
>> completely in the kernel. This gives filesystems an opportunity to
>> implement some kind of "copy acceleration", such as reflinks or
>> server-side-copy (in the case of NFS).
>>
>> Signed-off-by: Anna Schumaker <Anna.Schumaker@xxxxxxxxxx>
>> ---
>> man2/copy_file_range.2 | 168 +++++++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 168 insertions(+)
>> create mode 100644 man2/copy_file_range.2
>>
>> diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2
>> new file mode 100644
>> index 0000000..4a4cb73
>> --- /dev/null
>> +++ b/man2/copy_file_range.2
>> @@ -0,0 +1,168 @@
>> +.\"This manpage is Copyright (C) 2015 Anna Schumaker <Anna.Schumaker@xxxxxxxxxx>
>> +.TH COPY 2 2015-8-31 "Linux" "Linux Programmer's Manual"
>> +.SH NAME
>> +copy_file_range \- Copy a range of data from one file to another
>> +.SH SYNOPSIS
>> +.nf
>> +.B #include <linux/copy.h>
>> +.B #include <sys/syscall.h>
>> +.B #include <unistd.h>
>> +
>> +.BI "ssize_t syscall(__NR_copy_file_range, int " fd_in ", loff_t * " off_in ",
>> +.BI " int " fd_out ", loff_t * " off_out ", size_t " len ",
>> +.BI " unsigned int " flags );
>> +.fi
>> +.SH DESCRIPTION
>> +The
>> +.BR copy_file_range ()
>> +system call performs an in-kernel copy between two file descriptors
>> +without all that tedious mucking about in userspace.
>
> ;)
>
>> +It copies up to
>> +.I len
>> +bytes of data from file descriptor
>> +.I fd_in
>> +to file descriptor
>> +.I fd_out
>> +at
>> +.IR off_out .
>> +The file descriptors must not refer to the same file.
>
> Why? btrfs (and XFS) reflink can handle the case of a file sharing blocks
> with itself.

I've never really thought about it... Zach had that in his initial submission, so I mentioned it in the man page. Should I remove that bit?
>
>> +
>> +The following semantics apply for
>> +.IR fd_in ,
>> +and similar statements apply to
>> +.IR off_out :
>> +.IP * 3
>> +If
>> +.I off_in
>> +is NULL, then bytes are read from
>> +.I fd_in
>> +starting from the current file offset and the current
>> +file offset is adjusted appropriately.
>> +.IP *
>> +If
>> +.I off_in
>> +is not NULL, then
>> +.I off_in
>> +must point to a buffer that specifies the starting
>> +offset where bytes from
>> +.I fd_in
>> +will be read. The current file offset of
>> +.I fd_in
>> +is not changed, but
>> +.I off_in
>> +is adjusted appropriately.
>> +.PP
>> +The default behavior of
>> +.BR copy_file_range ()
>> +is filesystem specific, and might result in creating a
>> +copy-on-write reflink.
>> +In the event that a given filesystem does not implement
>> +any form of copy acceleration, the kernel will perform
>> +a deep copy of the requested range by reading bytes from
>
> I wonder if it's wise to allow deep copies -- what happens if len == 1T?
> Will this syscall just block for a really long time?

We use rw_verify_area() (similar to read and write), so we won't allow a value of len that large. I can mention this in an updated version of this man page!

>
>> +.I fd_in
>> +and writing them to
>> +.IR fd_out .
>
> "...if COPY_REFLINK is not set in flags."

Sure.

>
>> +
>> +Currently, Linux only supports the following flag:
>> +.TP 1.9i
>> +.B COPY_REFLINK
>> +Only perform the copy if the filesystem can do it as a reflink.
>> +Do not fall back on performing a deep copy.
>> +.SH RETURN VALUE
>> +Upon successful completion,
>> +.BR copy_file_range ()
>> +will return the number of bytes copied between files.
>> +This could be less than the length originally requested.
>> +
>> +On error,
>> +.BR copy_file_range ()
>> +returns \-1 and
>> +.I errno
>> +is set to indicate the error.
>> +.SH ERRORS
>> +.TP
>> +.B EBADF
>> +One or more file descriptors are not valid,
>> +or do not have proper read-write mode.
>
> "or fd_out is not opened for writing"?

I'll add that.

>
>> +.TP
>> +.B EINVAL
>> +Requested range extends beyond the end of the file;
>> +.I flags
>> +argument is set to an invalid value.
>> +.TP
>> +.B EOPNOTSUPP
>> +.B COPY_REFLINK
>> +was specified in
>> +.IR flags ,
>> +but the target filesystem does not support reflinks.
>> +.TP
>> +.B EXDEV
>> +Target filesystem doesn't support cross-filesystem copies.
>> +.SH VERSIONS
>
> Perhaps this ought to list a few more errors (EIO, ENOSPC, ENOSYS, EPERM...)
> that can be returned? (I was looking at the fallocate manpage.)

Okay. I'll poke around for what else could be returned!

Thanks,
Anna

>
> --D
>
>> +The
>> +.BR copy_file_range ()
>> +system call first appeared in Linux 4.3.
>> +.SH CONFORMING TO
>> +The
>> +.BR copy_file_range ()
>> +system call is a nonstandard Linux extension.
>> +.SH EXAMPLE
>> +.nf
>> +
>> +#define _GNU_SOURCE
>> +#include <fcntl.h>
>> +#include <linux/copy.h>
>> +#include <stdio.h>
>> +#include <stdlib.h>
>> +#include <sys/stat.h>
>> +#include <sys/syscall.h>
>> +#include <unistd.h>
>> +
>> +
>> +int main(int argc, char **argv)
>> +{
>> +    int fd_in, fd_out;
>> +    struct stat stat;
>> +    loff_t len, ret;
>> +
>> +    if (argc != 3) {
>> +        fprintf(stderr, "Usage: %s <pathname> <pathname>\n", argv[0]);
>> +        exit(EXIT_FAILURE);
>> +    }
>> +
>> +    fd_in = open(argv[1], O_RDONLY);
>> +    if (fd_in == -1) {
>> +        perror("open (argv[1])");
>> +        exit(EXIT_FAILURE);
>> +    }
>> +
>> +    if (fstat(fd_in, &stat) == -1) {
>> +        perror("fstat");
>> +        exit(EXIT_FAILURE);
>> +    }
>> +    len = stat.st_size;
>> +
>> +    fd_out = open(argv[2], O_WRONLY | O_CREAT, 0644);
>> +    if (fd_out == -1) {
>> +        perror("open (argv[2])");
>> +        exit(EXIT_FAILURE);
>> +    }
>> +
>> +    do {
>> +        ret = syscall(__NR_copy_file_range, fd_in, NULL,
>> +                      fd_out, NULL, len, 0);
>> +        if (ret == -1) {
>> +            perror("copy_file_range");
>> +            exit(EXIT_FAILURE);
>> +        }
>> +
>> +        len -= ret;
>> +    } while (len > 0);
>> +
>> +    close(fd_in);
>> +    close(fd_out);
>> +    exit(EXIT_SUCCESS);
>> +}
>> +.fi
>> +.SH SEE ALSO
>> +.BR splice (2)
>> --
>> 2.5.1
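
Since the thread touches on short copies ("This could be less than the length originally requested") and on falling back to a deep copy, here is a minimal sketch of how a caller might handle both. It is not taken from the patch. It assumes the copy_file_range() wrapper that glibc later shipped in <unistd.h> (glibc 2.27 and newer) rather than the syscall(__NR_copy_file_range, ...) form quoted above, and it passes flags as 0. The helper names copy_all() and deep_copy() are hypothetical, and the set of errno values that trigger the userspace fallback is my own guess, not something the man page draft specifies.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: plain read()/write() loop, i.e. a userspace deep copy.
 * Returns the number of bytes copied (may be short at EOF), or -1 on error. */
static ssize_t deep_copy(int fd_in, int fd_out, size_t len)
{
    char buf[65536];
    size_t total = 0;

    while (total < len) {
        size_t want = len - total < sizeof(buf) ? len - total : sizeof(buf);
        ssize_t n = read(fd_in, buf, want);

        if (n == -1)
            return -1;
        if (n == 0)                 /* EOF before len bytes: stop here */
            break;
        for (ssize_t off = 0; off < n; ) {
            ssize_t w = write(fd_out, buf + off, (size_t)(n - off));
            if (w == -1)
                return -1;
            off += w;
        }
        total += (size_t)n;
    }
    return (ssize_t)total;
}

/* Hypothetical helper: copy up to len bytes between the current offsets of
 * fd_in and fd_out, retrying after short copies.  Returns bytes copied or -1. */
static ssize_t copy_all(int fd_in, int fd_out, size_t len)
{
    size_t total = 0;

    while (total < len) {
        /* NULL offsets: the kernel uses and advances both file offsets. */
        ssize_t n = copy_file_range(fd_in, NULL, fd_out, NULL, len - total, 0);

        if (n == -1) {
            /* Assumed "not usable here" errors; everything else is fatal. */
            if (errno == ENOSYS || errno == EXDEV ||
                errno == EINVAL || errno == EOPNOTSUPP) {
                ssize_t r = deep_copy(fd_in, fd_out, len - total);
                return r == -1 ? -1 : (ssize_t)total + r;
            }
            return -1;
        }
        if (n == 0)                 /* source hit EOF: avoid looping forever */
            break;
        assert((size_t)n <= len - total);
        total += (size_t)n;
    }
    return (ssize_t)total;
}
```

A caller would open both files as in the EXAMPLE section and then call copy_all(fd_in, fd_out, st.st_size). Because both paths use the descriptors' own file offsets, the fallback can pick up wherever the in-kernel copy stopped. Note the explicit check for a return value of 0: a loop that only tests for -1 would spin if the source file shrank after the fstat().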