On Tue, Sep 08, 2015 at 11:04:03AM -0400, Anna Schumaker wrote:
> On 09/04/2015 05:38 PM, Darrick J. Wong wrote:
> > On Fri, Sep 04, 2015 at 04:17:03PM -0400, Anna Schumaker wrote:
> >> copy_file_range() is a new system call for copying ranges of data
> >> completely in the kernel. This gives filesystems an opportunity to
> >> implement some kind of "copy acceleration", such as reflinks or
> >> server-side-copy (in the case of NFS).
> >>
> >> Signed-off-by: Anna Schumaker <Anna.Schumaker@xxxxxxxxxx>
> >> ---
> >> man2/copy_file_range.2 | 168 +++++++++++++++++++++++++++++++++++++++++++++++++
> >> 1 file changed, 168 insertions(+)
> >> create mode 100644 man2/copy_file_range.2
> >>
> >> diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2
> >> new file mode 100644
> >> index 0000000..4a4cb73
> >> --- /dev/null
> >> +++ b/man2/copy_file_range.2
> >> @@ -0,0 +1,168 @@
> >> +.\"This manpage is Copyright (C) 2015 Anna Schumaker <Anna.Schumaker@xxxxxxxxxx>
> >> +.TH COPY 2 2015-8-31 "Linux" "Linux Programmer's Manual"
> >> +.SH NAME
> >> +copy_file_range \- Copy a range of data from one file to another
> >> +.SH SYNOPSIS
> >> +.nf
> >> +.B #include <linux/copy.h>
> >> +.B #include <sys/syscall.h>
> >> +.B #include <unistd.h>
> >> +
> >> +.BI "ssize_t syscall(__NR_copy_file_range, int " fd_in ", loff_t * " off_in ",
> >> +.BI " int " fd_out ", loff_t * " off_out ", size_t " len ",
> >> +.BI " unsigned int " flags );
> >> +.fi
> >> +.SH DESCRIPTION
> >> +The
> >> +.BR copy_file_range ()
> >> +system call performs an in-kernel copy between two file descriptors
> >> +without all that tedious mucking about in userspace.
> >
> > ;)
> >
> >> +It copies up to
> >> +.I len
> >> +bytes of data from file descriptor
> >> +.I fd_in
> >> +to file descriptor
> >> +.I fd_out
> >> +at
> >> +.IR off_out .
> >> +The file descriptors must not refer to the same file.
> >
> > Why? btrfs (and XFS) reflink can handle the case of a file sharing blocks
> > with itself.
> >
> I've never really thought about it... Zach had that in his initial
> submission, so mentioned it in the man page. Should I remove that bit?

Yes, please!

I could be wrong, but I think btrfs only started supporting files that share
blocks with themselves relatively recently(?)

I'm not sure why zab added this; was hoping he'd speak up. ;)

>
> >
> >> +
> >> +The following semantics apply for
> >> +.IR fd_in ,
> >> +and similar statements apply to
> >> +.IR off_out :
> >> +.IP * 3
> >> +If
> >> +.I off_in
> >> +is NULL, then bytes are read from
> >> +.I fd_in
> >> +starting from the current file offset and the current
> >> +file offset is adjusted appropriately.
> >> +.IP *
> >> +If
> >> +.I off_in
> >> +is not NULL, then
> >> +.I off_in
> >> +must point to a buffer that specifies the starting
> >> +offset where bytes from
> >> +.I fd_in
> >> +will be read. The current file offset of
> >> +.I fd_in
> >> +is not changed, but
> >> +.I off_in
> >> +is adjusted appropriately.
> >> +.PP
> >> +The default behavior of
> >> +.BR copy_file_range ()
> >> +is filesystem specific, and might result in creating a
> >> +copy-on-write reflink.
> >> +In the event that a given filesystem does not implement
> >> +any form of copy acceleration, the kernel will perform
> >> +a deep copy of the requested range by reading bytes from
> >
> > I wonder if it's wise to allow deep copies -- what happens if len == 1T?
> > Will this syscall just block for a really long time?
>
> We use rw_verify_area(), (similar to read and write) so we won't allow a
> value of len that long. I can mention this in an updated version of this man
> page!

Ok. I guess MAX_RW_COUNT limits us to about 4G at once, which for a splice
copy is probably reasonable.

The reason why I asked about len == 1T specifically is that I can (with
somewhat long delays) reflink about 260 million extents at a time on XFS,
which is about 1TB.
Given that locks get held for the duration, it's probably not a bad thing
to limit userspace to 4G at a time.

(But hey, it's fun to stress-test once in a while. :))

--D

>
>
> >
> >> +.I fd_in
> >> +and writing them to
> >> +.IR fd_out .
> >
> > "...if COPY_REFLINK is not set in flags."
>
> Sure.
>
> >
> >> +
> >> +Currently, Linux only supports the following flag:
> >> +.TP 1.9i
> >> +.B COPY_REFLINK
> >> +Only perform the copy if the filesystem can do it as a reflink.
> >> +Do not fall back on performing a deep copy.
> >> +.SH RETURN VALUE
> >> +Upon successful completion,
> >> +.BR copy_file_range ()
> >> +will return the number of bytes copied between files.
> >> +This could be less than the length originally requested.
> >> +
> >> +On error,
> >> +.BR copy_file_range ()
> >> +returns \-1 and
> >> +.I errno
> >> +is set to indicate the error.
> >> +.SH ERRORS
> >> +.TP
> >> +.B EBADF
> >> +One or more file descriptors are not valid,
> >> +or do not have proper read-write mode.
> >
> > "or fd_out is not opened for writing"?
>
> I'll add that.
>
> >
> >> +.TP
> >> +.B EINVAL
> >> +Requested range extends beyond the end of the file;
> >> +.I flags
> >> +argument is set to an invalid value.
> >> +.TP
> >> +.B EOPNOTSUPP
> >> +.B COPY_REFLINK
> >> +was specified in
> >> +.IR flags ,
> >> +but the target filesystem does not support reflinks.
> >> +.TP
> >> +.B EXDEV
> >> +Target filesystem doesn't support cross-filesystem copies.
> >> +.SH VERSIONS
> >
> > Perhaps this ought to list a few more errors (EIO, ENOSPC, ENOSYS, EPERM...)
> > that can be returned? (I was looking at the fallocate manpage.)
>
> Okay. I'll poke around for what else could be returned!
>
> Thanks,
> Anna
>
> >
> > --D
> >
> >> +The
> >> +.BR copy_file_range ()
> >> +system call first appeared in Linux 4.3.
> >> +.SH CONFORMING TO
> >> +The
> >> +.BR copy_file_range ()
> >> +system call is a nonstandard Linux extension.
> >> +.SH EXAMPLE
> >> +.nf
> >> +
> >> +#define _GNU_SOURCE
> >> +#include <fcntl.h>
> >> +#include <linux/copy.h>
> >> +#include <stdio.h>
> >> +#include <stdlib.h>
> >> +#include <sys/stat.h>
> >> +#include <sys/syscall.h>
> >> +#include <unistd.h>
> >> +
> >> +
> >> +int main(int argc, char **argv)
> >> +{
> >> +    int fd_in, fd_out;
> >> +    struct stat stat;
> >> +    loff_t len, ret;
> >> +
> >> +    if (argc != 3) {
> >> +        fprintf(stderr, "Usage: %s <pathname> <pathname>\n", argv[0]);
> >> +        exit(EXIT_FAILURE);
> >> +    }
> >> +
> >> +    fd_in = open(argv[1], O_RDONLY);
> >> +    if (fd_in == -1) {
> >> +        perror("open (argv[1])");
> >> +        exit(EXIT_FAILURE);
> >> +    }
> >> +
> >> +    if (fstat(fd_in, &stat) == -1) {
> >> +        perror("fstat");
> >> +        exit(EXIT_FAILURE);
> >> +    }
> >> +    len = stat.st_size;
> >> +
> >> +    fd_out = open(argv[2], O_WRONLY | O_CREAT, 0644);
> >> +    if (fd_out == -1) {
> >> +        perror("open (argv[2])");
> >> +        exit(EXIT_FAILURE);
> >> +    }
> >> +
> >> +    do {
> >> +        ret = syscall(__NR_copy_file_range, fd_in, NULL,
> >> +                      fd_out, NULL, len, 0);
> >> +        if (ret == -1) {
> >> +            perror("copy_file_range");
> >> +            exit(EXIT_FAILURE);
> >> +        }
> >> +
> >> +        len -= ret;
> >> +    } while (len > 0);
> >> +
> >> +    close(fd_in);
> >> +    close(fd_out);
> >> +    exit(EXIT_SUCCESS);
> >> +}
> >> +.fi
> >> +.SH SEE ALSO
> >> +.BR splice (2)
> >> --
> >> 2.5.1
> >>
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> >> the body of a message to majordomo@xxxxxxxxxxxxxxx
> >> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
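
Two points from the discussion above, that a call may copy fewer bytes than
requested and that it is reasonable to bound how much userspace asks for at
once, can be sketched as a bounded loop. This is only a sketch under
assumptions that are not in the patch: it uses the copy_file_range() wrapper
that glibc 2.27 and later declare in <unistd.h> rather than the
syscall(__NR_copy_file_range, ...) form from the man page, the helper name
copy_range_chunked() and its chunk parameter are hypothetical, and flags is
left at 0. A return of 0 is treated as end of input so the loop cannot spin.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of the patch: copy up to `total` bytes
 * from fd_in to fd_out at the current file offsets of both descriptors,
 * asking the kernel for at most `chunk` bytes per call.
 * Returns the number of bytes copied, or -1 with errno set.
 */
ssize_t copy_range_chunked(int fd_in, int fd_out, size_t total, size_t chunk)
{
    size_t copied = 0;

    if (chunk == 0) {
        errno = EINVAL;
        return -1;
    }

    while (copied < total) {
        size_t want = total - copied;
        if (want > chunk)
            want = chunk;   /* bound the size of each request */

        /* NULL offsets: use and advance the descriptors' own offsets. */
        ssize_t ret = copy_file_range(fd_in, NULL, fd_out, NULL, want, 0);
        if (ret == -1)
            return -1;      /* errno was set by copy_file_range() */
        if (ret == 0)
            break;          /* end of input: stop rather than loop forever */

        copied += (size_t)ret;  /* short copies are fine; go around again */
    }

    return (ssize_t)copied;
}
```

A caller would pass the st_size obtained from fstat() as total and some
modest chunk size. The helper reports how many bytes were actually copied,
so the caller can tell when the input turned out to be shorter than
expected.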