On 09/14/2015 02:32 PM, Darrick J. Wong wrote: > On Sun, Sep 13, 2015 at 09:50:18AM +0200, Michael Kerrisk (man-pages) wrote: >> Hi Anna, >> >> On 09/11/2015 10:30 PM, Anna Schumaker wrote: >>> copy_file_range() is a new system call for copying ranges of data >>> completely in the kernel. This gives filesystems an opportunity to >>> implement some kind of "copy acceleration", such as reflinks or >>> server-side-copy (in the case of NFS). >>> >>> Signed-off-by: Anna Schumaker <Anna.Schumaker@xxxxxxxxxx> >> >> Thanks for writing such a nice first draft man page! I have a few >> comments below. Would you be willing to revise and resubmit? >> >>> --- >>> man2/copy_file_range.2 | 188 +++++++++++++++++++++++++++++++++++++++++++++++++ >>> 1 file changed, 188 insertions(+) >>> create mode 100644 man2/copy_file_range.2 >>> >>> diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2 >>> new file mode 100644 >>> index 0000000..84912b5 >>> --- /dev/null >>> +++ b/man2/copy_file_range.2 >>> @@ -0,0 +1,188 @@ >>> +.\"This manpage is Copyright (C) 2015 Anna Schumaker <Anna.Schumaker@xxxxxxxxxx> >> >> We need a license for this page. Please see >> https://www.kernel.org/doc/man-pages/licenses.html >> >>> +.TH COPY 2 2015-8-31 "Linux" "Linux Programmer's Manual" >> >> Make the month 2 digits (leading 0). >> >>> +.SH NAME >>> +copy_file_range \- Copy a range of data from one file to another >>> +.SH SYNOPSIS >>> +.nf >>> +.B #include <linux/copy.h> >>> +.B #include <sys/syscall.h> >>> +.B #include <unistd.h> >>> + >>> +.BI "ssize_t syscall(__NR_copy_file_range, int " fd_in ", loff_t * " off_in ", >>> +.BI " int " fd_out ", loff_t * " off_out ", size_t " len ", >>> +.BI " unsigned int " flags ); >> >> Remove spaces following "*" in the lines above. (man-pages convention for code) >> >> I know that the copy_file_range() (obviously) doesn't yet have a wrapper >> in glibc, but in the man pages we document all system calls as though there >> is a wrapper. See, for example, gettid(2), for an axample of how this is done >> (a note in the SYNOPSIS and then some further details in NOTES). >> >>> +.fi >>> +.SH DESCRIPTION >>> +The >>> +.BR copy_file_range () >>> +system call performs an in-kernel copy between two file descriptors >>> +without all that tedious mucking about in userspace. >> >> I'd write that last piece as: >> >> "without the cost of (a loop) transferring data from the kernel to a >> user-space buffer and then back to the kernel again. >> >>> +It copies up to >>> +.I len >>> +bytes of data from file descriptor >>> +.I fd_in >>> +to file descriptor >>> +.IR fd_out , >>> +overwriting any data that exists within the requested range. >> >> s/.$/ of the target file./ >> >>> + >>> +The following semantics apply for >>> +.IR off_in , >>> +and similar statements apply to >>> +.IR off_out : >>> +.IP * 3 >>> +If >>> +.I off_in >>> +is NULL, then bytes are read from >>> +.I fd_in >>> +starting from the current file offset and the current >>> +file offset is adjusted appropriately. >>> +.IP * >>> +If >>> +.I off_in >>> +is not NULL, then >>> +.I off_in >>> +must point to a buffer that specifies the starting >>> +offset where bytes from >>> +.I fd_in >>> +will be read. The current file offset of >>> +.I fd_in >>> +is not changed, but >>> +.I off_in >>> +is adjusted appropriately. >>> +.PP >>> + >>> +The >>> +.I flags >>> +argument is a bit mask composed by OR-ing together zero >>> +or more of the following flags: >>> +.TP 1.9i >>> +.B COPY_FR_COPY >>> +Copy all the file data in the requested range. >>> +Some filesystems, like NFS, might be able to accelerate this copy >>> +to avoid unnecessary data transfers. >>> +.TP >>> +.B COPY_FR_REFLINK >>> +Create a lightweight "reflink", where data is not copied until >>> +one of the files is modified. >>> +.PP >>> +The default behavior >>> +.RI ( flags >>> +== 0) is to try creating a reflink, >>> +and if reflinking fails >>> +.BR copy_file_range () >>> +will fall back on performing a full data copy. >> >> s/back on/back to/ >> >>> +This is equivalent to setting >>> +.I flags >>> +equal to >>> +.RB ( COPY_FR_COPY | COPY_FR_REFLINK ). >> >> So, from an API deign perspective, the interoperation of these two >> flags appears odd. Bit flags are commonly (not always) orthogonal. >> I think here it could be pointed out a little more explicitly that >> these two flags are not orthogonal. In particular, perhaps after the >> last sentence, you could add another sentence: >> >> [[ >> (This contrasts with specifying >> .I flags >> as just >> .BR COPY_FR_REFLINK , >> which causes the call to create a reflink, >> and fail if that is not possible, >> rather than falling back to a full data copy.) >> ]] >> >> Furthermore, I even wonder if explicitly specifying flags as >> COPY_FR_COPY | COPY_FR_REFLINK should just generate an EINVAL >> error. 0 already gives us the behavior described above, >> and allowing the combination COPY_FR_COPY | COPY_FR_REFLINK >> perhaps just contributes to misleading the user that these >> flags are orthogonal, when in reality they are not. What do >> you think? > > Personally, I think it's a little weird that one turns on reflink with a flag; > turns on regular copy with a different flag; and turns on both by not > specifying either flag. :) Is there a better behavior for flags=0? I was thinking this would be what people want when they don't care how the copy happens in the kernel. Anna > >> What are the semantics with respect to signals, especially if data >> copying a very large file range? This info should be included in the >> man page, probably under NOTES. >> >>> +.SH RETURN VALUE >>> +Upon successful completion, >>> +.BR copy_file_range () >>> +will return the number of bytes copied between files. >>> +This could be less than the length originally requested. >>> + >>> +On error, >>> +.BR copy_file_range () >>> +returns \-1 and >>> +.I errno >>> +is set to indicate the error. >>> +.SH ERRORS >>> +.TP >>> +.B EBADF >>> +One or more file descriptors are not valid, >>> +or do not have proper read-write mode; >> >> I think that last line can go. I mean, isn't this point (again) >> covered in the next few lines? > > Or change the ';' to a ':'? > >>> +.I fd_in >>> +is not open for reading; or >>> +.I fd_out >>> +is not open for writing. >>> +.TP >>> +.B EINVAL >>> +Requested range extends beyond the end of the file; or the >> >> s/file/source file/ (right?) >> >>> +.I flags >>> +argument is set to an invalid value. >>> +.TP >>> +.B EIO >>> +A low level I/O error occurred while copying. >>> +.TP >>> +.B ENOMEM >>> +Out of memory. >>> +.TP >>> +.B ENOSPC >>> +There is not enough space to complete the copy. >> >> Space where? On the filesystem? >> => s/space/space on the target filesystem/ >> >>> +.TP >>> +.B EOPNOTSUPP >>> +.B COPY_REFLINK >>> +was specified in >>> +.IR flags , >>> +but the target filesystem does not support reflinks. >>> +.TP >>> +.B EXDEV >>> +Target filesystem doesn't support cross-filesystem copies. >> >> I'm curious. What are some examples of filesystems that produce this >> error? > > btrfs and xfs (and probably most local filesystems) don't support cross-fs > copies. > > --D > >> >>> +.SH VERSIONS >>> +The >>> +.BR copy_file_range () >>> +system call first appeared in Linux 4.4. >>> +.SH CONFORMING TO >>> +The >>> +.BR copy_file_range () >>> +system call is a nonstandard Linux extension. >>> +.SH EXAMPLE >>> +.nf >>> + >>> +#define _GNU_SOURCE >>> +#include <fcntl.h> >>> +#include <linux/copy.h> >>> +#include <stdio.h> >>> +#include <stdlib.h> >>> +#include <sys/stat.h> >>> +#include <sys/syscall.h> >>> +#include <unistd.h> >>> + >>> + >>> +int main(int argc, char **argv) >>> +{ >>> + int fd_in, fd_out; >>> + struct stat stat; >>> + loff_t len, ret; >>> + >>> + if (argc != 3) { >>> + fprintf(stderr, "Usage: %s <source> <destination>\\n", argv[0]); >>> + exit(EXIT_FAILURE); >>> + } >>> + >>> + fd_in = open(argv[1], O_RDONLY); >>> + if (fd_in == -1) { >> >> Please replace all "-" in code by "\-". (This is a groff >> detail.) >> >>> + perror("open (argv[1])"); >>> + exit(EXIT_FAILURE); >>> + } >>> + >>> + if (fstat(fd_in, &stat) == -1) { >>> + perror("fstat"); >>> + exit(EXIT_FAILURE); >>> + } >>> + len = stat.st_size; >>> + >>> + fd_out = creat(argv[2], 0644); >> >> These days, I think we should discourage the use of creat(). Maybe >> better to use open() instead here? >> >>> + if (fd_out == -1) { >>> + perror("creat (argv[2])"); >>> + exit(EXIT_FAILURE); >>> + } >>> + >>> + do { >>> + ret = syscall(__NR_copy_file_range, fd_in, NULL, >>> + fd_out, NULL, len, 0); >> >> I'd rather see this as: >> >> int >> copy_file_range(int fd_in, loff_t *off_in, >> int fd_out, loff_t *off_out, size_t len, >> unsigned int flags) >> { >> return(syscall(__NR_copy_file_range, fd_in, fd_out, off_out, len, flags); >> } >> >> ... >> >> copy_file_range(fd_in, fd_out, off_out, len, flags); >> >> >>> + if (ret == -1) { >>> + perror("copy_file_range"); >>> + exit(EXIT_FAILURE); >>> + } >>> + >>> + len -= ret; >>> + } while (len > 0); >>> + >>> + close(fd_in); >>> + close(fd_out); >>> + exit(EXIT_SUCCESS); >>> +} >>> +.fi >>> +.SH SEE ALSO >>> +.BR splice (2) >>> >> >> In the next iteration of this patch, could you include a change to >> splice(2) so that copy_file_range(2) is added under SEE ALSO in >> that page. Also, are there any other pages that we should like >> to/from? (sendfile(2)?) >> >> Thanks, >> >> Michael >> >> >> -- >> Michael Kerrisk >> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/ >> Linux/UNIX System Programming Training: http://man7.org/training/ -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html