Re: [PATCH v2 8/9] vfs: copy_file_range() can do a pagecache copy with splice

On 2015-09-15 12:38, Darrick J. Wong wrote:
On Tue, Sep 15, 2015 at 11:58:04AM -0400, Anna Schumaker wrote:
On 09/14/2015 11:32 PM, Darrick J. Wong wrote:
On Fri, Sep 11, 2015 at 04:30:21PM -0400, Anna Schumaker wrote:
The NFS server will need some kind of fallback for filesystems that don't
have any kind of copy acceleration, and it should be generally useful to
have an in-kernel copy to avoid lots of switches between kernel and user
space.

I make this configurable by adding two new flags.  Users who only want a
reflink can pass COPY_FR_REFLINK, and users who want a full data copy can
pass COPY_FR_COPY.  The default (flags=0) means to first attempt a
reflink, but use the pagecache if that fails.

I moved the rw_verify_area() calls into the fallback code since some
filesystems can handle reflinking a large range.

Signed-off-by: Anna Schumaker <Anna.Schumaker@xxxxxxxxxx>
---
v2:
- Rename COPY_REFLINK -> COPY_FR_REFLINK
- Introduce COPY_FR_COPY flag
- Flags == 0 is really COPY_FR_COPY|COPY_FR_REFLINK
- Drop check for invalid flags
- Move call to do_splice_direct() into a new function
- Move rw_verify_area() checks into the new fallback function
---
  fs/read_write.c           | 56 ++++++++++++++++++++++++++++-------------------
  include/linux/copy.h      |  6 +++++
  include/uapi/linux/Kbuild |  1 +
  include/uapi/linux/copy.h |  7 ++++++
  4 files changed, 48 insertions(+), 22 deletions(-)
  create mode 100644 include/linux/copy.h
  create mode 100644 include/uapi/linux/copy.h

diff --git a/fs/read_write.c b/fs/read_write.c
index 363bd3e..ba24884 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -7,6 +7,7 @@
  #include <linux/slab.h>
  #include <linux/stat.h>
  #include <linux/fcntl.h>
+#include <linux/copy.h>
  #include <linux/file.h>
  #include <linux/uio.h>
  #include <linux/fsnotify.h>
@@ -1329,6 +1330,29 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd,
  }
  #endif

+static ssize_t vfs_copy_file_pagecache(struct file *file_in, loff_t pos_in,
+				       struct file *file_out, loff_t pos_out,
+				       size_t len)
+{
+	ssize_t ret;
+
+	ret = rw_verify_area(READ, file_in, &pos_in, len);
+	if (ret >= 0) {
+		len = ret;
+		ret = rw_verify_area(WRITE, file_out, &pos_out, len);
+		if (ret >= 0)
+			len = ret;
+	}
+	if (ret < 0)
+		return ret;
+
+	file_start_write(file_out);
+	ret = do_splice_direct(file_in, &pos_in, file_out, &pos_out, len, 0);
+	file_end_write(file_out);
+
+	return ret;
+}
+
  /*
   * copy_file_range() differs from regular file read and write in that it
   * specifically allows return partial success.  When it does so is up to
@@ -1338,34 +1362,17 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
  			    struct file *file_out, loff_t pos_out,
  			    size_t len, int flags)
  {
-	struct inode *inode_in;
-	struct inode *inode_out;
  	ssize_t ret;

-	if (flags)
-		return -EINVAL;
-
-	/* copy_file_range allows full ssize_t len, ignoring MAX_RW_COUNT  */
-	ret = rw_verify_area(READ, file_in, &pos_in, len);
-	if (ret >= 0)
-		ret = rw_verify_area(WRITE, file_out, &pos_out, len);
-	if (ret < 0)
-		return ret;
+	if (flags == 0)
+		flags = COPY_FR_COPY | COPY_FR_REFLINK;

This function must return -EINVAL if any undefined flag bits are set.

Sure, I'll add that.
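
For illustration only, the requested check could look roughly like the
sketch below. It is standalone C rather than kernel code. The two flag
values are taken from the patch; COPY_FR_VALID_FLAGS and
copy_fr_check_flags() are hypothetical names that do not appear in it.

```c
#include <errno.h>

/* Flag values as defined in the patch's include/uapi/linux/copy.h. */
#define COPY_FR_COPY		(1 << 0)
#define COPY_FR_REFLINK		(1 << 1)

/* Hypothetical: mask of every flag bit that is currently defined. */
#define COPY_FR_VALID_FLAGS	(COPY_FR_COPY | COPY_FR_REFLINK)

/*
 * Hypothetical helper, not part of the patch: reject undefined flag
 * bits, then expand flags == 0 to the default of both flags.
 * Returns 0 on success or -EINVAL if an undefined bit is set.
 */
static int copy_fr_check_flags(int *flags)
{
	if (*flags & ~COPY_FR_VALID_FLAGS)
		return -EINVAL;

	if (*flags == 0)
		*flags = COPY_FR_COPY | COPY_FR_REFLINK;

	return 0;
}
```

Checking against a mask rather than against individual values means a
later flag only needs to be added to the mask to become valid.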



  	if (!(file_in->f_mode & FMODE_READ) ||
  	    !(file_out->f_mode & FMODE_WRITE) ||
  	    (file_out->f_flags & O_APPEND) ||
-	    !file_out->f_op || !file_out->f_op->copy_file_range)
+	    !file_in->f_op)
  		return -EBADF;

-	inode_in = file_inode(file_in);
-	inode_out = file_inode(file_out);
-
-	/* make sure offsets don't wrap and the input is inside i_size */
-	if (pos_in + len < pos_in || pos_out + len < pos_out ||
-	    pos_in + len > i_size_read(inode_in))
-		return -EINVAL;
-
  	if (len == 0)
  		return 0;

@@ -1373,8 +1380,13 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
  	if (ret)
  		return ret;

-	ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out, pos_out,
-					      len, flags);
+	ret = -EOPNOTSUPP;
+	if (file_out->f_op->copy_file_range)
+		ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out,
+						      pos_out, len, flags);
+	if ((ret < 0) && (flags & COPY_FR_COPY))
+		ret = vfs_copy_file_pagecache(file_in, pos_in, file_out,
+					      pos_out, len);
  	if (ret > 0) {
  		fsnotify_access(file_in);
  		add_rchar(current, ret);
diff --git a/include/linux/copy.h b/include/linux/copy.h
new file mode 100644
index 0000000..fd54543
--- /dev/null
+++ b/include/linux/copy.h
@@ -0,0 +1,6 @@
+#ifndef _LINUX_COPY_H
+#define _LINUX_COPY_H
+
+#include <uapi/linux/copy.h>
+
+#endif /* _LINUX_COPY_H */
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index 70ff1d9..d46830a 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -90,6 +90,7 @@ header-y += coda_psdev.h
  header-y += coff.h
  header-y += connector.h
  header-y += const.h
+header-y += copy.h
  header-y += cramfs_fs.h
  header-y += cuda.h
  header-y += cyclades.h
diff --git a/include/uapi/linux/copy.h b/include/uapi/linux/copy.h
new file mode 100644
index 0000000..2da59a8
--- /dev/null
+++ b/include/uapi/linux/copy.h
@@ -0,0 +1,7 @@
+#ifndef _UAPI_LINUX_COPY_H
+#define _UAPI_LINUX_COPY_H
+
+#define COPY_FR_COPY		(1 << 0)  /* Only do a pagecache copy.  */
+#define COPY_FR_REFLINK		(1 << 1)  /* Only make a reflink.       */

Could I have a COPY_FR_DEDUPE flag too, please?

(I don't mind adding it myself when I get around to hooking up XFS, but I
was hoping to get it in during the first round).

I guess I can, but only if everybody has agreed on using copy for dedupes
instead of something like fallocate.

I don't see how fallocate could even handle userspace-directed deduplication
since the dedupe operation compares two file ranges and reflinks them if the
contents match whereas fallocate only operates on a single file range.

A single file range would work if userspace was telling the kernel to start a
lengthy search for all duplicates of a particular file's range, but ...
yuck.

(Granted, the btrfs extent_same ioctl lets userspace call dedupe on multiple
files simultaneously, but that's for another day.)
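
To make the two-range shape of the operation concrete, here is a rough
userspace sketch of the comparison half of a dedupe. It is only an
illustration: ranges_match() is a hypothetical helper, it uses plain
pread() rather than any kernel interface, and the reflink step that would
follow a match is deliberately left out.

```c
#define _XOPEN_SOURCE 700
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: compare len bytes at off_a in fd_a with len bytes
 * at off_b in fd_b. Returns 1 if the ranges are identical, 0 if they
 * differ or either range runs past end of file, and -1 on a read error.
 */
static int ranges_match(int fd_a, off_t off_a, int fd_b, off_t off_b,
			size_t len)
{
	char buf_a[4096], buf_b[4096];

	while (len > 0) {
		size_t chunk = len < sizeof(buf_a) ? len : sizeof(buf_a);
		ssize_t got_a = pread(fd_a, buf_a, chunk, off_a);
		ssize_t got_b;

		if (got_a < 0)
			return -1;
		if (got_a == 0)
			return 0;	/* first range ends before len bytes */

		got_b = pread(fd_b, buf_b, (size_t)got_a, off_b);
		if (got_b < 0)
			return -1;
		/* A shorter read on the second range counts as a mismatch. */
		if (got_b != got_a || memcmp(buf_a, buf_b, (size_t)got_a) != 0)
			return 0;

		off_a += got_a;
		off_b += got_a;
		len -= (size_t)got_a;
	}

	return 1;	/* a caller would only attempt the reflink here */
}
```

Both ranges have to be named up front, which is the part a single-range
call such as fallocate has no way to express.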

Agreed, having something equivalent to reflinking would be the most sensible
interface for deduplicating two files. I can, however, see fallocate possibly
having some value when in-band deduplication is considered (i.e., use
fallocate to tell the system to include or exclude certain file ranges from
in-band deduplication).

