Re: [PATCH v13] vfs: fix copy_file_range regression in cross-fs copies

Luís Henriques <lhenriques@xxxxxxx> · Fri, 20 May 2022 11:33:00 +0100

Amir Goldstein <amir73il@xxxxxxxxx> writes:

> From: Luis Henriques <lhenriques@xxxxxxx>
>
> A regression has been reported by Nicolas Boichat, found while using the
> copy_file_range syscall to copy a tracefs file.  Before commit
> 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") the
> kernel would return -EXDEV to userspace when trying to copy a file across
> different filesystems.  After this commit, the syscall doesn't fail anymore
> and instead returns zero (zero bytes copied), as this file's content is
> generated on-the-fly and thus reports a size of zero.
>
> Another regression has been reported by He Zhe and Paul Gortmaker -
> WARN_ON_ONCE(ret == -EOPNOTSUPP) can be triggered from userspace when
> copying from a sysfs file whose read operation may return -EOPNOTSUPP:
>
>   xfs_io -f -c "copy_range /sys/class/net/lo/phys_switch_id" /tmp/foo
>
> Since we do not have test coverage for copy_file_range() between any
> two types of filesystems, the best way to avoid with this sort of issues
> in the future is for the kernel to be more picky about filesystems that
> are allowed to do copy_file_range().
>
> This patch restores some cross-filesystem copy restrictions that existed
> prior to commit 5dae222a5ff0 ("vfs: allow copy_file_range to copy across
> devices").
>
> Filesystems that implement copy_file_range() or remap_file_range() may
> still fall-back to the generic_copy_file_range() implementation when the
> filesystem cannot handle the copy, so kernel can maintain a consistent
> story about which filesystems support copy_file_range().
> That will allow userspace tools to make consistent desicions w.r.t using
> copy_file_range().
>
> nfsd is also modified to fall-back into generic_copy_file_range() in case
> vfs_copy_file_range() fails with -EOPNOTSUPP or -EXDEV.
>
> Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices")
> Link: https://lore.kernel.org/linux-fsdevel/20210212044405.4120619-1-drinkcat@xxxxxxxxxxxx/
> Link: https://lore.kernel.org/linux-fsdevel/CANMq1KDZuxir2LM5jOTm0xx+BnvW=ZmpsG47CyHFJwnw7zSX6Q@xxxxxxxxxxxxxx/
> Link: https://lore.kernel.org/linux-fsdevel/20210126135012.1.If45b7cdc3ff707bc1efa17f5366057d60603c45f@changeid/
> Link: https://lore.kernel.org/linux-fsdevel/20210630161320.29006-1-lhenriques@xxxxxxx/
> Reported-by: Nicolas Boichat <drinkcat@xxxxxxxxxxxx>
> Reported-by: kernel test robot <oliver.sang@xxxxxxxxx>
> Signed-off-by: Luis Henriques <lhenriques@xxxxxxx>
> Fixes: 64bf5ff58dff ("vfs: no fallback for ->copy_file_range")
> Link: https://lore.kernel.org/linux-fsdevel/20f17f64-88cb-4e80-07c1-85cb96c83619@xxxxxxxxxxxxx/
> Reported-by: He Zhe <zhe.he@xxxxxxxxxxxxx>
> Reported-by: Paul Gortmaker <paul.gortmaker@xxxxxxxxxxxxx>
> Signed-off-by: Amir Goldstein <amir73il@xxxxxxxxx>
> ---
>
> Al,
>
> Please take this patch to fix several issues that started to pop up
> as userspace tools started using copy_file_range(), which could only
> get more popular in the future.
>
> I've decides to pick this up from Luis' v12 last year [1] following
> some recent bug report. Although Luis' patch was reviewed by me back
> then and tested by Olga on nfs, I tested it now and found that it
> regressed fstests on xfs,btrfs, so decided on a slightly differnt logic.
>
> The rationale behind my change is that given src fs and dst fs, the
> outcomes of EOPNOTSUPP and EXDEV are consistent, while keeping the
> fs that may return success from this syscall to a minimum.
>
> I've tested this with the -g copy_range tests from fstests on
> ext4,xfs,btrfs,nfs,overlayfs (over xfs).
>
> For ext4, all tests are skipping as expected.
> For xfs,btrfs (support clone) the cross-fs copy test (genric/565)
> is skipped.
> For nfs,overlayfs the tests pass including the cross-fs copy test.

Thanks a lot for taking care of this, Amir.  I've also tested it with ceph
and didn't saw anything breaking (well... there was a bug in a
ceph-specific test, but that's unrelated).  So, FWIW:

Reviewed-by: Luís Henriques <lhenriques@xxxxxxx>
Tested-by: Luís Henriques <lhenriques@xxxxxxx>

Cheers
-- 
Luís

>
> Thanks,
> Amir.
>
> [1] https://lore.kernel.org/linux-fsdevel/CAOQ4uxh6PegaOtMXQ9WmU=05bhQfYTeweGjFWR7T+XVAbuR09A@xxxxxxxxxxxxxx/
>
>  fs/nfsd/vfs.c   |  8 ++++-
>  fs/read_write.c | 79 +++++++++++++++++++++++++++----------------------
>  2 files changed, 50 insertions(+), 37 deletions(-)
>
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index c22ad0532e8e..74f7fef48504 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -577,6 +577,7 @@ __be32 nfsd4_clone_file_range(struct svc_rqst *rqstp,
>  ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos, struct file *dst,
>  			     u64 dst_pos, u64 count)
>  {
> +	ssize_t ret;
>  
>  	/*
>  	 * Limit copy to 4MB to prevent indefinitely blocking an nfsd
> @@ -587,7 +588,12 @@ ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos, struct file *dst,
>  	 * limit like this and pipeline multiple COPY requests.
>  	 */
>  	count = min_t(u64, count, 1 << 22);
> -	return vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0);
> +	ret = vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0);
> +
> +	if (ret == -EOPNOTSUPP || ret == -EXDEV)
> +		ret = generic_copy_file_range(src, src_pos, dst, dst_pos,
> +					      count, 0);
> +	return ret;
>  }
>  
>  __be32 nfsd4_vfs_fallocate(struct svc_rqst *rqstp, struct svc_fh *fhp,
> diff --git a/fs/read_write.c b/fs/read_write.c
> index e643aec2b0ef..a8f12e91762f 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1381,28 +1381,6 @@ ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
>  }
>  EXPORT_SYMBOL(generic_copy_file_range);
>  
> -static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
> -				  struct file *file_out, loff_t pos_out,
> -				  size_t len, unsigned int flags)
> -{
> -	/*
> -	 * Although we now allow filesystems to handle cross sb copy, passing
> -	 * a file of the wrong filesystem type to filesystem driver can result
> -	 * in an attempt to dereference the wrong type of ->private_data, so
> -	 * avoid doing that until we really have a good reason.  NFS defines
> -	 * several different file_system_type structures, but they all end up
> -	 * using the same ->copy_file_range() function pointer.
> -	 */
> -	if (file_out->f_op->copy_file_range &&
> -	    file_out->f_op->copy_file_range == file_in->f_op->copy_file_range)
> -		return file_out->f_op->copy_file_range(file_in, pos_in,
> -						       file_out, pos_out,
> -						       len, flags);
> -
> -	return generic_copy_file_range(file_in, pos_in, file_out, pos_out, len,
> -				       flags);
> -}
> -
>  /*
>   * Performs necessary checks before doing a file copy
>   *
> @@ -1424,6 +1402,25 @@ static int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
>  	if (ret)
>  		return ret;
>  
> +	/*
> +	 * Although we now allow filesystems to handle cross sb copy, passing
> +	 * a file of the wrong filesystem type to filesystem driver can result
> +	 * in an attempt to dereference the wrong type of ->private_data, so
> +	 * avoid doing that until we really have a good reason.  NFS defines
> +	 * several different file_system_type structures, but they all end up
> +	 * using the same ->copy_file_range() function pointer.
> +	 */
> +	if (file_out->f_op->copy_file_range) {
> +		if (file_in->f_op->copy_file_range !=
> +		    file_out->f_op->copy_file_range)
> +			return -EXDEV;
> +	} else if (file_in->f_op->remap_file_range) {
> +		if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
> +			return -EXDEV;
> +	} else {
> +                return -EOPNOTSUPP;
> +	}
> +
>  	/* Don't touch certain kinds of inodes */
>  	if (IS_IMMUTABLE(inode_out))
>  		return -EPERM;
> @@ -1489,26 +1486,36 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
>  	file_start_write(file_out);
>  
>  	/*
> -	 * Try cloning first, this is supported by more file systems, and
> -	 * more efficient if both clone and copy are supported (e.g. NFS).
> +	 * Cloning is supported by more file systems, so we implement copy on
> +	 * same sb using clone, but for filesystems were both clone and copy
> +	 * are supported (e.g. NFS), we only call the copy method.
>  	 */
> -	if (file_in->f_op->remap_file_range &&
> -	    file_inode(file_in)->i_sb == file_inode(file_out)->i_sb) {
> -		loff_t cloned;
> -
> -		cloned = file_in->f_op->remap_file_range(file_in, pos_in,
> +	if (file_out->f_op->copy_file_range) {
> +		ret = file_out->f_op->copy_file_range(file_in, pos_in,
> +						      file_out, pos_out,
> +						      len, flags);
> +	} else if (file_in->f_op->remap_file_range &&
> +		   file_inode(file_in)->i_sb == file_inode(file_out)->i_sb) {
> +		ret = file_in->f_op->remap_file_range(file_in, pos_in,
>  				file_out, pos_out,
>  				min_t(loff_t, MAX_RW_COUNT, len),
>  				REMAP_FILE_CAN_SHORTEN);
> -		if (cloned > 0) {
> -			ret = cloned;
> -			goto done;
> -		}
>  	}
>  
> -	ret = do_copy_file_range(file_in, pos_in, file_out, pos_out, len,
> -				flags);
> -	WARN_ON_ONCE(ret == -EOPNOTSUPP);
> +	if (ret > 0)
> +		goto done;
> +
> +	/*
> +	 * We can get here if filesystem supports clone but rejected the clone
> +	 * request (e.g. because it was not block aligned).
> +	 * In that case, fall back to kernel copy so we are able to maintain a
> +	 * consistent story about which filesystems support copy_file_range()
> +	 * and which filesystems do not, that will allow userspace tools to
> +	 * make consistent desicions w.r.t using copy_file_range().
> +	 */
> +	ret = generic_copy_file_range(file_in, pos_in, file_out, pos_out, len,
> +				      flags);
> +
>  done:
>  	if (ret > 0) {
>  		fsnotify_access(file_in);
> -- 
>
> 2.25.1
>