On Tue, Feb 16, 2021 at 11:15 PM Steve French <smfrench@xxxxxxxxx> wrote: > > On Tue, Feb 16, 2021 at 1:40 PM Amir Goldstein <amir73il@xxxxxxxxx> wrote: > > > > On Tue, Feb 16, 2021 at 9:31 PM Steve French <smfrench@xxxxxxxxx> wrote: > > > > > > On Tue, Feb 16, 2021 at 1:29 PM Anna Schumaker > > > <anna.schumaker@xxxxxxxxxx> wrote: > > > > > > > > On Tue, Feb 16, 2021 at 2:22 PM Amir Goldstein <amir73il@xxxxxxxxx> wrote: > > > > > > > > > > On Tue, Feb 16, 2021 at 8:54 PM Luis Henriques <lhenriques@xxxxxxx> wrote: > > > > > > > > > > > > Amir Goldstein <amir73il@xxxxxxxxx> writes: > > > > > > > > > > > > > On Tue, Feb 16, 2021 at 6:41 PM Luis Henriques <lhenriques@xxxxxxx> wrote: > > > > > > >> > > > > > > >> Amir Goldstein <amir73il@xxxxxxxxx> writes: > > > > > > >> > > > > > > >> >> Ugh. And I guess overlayfs may have a similar problem. > > > > > > >> > > > > > > > >> > Not exactly. > > > > > > >> > Generally speaking, overlayfs should call vfs_copy_file_range() > > > > > > >> > with the flags it got from layer above, so if called from nfsd it > > > > > > >> > will allow cross fs copy and when called from syscall it won't. > > > > > > >> > > > > > > > >> > There are some corner cases where overlayfs could benefit from > > > > > > >> > COPY_FILE_SPLICE (e.g. copy from lower file to upper file), but > > > > > > >> > let's leave those for now. Just leave overlayfs code as is. > > > > > > >> > > > > > > >> Got it, thanks for clarifying. > > > > > > >> > > > > > > >> >> > This is easy to solve with a flag COPY_FILE_SPLICE (or something) that > > > > > > >> >> > is internal to kernel users. > > > > > > >> >> > > > > > > > >> >> > FWIW, you may want to look at the loop in ovl_copy_up_data() > > > > > > >> >> > for improvements to nfsd_copy_file_range(). > > > > > > >> >> > > > > > > > >> >> > We can move the check out to copy_file_range syscall: > > > > > > >> >> > > > > > > > >> >> > if (flags != 0) > > > > > > >> >> > return -EINVAL; > > > > > > >> >> > > > > > > > >> >> > Leave the fallback from all filesystems and check for the > > > > > > >> >> > COPY_FILE_SPLICE flag inside generic_copy_file_range(). > > > > > > >> >> > > > > > > >> >> Ok, the diff bellow is just to make sure I understood your suggestion. > > > > > > >> >> > > > > > > >> >> The patch will also need to: > > > > > > >> >> > > > > > > >> >> - change nfs and overlayfs calls to vfs_copy_file_range() so that they > > > > > > >> >> use the new flag. > > > > > > >> >> > > > > > > >> >> - check flags in generic_copy_file_checks() to make sure only valid flags > > > > > > >> >> are used (COPY_FILE_SPLICE at the moment). > > > > > > >> >> > > > > > > >> >> Also, where should this flag be defined? include/uapi/linux/fs.h? > > > > > > >> > > > > > > > >> > Grep for REMAP_FILE_ > > > > > > >> > Same header file, same Documentation rst file. > > > > > > >> > > > > > > > >> >> > > > > > > >> >> Cheers, > > > > > > >> >> -- > > > > > > >> >> Luis > > > > > > >> >> > > > > > > >> >> diff --git a/fs/read_write.c b/fs/read_write.c > > > > > > >> >> index 75f764b43418..341d315d2a96 100644 > > > > > > >> >> --- a/fs/read_write.c > > > > > > >> >> +++ b/fs/read_write.c > > > > > > >> >> @@ -1383,6 +1383,13 @@ ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in, > > > > > > >> >> struct file *file_out, loff_t pos_out, > > > > > > >> >> size_t len, unsigned int flags) > > > > > > >> >> { > > > > > > >> >> + if (!(flags & COPY_FILE_SPLICE)) { > > > > > > >> >> + if (!file_out->f_op->copy_file_range) > > > > > > >> >> + return -EOPNOTSUPP; > > > > > > >> >> + else if (file_out->f_op->copy_file_range != > > > > > > >> >> + file_in->f_op->copy_file_range) > > > > > > >> >> + return -EXDEV; > > > > > > >> >> + } > > > > > > >> > > > > > > > >> > That looks strange, because you are duplicating the logic in > > > > > > >> > do_copy_file_range(). Maybe better: > > > > > > >> > > > > > > > >> > if (WARN_ON_ONCE(flags & ~COPY_FILE_SPLICE)) > > > > > > >> > return -EINVAL; > > > > > > >> > if (flags & COPY_FILE_SPLICE) > > > > > > >> > return do_splice_direct(file_in, &pos_in, file_out, &pos_out, > > > > > > >> > len > MAX_RW_COUNT ? MAX_RW_COUNT : len, 0); > > > > > > >> > > > > > > >> My initial reasoning for duplicating the logic in do_copy_file_range() was > > > > > > >> to allow the generic_copy_file_range() callers to be left unmodified and > > > > > > >> allow the filesystems to default to this implementation. > > > > > > >> > > > > > > >> With this change, I guess that the calls to generic_copy_file_range() from > > > > > > >> the different filesystems can be dropped, as in my initial patch, as they > > > > > > >> will always get -EINVAL. The other option would be to set the > > > > > > >> COPY_FILE_SPLICE flag in those calls, but that would get us back to the > > > > > > >> problem we're trying to solve. > > > > > > > > > > > > > > I don't understand the problem. > > > > > > > > > > > > > > What exactly is wrong with the code I suggested? > > > > > > > Why should any filesystem be changed? > > > > > > > > > > > > > > Maybe I am missing something. > > > > > > > > > > > > Ok, I have to do a full brain reboot and start all over. > > > > > > > > > > > > Before that, I picked the code you suggested and tested it. I've mounted > > > > > > a cephfs filesystem and used xfs_io to execute a 'copy_range' command > > > > > > using /sys/kernel/debug/sched_features as source. The result was a > > > > > > 0-sized file in cephfs. And the reason is thevfs_copy_file_range() > > > > > > early exit in: > > > > > > > > > > > > if (len == 0) > > > > > > return 0; > > > > > > > > > > > > 'len' is set in generic_copy_file_checks(). > > > > > > > > > > Good point.. I guess we will need to do all the checks earlier in > > > > > generic_copy_file_checks() including the logic of: > > > > > > > > > > if (file_in->f_op->remap_file_range && > > > > > file_inode(file_in)->i_sb == file_inode(file_out)->i_sb) > > > > > > > > > > > > > > > > > > > > > > This means that we're not solving the original problem anymore (probably > > > > > > since v1 of this patch, haven't checked). > > > > > > > > > > > > Also, re-reading Trond's emails, I read: "... also disallowing the copy > > > > > > from, say, an XFS formatted partition to an ext4 partition". Isn't that > > > > > > *exactly* what we're trying to do here? I.e. _prevent_ these copies from > > > > > > happening so that tracefs files can't be CFR'ed? > > > > > > > > > > > > > > > > We want to address the report which means calls coming from > > > > > copy_file_range() syscall. > > > > > > > > > > Trond's use case is vfs_copy_file_range() coming from nfsd. > > > > > When he writes about copy from XFS to ext4, he means an > > > > > NFS client is issuing server side copy (on same or different NFS mounts) > > > > > and the NFS server is executing nfsd_copy_file_range() on a source > > > > > file that happens to be on XFS and destination happens to be on ext4. > > > > > > > > NFS also supports a server-to-server copy where the destination server > > > > mounts the source server and reads the data to be copied. Please don't > > > > break that either :) > > > > > > > As long as the copy is via nfsd_copy_file_range() and not from the syscall > > it should not regress. > > > > > This is a case we will eventually need to support for cifs (SMB3) as well. > > > > > > > samba already does server side copy very well without needing any support > > from the kernel. > > > > nfsd also doesn't *need* to use vfs_copy_file_range() it can use kernel APIs > > like the loop in ovl_copy_up_data(). But it does, so we should not regress it. > > > > samba/nfsd can try to use copy_file_range() and it will work if the > > source/target > > fs support it. Otherwise, the server can perfectly well do the copy via other > > available interfaces, just like userspace copy tools. > > I was thinking about cifsd ("ksmbd") the kernel server from > Namjae/Sergey etc. which is making excellent progress. > You are missing my point. Never mind which server. The server does not *need* to rely on vfs_copy_file_range() to copy files from XFS to ext4. The server is very capable of implementing the fallback generic copy in case source/target fs do not support native {copy,remap}_file_range(). w.r.t semantics of copy_file_range() syscall vs. the fallback to userespace 'cp' tool (check source file size before copy or not), please note that the semantics of CIFS_IOC_COPYCHUNK_FILE are that of the former: rc = cifs_file_copychunk_range(xid, src_file.file, 0, dst_file, 0, src_inode->i_size, 0); It will copy zero bytes if advertised source file size if zero. NFS server side copy semantics are currently de-facto the same because both the client and the server will have to pass through this line in vfs_copy_file_range(): if (len == 0) return 0; IMO, and this opinion was voiced by several other filesystem developers, the shortend copy semantics are the correct semantics for copy_file_range() syscall as well as for vfs_copy_file_range() for internal kernel users. I guess what this means is that if the 'cp' tool ever tries an opportunistic copy_file_range() syscall (e.g. --cfr=auto), it may result in zero size copy. Thanks, Amir.