On Thu, Nov 30, 2023 at 3:37 PM Jan Kara <jack@xxxxxxx> wrote: > > On Thu 30-11-23 12:07:23, Amir Goldstein wrote: > > On Thu, Nov 30, 2023 at 10:32 AM Amir Goldstein <amir73il@xxxxxxxxx> wrote: > > > > > > On Wed, Nov 29, 2023 at 10:07 PM Amir Goldstein <amir73il@xxxxxxxxx> wrote: > > > > > > > > Christian, > > > > > > > > Josef has helped me see the light and figure out how to avoid the > > > > possible deadlock, which involves: > > > > - splice() from source file in a loop mounted fs to dest file in > > > > a host fs, where the loop image file is > > > > - fsfreeze on host fs > > > > - write to host fs in context of fanotify permission event handler > > > > (FAN_ACCESS_PERM) on the splice source file > > > > > > > > The first patch should not be changing any logic. > > > > I only build tested the ceph patch, so hoping to get an > > > > Acked-by/Tested-by from Jeff. > > > > > > > > The second patch rids us of the deadlock by not holding > > > > file_start_write() while reading from splice source file. > > > > > > > > > > OOPS, I missed another corner case: > > > The COPY_FILE_SPLICE fallback of server-side-copy in nfsd/ksmbd > > > needs to use the start-write-safe variant of do_splice_direct(), > > > because in this case src and dst can be on any two fs. > > > Expect an extra patch in v2. > > > > > > > For the interested, see server-side-copy patch below. > > Pushed to branch start-write-safe [1], but will wait with v2 until > > I get comments on v1. > > > > Thanks, > > Amir. > > > > [1] https://github.com/amir73il/linux/commits/start-write-safe > > > > Author: Amir Goldstein <amir73il@xxxxxxxxx> > > Date: Thu Nov 30 11:42:50 2023 +0200 > > > > fs: use do_splice_direct() for nfsd/ksmbd server-side-copy > > > > nfsd/ksmbd call vfs_copy_file_range() with flag COPY_FILE_SPLICE to > > perform kernel copy between two files on any two filesystems. > > > > Splicing input file, while holding file_start_write() on the output file > > which is on a different sb, posses a risk for fanotify related deadlocks. > > > > We only need to call splice_file_range() from within the context of > > ->copy_file_range() filesystem methods with file_start_write() held. > > > > To avoid the possible deadlocks, always use do_splice_direct() instead of > > splice_file_range() for the kernel copy fallback in vfs_copy_file_range() > > without holding file_start_write(). > > > > Signed-off-by: Amir Goldstein <amir73il@xxxxxxxxx> > > > > diff --git a/fs/read_write.c b/fs/read_write.c > > index 0bc99f38e623..12583e32aa6d 100644 > > --- a/fs/read_write.c > > +++ b/fs/read_write.c > > @@ -1565,11 +1565,18 @@ ssize_t vfs_copy_file_range(struct file > > *file_in, loff_t pos_in, > > * and which filesystems do not, that will allow userspace tools to > > * make consistent desicions w.r.t using copy_file_range(). > > * > > - * We also get here if caller (e.g. nfsd) requested COPY_FILE_SPLICE. > > + * We also get here if caller (e.g. nfsd) requested COPY_FILE_SPLICE > > + * for server-side-copy between any two sb. > > + * > > + * In any case, we call do_splice_direct() and not splice_file_range(), > > + * without file_start_write() held, to avoid possible deadlocks related > > + * to splicing from input file, while file_start_write() is held on > > + * the output file on a different sb. > > */ > > - ret = generic_copy_file_range(file_in, pos_in, file_out, pos_out, len, > > - flags); > > + file_end_write(file_out); > > > > + ret = do_splice_direct(file_in, &pos_in, file_out, &pos_out, > > + min_t(size_t, len, MAX_RW_COUNT), 0); > > done: > > if (ret > 0) { > > fsnotify_access(file_in); > > @@ -1581,8 +1588,6 @@ ssize_t vfs_copy_file_range(struct file > > *file_in, loff_t pos_in, > > inc_syscr(current); > > inc_syscw(current); > > > > - file_end_write(file_out); > > - > > This file_end_write() is also used by the paths using ->copy_file_range() > and ->remap_file_range() so you need to balance those... You're right! already found that bug and fixed in my branch: diff --git a/fs/read_write.c b/fs/read_write.c index 0bc99f38e623..e0c2c1b5962b 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -1421,6 +1421,10 @@ ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in, struct file *file_out, loff_t pos_out, size_t len, unsigned int flags) { + /* May only be called from within ->copy_file_range() methods */ + if (WARN_ON_ONCE(flags)) + return -EINVAL; + return splice_file_range(file_in, &pos_in, file_out, &pos_out, min_t(size_t, len, MAX_RW_COUNT)); } @@ -1541,19 +1545,22 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in, ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out, pos_out, len, flags); - goto done; - } - - if (!splice && file_in->f_op->remap_file_range && - file_inode(file_in)->i_sb == file_inode(file_out)->i_sb) { + } else if (!splice && file_in->f_op->remap_file_range && + file_inode(file_in)->i_sb == file_inode(file_out)->i_sb) { ret = file_in->f_op->remap_file_range(file_in, pos_in, file_out, pos_out, min_t(loff_t, MAX_RW_COUNT, len), REMAP_FILE_CAN_SHORTEN); - if (ret > 0) - goto done; + /* fallback to splice */ + if (ret <= 0) + splice = true; } + file_end_write(file_out); + + if (!splice) + goto done; + /* * We can get here for same sb copy of filesystems that do not implement * ->copy_file_range() in case filesystem does not support clone or in @@ -1565,11 +1572,16 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in, * and which filesystems do not, that will allow userspace tools to * make consistent desicions w.r.t using copy_file_range(). * - * We also get here if caller (e.g. nfsd) requested COPY_FILE_SPLICE. + * We also get here if caller (e.g. nfsd) requested COPY_FILE_SPLICE + * for server-side-copy between any two sb. + * + * In any case, we call do_splice_direct() and not splice_file_range(), + * without file_start_write() held, to avoid possible deadlocks related + * to splicing from input file, while file_start_write() is held on + * the output file on a different sb. */ - ret = generic_copy_file_range(file_in, pos_in, file_out, pos_out, len, - flags); - + ret = do_splice_direct(file_in, &pos_in, file_out, &pos_out, + min_t(size_t, len, MAX_RW_COUNT), 0); done: if (ret > 0) { fsnotify_access(file_in); @@ -1581,8 +1593,6 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in, inc_syscr(current); inc_syscw(current); - file_end_write(file_out); - return ret; } EXPORT_SYMBOL(vfs_copy_file_range);