Re: [PATCH v7 3/4] ceph: support copy_file_range file operation

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



"Yan, Zheng" <ukernel@xxxxxxxxx> writes:

> On Mon, Oct 22, 2018 at 11:25 PM Luis Henriques <lhenriques@xxxxxxxx> wrote:
>>
>> Luis Henriques <lhenriques@xxxxxxxx> writes:
>> <snip>
>> > +static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off,
>> > +                                 struct file *dst_file, loff_t dst_off,
>> > +                                 size_t len, unsigned int flags)
>> > +{
>> > +     struct inode *src_inode = file_inode(src_file);
>> > +     struct inode *dst_inode = file_inode(dst_file);
>> > +     struct ceph_inode_info *src_ci = ceph_inode(src_inode);
>> > +     struct ceph_inode_info *dst_ci = ceph_inode(dst_inode);
>> > +     struct ceph_cap_flush *prealloc_cf;
>> > +     struct ceph_object_locator src_oloc, dst_oloc;
>> > +     struct ceph_object_id src_oid, dst_oid;
>> > +     loff_t endoff = 0, size;
>> > +     ssize_t ret = -EIO;
>> > +     u64 src_objnum, dst_objnum, src_objoff, dst_objoff;
>> > +     u32 src_objlen, dst_objlen, object_size;
>> > +     int src_got = 0, dst_got = 0, err, dirty;
>> > +     bool do_final_copy = false;
>> > +
>> > +     if (src_inode == dst_inode)
>> > +             return -EINVAL;
>> > +     if (ceph_snap(dst_inode) != CEPH_NOSNAP)
>> > +             return -EROFS;
>> > +
>> > +     /*
>> > +      * Some of the checks below will return -EOPNOTSUPP, which will force a
>> > +      * fallback to the default VFS copy_file_range implementation.  This is
>> > +      * desirable in several cases (for ex, the 'len' is smaller than the
>> > +      * size of the objects, or in cases where that would be more
>> > +      * efficient).
>> > +      */
>> > +
>> > +     if ((src_ci->i_layout.stripe_unit != dst_ci->i_layout.stripe_unit) ||
>> > +         (src_ci->i_layout.stripe_count != dst_ci->i_layout.stripe_count) ||
>> > +         (src_ci->i_layout.object_size != dst_ci->i_layout.object_size))
>> > +             return -EOPNOTSUPP;
>> > +
>> > +     if (len < src_ci->i_layout.object_size)
>> > +             return -EOPNOTSUPP; /* no remote copy will be done */
>> > +
>> > +     prealloc_cf = ceph_alloc_cap_flush();
>> > +     if (!prealloc_cf)
>> > +             return -ENOMEM;
>> > +
>> > +     /* Start by sync'ing the source file */
>> > +     ret = file_write_and_wait_range(src_file, src_off, (src_off + len));
>> > +     if (ret < 0)
>> > +             goto out;
>> > +
>> > +     /*
>> > +      * We need FILE_WR caps for dst_ci and FILE_RD for src_ci as other
>> > +      * clients may have dirty data in their caches.  And OSDs know nothing
>> > +      * about caps, so they can't safely do the remote object copies.
>> > +      */
>> > +     err = get_rd_wr_caps(src_ci, (src_off + len), &src_got,
>> > +                          dst_ci, (dst_off + len), &dst_got);
>> > +     if (err < 0) {
>> > +             dout("get_rd_wr_caps returned %d\n", err);
>> > +             ret = -EOPNOTSUPP;
>> > +             goto out;
>> > +     }
>> > +
>> > +     ret = is_file_size_ok(src_inode, dst_inode, src_off, dst_off, len);
>> > +     if (ret < 0)
>> > +             goto out_caps;
>> > +
>> > +     size = i_size_read(dst_inode);
>> > +     endoff = dst_off + len;
>> > +
>> > +     /* Drop dst file cached pages */
>> > +     ret = invalidate_inode_pages2_range(dst_inode->i_mapping,
>> > +                                         dst_off >> PAGE_SHIFT,
>> > +                                         endoff >> PAGE_SHIFT);
>> > +     if (ret < 0) {
>> > +             dout("Failed to invalidate inode pages (%ld)\n", ret);
>> > +             ret = 0; /* XXX */
>> > +     }
>>
>> I believe I found an issue with this code: if we try to copy into a file
>> that was just written, the remote copied data will be overwritten by our
>> buffered writes once they are flushed.  When this happens we can see the
>> above message in the kernel (with a -EBUSY error).
>>
>> Here's the 3 options I see to fixes this:
>>
>> 1. Add another call to file_write_and_wait_range(), with the dst_file.
>>    I believe this souldn't introduce a huge overhead for the cases where
>>    the file isn't buffered.
>> 2. Add the call to file_write_and_wait_range() only if
>>    invalidate_inode_pages2_range fails.  We'll need to drop/get caps for
>>    this.
>> 3. All of the above.  I'm not sure we need to do both, but I guess we'll
>>    need to at least fail the operation and return the error if
>>    invalidate_inode_pages2_range fails.
>>
>> Please let me know what do you think.  And if you want me to send a
>> patch to fix this or if you rather have me sending the whole thing
>> again.
>>
>
> the direct write code calls both filemap_write_and_wait_range and
> invalidate_inode_pages2_range. I think we should do the same here
> (option 1).

Ok, great.  I'll send out a fix for that in a bit.

Cheers,
-- 
Luis



[Index of Archives]     [CEPH Users]     [Ceph Large]     [Information on CEPH]     [Linux BTRFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux