Re: [PATCH] ceph: allow object copies across different filesystems in the same cluster

Jeff Layton <jlayton@xxxxxxxxxx> · Sat, 07 Sep 2019 09:53:29 -0400

On Fri, 2019-09-06 at 17:26 +0100, Luis Henriques wrote:
> "Jeff Layton" <jlayton@xxxxxxxxxx> writes:
> 
> > On Fri, 2019-09-06 at 14:57 +0100, Luis Henriques wrote:
> > > OSDs are able to perform object copies across different pools.  Thus,
> > > there's no need to prevent copy_file_range from doing remote copies if the
> > > source and destination superblocks are different.  Only return -EXDEV if
> > > they have different fsid (the cluster ID).
> > > 
> > > Signed-off-by: Luis Henriques <lhenriques@xxxxxxxx>
> > > ---
> > >  fs/ceph/file.c | 23 +++++++++++++++++++----
> > >  1 file changed, 19 insertions(+), 4 deletions(-)
> > > 
> > > Hi!
> > > 
> > > I've finally managed to run some tests using multiple filesystems, both
> > > within a single cluster and also using two different clusters.  The
> > > behaviour of copy_file_range (with this patch, of course) was what I
> > > expected:
> > > 
> > >   - Object copies work fine across different filesystems within the same
> > >     cluster (even with pools in different PGs);
> > >   - -EXDEV is returned if the fsid is different
> > > 
> > > (OT: I wonder why the cluster ID is named 'fsid'; historical reasons?
> > >  Because this is actually what's in ceph.conf fsid in "[global]"
> > >  section.  Anyway...)
> > > 
> > > So, what's missing right now is (I always mention this when I have the
> > > opportunity!) to merge https://github.com/ceph/ceph/pull/25374 :-)
> > > And add the corresponding support for the new flag to the kernel
> > > client, of course.
> > > 
> > > Cheers,
> > > --
> > > Luis
> > > 
> > > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> > > index 685a03cc4b77..88d116893c2b 100644
> > > --- a/fs/ceph/file.c
> > > +++ b/fs/ceph/file.c
> > > @@ -1904,6 +1904,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > >  	struct ceph_inode_info *src_ci = ceph_inode(src_inode);
> > >  	struct ceph_inode_info *dst_ci = ceph_inode(dst_inode);
> > >  	struct ceph_cap_flush *prealloc_cf;
> > > +	struct ceph_fs_client *src_fsc = ceph_inode_to_client(src_inode);
> > >  	struct ceph_object_locator src_oloc, dst_oloc;
> > >  	struct ceph_object_id src_oid, dst_oid;
> > >  	loff_t endoff = 0, size;
> > > @@ -1915,8 +1916,22 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > >  
> > >  	if (src_inode == dst_inode)
> > >  		return -EINVAL;
> > > -	if (src_inode->i_sb != dst_inode->i_sb)
> > > -		return -EXDEV;
> > > +	if (src_inode->i_sb != dst_inode->i_sb) {
> > > +		struct ceph_fs_client *dst_fsc = ceph_inode_to_client(dst_inode);
> > > +
> > > +		if (!src_fsc->client->have_fsid || !dst_fsc->client->have_fsid) {
> > > +			dout("No fsid in a fs client\n");
> > > +			return -EXDEV;
> > > +		}
> > 
> > In what situation is there no fsid? Old cluster version?
> > 
> > If there is no fsid, can we take that to indicate that there is only a
> > single filesystem possible in the cluster and that we should attempt the
> > copy anyway?
> 
> TBH I'm not sure if 'have_fsid' can ever be 'false' in this call.  It is
> set to 'true' when handling the monmap, and it's never changed back to
> 'false'.  Since I don't think copy_file_range will be invoked *before*
> we get the monmap, it should be safe to drop this check.  Maybe it could
> be replaced by a WARN_ON()?
> 

Yeah. I think the have_fsid flag just allows us to avoid the pr_err msg
in ceph_check_fsid when the client is initially created. Maybe there is
some better way to achieve that?

In any case, I'd just drop that condition here.
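
For illustration only, the check with the have_fsid condition dropped
could look roughly like the sketch below. It is an untested userspace
mock-up, not the kernel code: struct mock_fsid assumes the fsid is 16
raw bytes, and mock_client, mock_fsid_compare() and
cross_sb_copy_check() are hypothetical names made up for this example.

```c
#include <errno.h>
#include <string.h>

/* Stand-in for the cluster fsid (assumption: 16 raw bytes). */
struct mock_fsid {
	unsigned char fsid[16];
};

/* Hypothetical stand-in for the per-mount client state. */
struct mock_client {
	struct mock_fsid fsid;
};

/* Returns 0 when both fsids are identical, non-zero otherwise. */
static int mock_fsid_compare(const struct mock_fsid *a,
			     const struct mock_fsid *b)
{
	return memcmp(a, b, sizeof(*a));
}

/*
 * Hypothetical helper for the "different superblocks" case: allow the
 * copy when both mounts belong to the same cluster, refuse with
 * -EXDEV otherwise.  There is no have_fsid test; the sketch assumes
 * the fsid is always populated by the time a copy is requested.
 */
static int cross_sb_copy_check(const struct mock_client *src,
			       const struct mock_client *dst)
{
	if (mock_fsid_compare(&src->fsid, &dst->fsid))
		return -EXDEV;
	return 0;
}
```

With two mock clients carrying the same 16 bytes the helper returns 0,
and with differing bytes it returns -EXDEV, which matches the
behaviour described in the test notes above.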
-- 
Jeff Layton <jlayton@xxxxxxxxxx>