Re: About RADOS level replication

> I think that in order to make the metadata_diff efficient, we still need
> to rely on the rstats.  For example, if you modify the file
> /a/b/c/d/e/f/g, then g's ctime will change, but you'll still have to
> traverse the entire hierarchy in order to discover that.  The
> rctime-informed search will let us efficiently find those changes.
>
> ...and if we have the rctimes, then the only real difference is whether we
> do a full readdir for the directory and look at each file's ctime and
> rctime, or whether that filtering is done on the MDS side.  It's probably
> a bit faster with the MDS's help, but it needs a special tool, while
> simply modifying rsync to use rctime would work almost as well.
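
Just to check that I understand the client-side filtering you describe,
here is a rough sketch of an rctime-pruned walk. It assumes the
ceph.dir.rctime vxattr is readable on the mount, and LAST_SYNC is just a
placeholder for the timestamp of the previous sync:

import os
import stat

def rctime(path):
    # ceph.dir.rctime comes back as a "<seconds>.<nanoseconds>" string;
    # the leading seconds field is enough for this comparison.
    raw = os.getxattr(path, "ceph.dir.rctime").decode()
    return float(raw.split(".")[0])

def changed_files(root, last_sync):
    """Yield regular files under root whose ctime is newer than last_sync,
    pruning whole subtrees whose recursive ctime has not moved."""
    for entry in os.scandir(root):
        st = entry.stat(follow_symlinks=False)
        if entry.is_dir(follow_symlinks=False):
            if rctime(entry.path) > last_sync:   # something below changed
                yield from changed_files(entry.path, last_sync)
        elif stat.S_ISREG(st.st_mode) and st.st_ctime > last_sync:
            yield entry.path

# for path in changed_files("/mnt/cephfs", LAST_SYNC):
#     hand the path to rsync (or copy the file directly)
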
>
>> For the import-diff part, I think we can go this way: first, apply
>> the diffs of files in the target subtree, and then do a "setattr" on
>> all the files and directories that have been modified between the two
>> snapshots to make their metadata exactly the same as their
>> counterparts on the source filesystem.
>
> rsync does this with the appropriate options.
>
> It seems like the weak link in all of this is making sure the rctimes are
> coherent/correct.  We could have a fs-wide sync-like operation that
> flushes all of the rstats down to the root or something.
>
> Am I missing something?

Hi, Sage. I think I get your point. I guess the reason that rstats are
updated lazily is that eagerly updating every parent along the branch
containing the modified file would be too expensive, since it means
every write under the subtree root would lead to an update of the root
inode. Is this right? If so, I think maybe we can overcome this problem
in the following way: say we are calculating a snapshot diff of the
directory DIR_X between snapshots A and B. We don't have to know the
exact most recent rctime of DIR_X; all we need to know is whether there
are files/dirs in the subtree of DIR_X that were modified after
snapshot A. So maybe we can do this: say there is a file DIR_X/a/b/c/d.
If we make the first modification to this file create an old inode for
every parent along the branch, then when we do metadata_diff for DIR_X
we would see that there is an old inode of DIR_X for snapshot A, so we
would know to go into DIR_X, into its subdir "a", into "a"'s subdir
"b", and so on. Because only the first modification after a snapshot
would lead to the creation of old inodes along the branch, the overhead
should be tolerable.
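
To make the traversal I have in mind concrete, here is a rough
client-side sketch. subtree_changed_since() is only a stand-in for the
"does this directory carry an old inode for snapshot A" check that the
MDS-side change would provide; it is not an existing API:

import os

def subtree_changed_since(path, snap_a_time):
    """Hypothetical hook: would answer "does `path` carry an old inode
    for snapshot A?", i.e. was anything below it modified afterwards."""
    raise NotImplementedError("stands in for the proposed MDS-side check")

def metadata_diff(root, snap_a_time, changed):
    """Collect entries under `root` modified after snapshot A, pruning
    whole subtrees the old-inode check says were never touched."""
    if not subtree_changed_since(root, snap_a_time):
        return                                 # no old inode: skip the subtree
    for entry in os.scandir(root):
        st = entry.stat(follow_symlinks=False)
        if entry.is_dir(follow_symlinks=False):
            metadata_diff(entry.path, snap_a_time, changed)
        elif st.st_ctime > snap_a_time:
            changed.append(entry.path)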

I don't know whether I'm making myself clear or whether I'm thinking
about this the right way. If I am, I can get down to implementing a
prototype for it.

Thank you:-)

>
>> For the metadata_diff, I've already tried to implement a prototype:
>> https://github.com/xxhdx1985126/ceph/commit/d55ea8a23738268e19e7dd6a43bb1a89929e9d22
>> Please take a look if any of you guys have the time:-)
>
> The change looks reasonable, but I think it's an optimization, and
> much less important than avoiding a traversal of the entire namespace...
>
> sage
>
>
>> If this approach sounds reasonable to you, can I add a feature to the
>> Ceph issue tracker and go on implementing it? Thank you:-)
>>
>> On 12 May 2018 at 11:34, Sage Weil <sweil@xxxxxxxxxx> wrote:
>> > On Fri, 11 May 2018, Xuehan Xu wrote:
>> >> > Given the description here, I think it would be much easier and more
>> >> > efficient to implement sync at the CephFS level. The simplest option
>> >> > is a "smart rsync" that is aware of CephFS' recursive statistics. You
>> >> > then start at the root, look at each file or folder to see if its
>> >> > (recursive) timestamp is newer than your last sync[1], and if it is,
>> >> > you check out the children. Do an rsync on each individual file, and
>> >> > profit!
>> >> >
>> >> > Now you have some choices to make on the consistency model — as Sage
>> >> > says, you may want to take snapshots and do this work on a snapshot so
>> >> > files are all consistent with each other, though that does impose more
>> >> > work to take and clean up snapshots. Or you may just do it on a
>> >> > file-by-file basis, in which case it's even easier to just do this
>> >> > from a normal client. (Which is not to say you have to; you could also
>> >> > build it inside the MDS or in some special privileged daemon or
>> >> > something.)
>> >> >
>> >> > I find an approach like this attractive for a few reasons.
>> >> > First of all, it's less invasive to the core code, which means it
>> >> > interacts with fewer parts of the system, is easier to maintain in
>> >> > itself, and doesn't impose costs on maintaining other parts of Ceph.
>> >> > Second, CephFS is unlike RADOS in that it has a centralized,
>> >> > consistent set of metadata for tracking all its data. We can take
>> >> > advantage of that metadata (or *add* to it!) to make the job of
>> >> > synchronizing easier. In RADOS, we are stuck with running algorithms
>> >> > that parallelize, and being very careful to minimize the central
>> >> > coordination points. That's good for scale, but very bad for ease of
>> >> > understanding and development.
>> >> > -Greg
>> >> > [1]: This is *slightly* trickier than I make it sound to get right, as
>> >> > the rstats are updated *lazily*. So you may run a sync at 16:00:00 and
>> >> > miss a file down the tree that was updated at 15:59:10 because that
>> >> > change hasn't gotten all the way to the root. You'd probably want a
>> >> > database associated with the sync daemon that tracks the timestamps
>> >> > you saw at the previous sync. If you wanted to build this into the MDS
>> >> > system with its own metadata you could look at how forward scrub works
>> >> > for a model.
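
Regarding footnote [1], what I picture is roughly the following:
remember the rctime we saw for each directory at the previous run, and
descend wherever it has moved, instead of comparing against the
wall-clock time of the last sync. This is only a sketch; the
ceph.dir.rctime vxattr is real, but the state file location and layout
are made up for illustration:

import json
import os

STATE_FILE = "/var/lib/cephfs-sync/last_rctimes.json"   # made-up location

def load_state():
    try:
        with open(STATE_FILE) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}

def save_state(state):
    os.makedirs(os.path.dirname(STATE_FILE), exist_ok=True)
    with open(STATE_FILE, "w") as f:
        json.dump(state, f)

def dirs_to_revisit(root, previous):
    """Return directories whose ceph.dir.rctime moved since the previous
    run, plus the rctimes to remember for next time."""
    changed, current = [], {}
    for dirpath, dirnames, _ in os.walk(root):
        rctime = os.getxattr(dirpath, "ceph.dir.rctime").decode()
        current[dirpath] = rctime
        if previous.get(dirpath) == rctime:
            dirnames[:] = []          # unchanged subtree: don't descend
            prefix = dirpath.rstrip("/") + "/"
            # nothing below moved either, so carry last run's records forward
            current.update({p: t for p, t in previous.items()
                            if p.startswith(prefix)})
        else:
            changed.append(dirpath)
    return changed, current

# previous = load_state()
# changed, current = dirs_to_revisit("/mnt/cephfs", previous)
# ... look at the files inside the directories in `changed` ...
# save_state(current)
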
>> >>
>> >> I think this "smart rsync" should be an appropriate way to meet our
>> >> current need. And I think maybe we can reuse the snapshot mechanism
>> >> in this "smart sync": when we find that some files have been
>> >> modified, we make snapshots only for the files that are going to be
>> >> copied, and apply the diffs between snapshots to the other clusters.
>> >> In this way, I think we should be able to save the bandwidth
>> >> (network and disk) used for copying unmodified areas of files. Is
>> >> this right?
>> >
>> > Yes, in that if you teach rsync about cephfs recursive stats, you can just
>> > as easily run it on a snapshot as on the live data.  One other comment,
>> > too: as Greg mentioned the cephfs rstats are somewhat lazily propagated to
>> > the client.  However, I can also imagine, if we go down this path, that we
>> > might work out a way for the client to request the latest, consistent
>> > rstats from the MDS.  (I think we may already have this for the snapshot
>> > view, or be quite close to it--Zheng would know best as he's been
>> > actively working on this code.)
>> >
>> > In any case, I think it would be awesome to try to modify normal rsync
>> > in a way that makes it notice it is on a CephFS mount, and if so, use the
>> > rctime to avoid checking subdirectories that have not changed... either
>> > via a special flag, or automatically based on the fs type returned by
>> > statfs.
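
For detecting the CephFS mount from a script, one simple probe is to
ask for the rctime vxattr and fall back when it isn't there. (The
fs-type check via statfs would need the raw statfs(2) call; Python's
os.statvfs doesn't expose f_type.) A minimal sketch:

import errno
import os

def cephfs_rctime(path):
    """Return the directory's recursive ctime (seconds) if `path` is on
    CephFS, or None when the vxattr isn't there (not a CephFS mount)."""
    try:
        raw = os.getxattr(path, "ceph.dir.rctime").decode()
    except OSError as e:
        if e.errno in (errno.ENODATA, errno.ENOTSUP):
            return None               # not CephFS, or vxattrs unsupported
        raise
    return float(raw.split(".")[0])
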
>> >
>> >> Finally, I think, although we don't have an urgent need for an "ops
>> >> replication" mechanism in CephFS for now, we should take precautions.
>> >> So basically, I think maybe we can implement the final "top-level"
>> >> cephfs replication in three steps: first, we implement a "smart
>> >> rsync"; then a replication mechanism with point-in-time consistency at
>> >> file level; and finally, when we have all the manpower and the
>> >> resources, a replication mechanism with system-wide point-in-time
>> >> consistency. Does this sound reasonable to you? Thanks.
>> >
>> > I'm not immediately sure what steps 2 and 3 would look like from this
>> > description, but agree that rsync should be step 1.
>> >
>> > I have one other option, though, that may or may not be what you had
>> > in mind.  Let's call it the "snapmirror" mode, since it's roughly what
>> > the NetApp feature with that name is doing:
>> >
>> > 1- Automatically take snapshot S of the source filesystem.
>> >
>> > 2- Do an efficient traversal of the new snapshot S vs the previous one
>> > (S-1) and send the diff to the remote site.  Once the whole diff is
>> > transferred, apply it to the remote fs.  (Or, on the remote site, apply
>> > the changes directly.  If we fail to do the whole thing, roll back to
>> > snapshot S-1).
>> >
>> > 3- take snapshot S on remote fs
>> >
>> > 4- remove snapshot S-1 on local and remote fs
>> >
>> > 5- wait N minutes, then repeat
>> >
>> > I think the key here, in the end, is to make the setup/configuration of
>> > this sort of thing easy to understand, and to provide an easy, transparent
>> > view of the current sync status.
>> >
>> > sage
>>
>>
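
For the "snapmirror" mode, the loop I picture from the steps above is
roughly the following. Snapshots are taken and removed through the
.snap directory; sync_diff() is a placeholder for whatever
diff/transfer mechanism we end up with, and the paths and period are
made up:

import os
import time

SRC = "/mnt/cephfs-src"                # made-up paths
DST = "/mnt/cephfs-dst"
PERIOD = 15 * 60                       # the "wait N minutes" from step 5

def take_snap(root, name):
    os.mkdir(os.path.join(root, ".snap", name))

def drop_snap(root, name):
    os.rmdir(os.path.join(root, ".snap", name))

def sync_diff(src_root, prev_snap, cur_snap, dst_root):
    """Placeholder: ship whatever changed between the two source snapshots
    to dst_root, rolling back to prev_snap on the remote if it fails."""
    raise NotImplementedError

def snapmirror_loop():
    seq, prev = 0, None
    while True:
        seq += 1
        cur = "mirror-%d" % seq
        take_snap(SRC, cur)                  # 1- snapshot the source
        sync_diff(SRC, prev, cur, DST)       # 2- transfer and apply the diff
        take_snap(DST, cur)                  # 3- matching snapshot on the remote
        if prev is not None:                 # 4- retire the previous snapshots
            drop_snap(SRC, prev)
            drop_snap(DST, prev)
        prev = cur
        time.sleep(PERIOD)                   # 5- wait, then repeat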



