Re: About RADOS level replication

Hi, Sage. I think this "snapmirror" approach may be the better one, since
it doesn't involve the "rstat" lazy-update problem and it wouldn't copy
unmodified files.

As you said, to go this way, we have to do a recursive snapshot-diff
calculation down the subtree that we are replicating, and send the diff
to the remote site. I think we could take an approach similar to rbd's
"export-diff/import-diff" mechanism, so that the whole replication
process is split into two sub-processes: an "export-diff" of the subtree
on the source filesystem and an "import-diff" on the remote filesystem.
I also think we can split the snapshot-diff calculation into two parts,
a metadata snapshot diff and a data snapshot diff, and correspondingly
implement two new API methods at the filesystem layer: metadata_diff
and data_diff.
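
To make the intent concrete, here is a rough sketch (in Python, purely
illustrative) of what the two calls could look like from a client's point
of view; the names, signatures and return types below are just my
assumptions, not existing CephFS interfaces:

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class DentryChange:
        name: str      # entry name inside the directory
        is_dir: bool   # True if the changed entry is a subdirectory

    @dataclass
    class Extent:
        offset: int    # byte offset of a modified region
        length: int    # length of the modified region

    def metadata_diff(dir_path: str, snap_from: str, snap_to: str) -> List[DentryChange]:
        """One-directory diff: entries of dir_path modified between the two snapshots."""
        raise NotImplementedError("would be answered by the MDS, e.g. a diff-aware readdir")

    def data_diff(file_path: str, snap_from: str, snap_to: str) -> List[Extent]:
        """Per-file data diff: modified extents, in the spirit of rbd export-diff."""
        raise NotImplementedError("would be computed from the file's RADOS objects")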

To implement metadata_diff, I think we can add some code to the
"handle_client_readdir" method to make it capable of returning only
those entries that have been modified between two snapshots. If a client
wants to compute a diff down a subtree, all it needs to do is issue this
diff-capable "readdir" recursively down the subtree. For the data_diff
part, I think we can simply model it on the way rbd does it. So the
whole export-diff process would look like this: first, the "snapmirror"
daemon does a one-directory metadata_diff on the target directory; then
it does a data_diff on the files that have been modified between the two
snapshots, and a one-directory metadata_diff on the subdirectories that
have been modified; and then, for each of those subdirectories, it
repeats the above two steps.
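
Below is a rough sketch of that export-diff walk (again just illustrative
Python; metadata_diff() and data_diff() are the hypothetical calls from the
sketch above, stubbed out here so the example stands on its own):

    import os
    from collections import namedtuple

    DentryChange = namedtuple("DentryChange", "name is_dir")

    def metadata_diff(dir_path, snap_from, snap_to):
        # placeholder: would ask the MDS for the entries of dir_path that
        # changed between the two snapshots (the diff-capable readdir)
        return []

    def data_diff(file_path, snap_from, snap_to):
        # placeholder: would return the modified (offset, length) extents
        return []

    def export_diff(dir_path, snap_from, snap_to):
        """Yield diff records for one directory, then recurse into changed subdirs."""
        for entry in metadata_diff(dir_path, snap_from, snap_to):
            path = os.path.join(dir_path, entry.name)
            yield ("meta", path)                      # metadata change record
            if entry.is_dir:
                # modified subdirectory: repeat the two steps one level down
                yield from export_diff(path, snap_from, snap_to)
            else:
                for extent in data_diff(path, snap_from, snap_to):
                    yield ("data", path, extent)      # modified data extent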

For the import-diff part, I think we can go this way: first, apply the
data diffs of the files in the target subtree, and then do a "setattr"
on all the files and directories that have been modified between the two
snapshots to make their metadata exactly the same as that of their
counterparts on the source filesystem.
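
Again only a sketch, assuming (this format is made up for the example) that
the diff stream is a sequence of ("data", path, offset, bytes) and
("meta", path, uid, gid, mode, atime, mtime) records:

    import os

    def apply_data(path, offset, data):
        """Write one modified extent into the target file."""
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
        try:
            os.pwrite(fd, data, offset)
        finally:
            os.close(fd)

    def apply_meta(path, uid, gid, mode, atime, mtime):
        """The "setattr" pass: make the target's attributes match the source snapshot."""
        os.chown(path, uid, gid)
        os.chmod(path, mode)
        os.utime(path, (atime, mtime))

    def import_diff(records):
        records = list(records)          # the stream may be a generator
        # first, apply the data diffs of the files in the target subtree ...
        for rec in records:
            if rec[0] == "data":
                _, path, offset, data = rec
                apply_data(path, offset, data)
        # ... and only then fix up the metadata so it matches the source side
        for rec in records:
            if rec[0] == "meta":
                _, path, uid, gid, mode, atime, mtime = rec
                apply_meta(path, uid, gid, mode, atime, mtime)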

For the metadata_diff, I've already tried to implement a prototype:
https://github.com/xxhdx1985126/ceph/commit/d55ea8a23738268e19e7dd6a43bb1a89929e9d22
Please take a look if any of you guys have the time:-)

If this approach sounds reasonable to you, can I open a feature ticket in
the Ceph issue tracker and go on to implement it? Thank you :-)

On 12 May 2018 at 11:34, Sage Weil <sweil@xxxxxxxxxx> wrote:
> On Fri, 11 May 2018, Xuehan Xu wrote:
>> > Given the description here, I think it would be much easier and more
>> > efficient to implement sync at the CephFS level. The simplest option
>> > is a "smart rsync" that is aware of CephFS' recursive statistics. You
>> > then start at the root, look at each file or folder to see if its
>> > (recursive) timestamp is newer than your last sync[1], and if it is,
>> > you check out the children. Do an rsync on each individual file, and
>> > profit!
>> >
>> > Now you have some choices to make on the consistency model — as Sage
>> > says, you may want to take snapshots and do this work on a snapshot so
>> > files are all consistent with each other, though that does impose more
>> > work to take and clean up snapshots. Or you may just do it on a
>> > file-by-file basis, in which case it's even easier to just do this
>> > from a normal client. (Which is not to say you have to; you could also
>> > build it inside the MDS or in some special privileged daemon or
>> > something.)
>> >
>> > I find an approach like this attractive for a few reasons.
>> > First of all, it's less invasive to the core code, which means it
>> > interacts with fewer parts of the system, is easier to maintain in
>> > itself, and doesn't impose costs on maintaining other parts of Ceph.
>> > Second, CephFS is unlike RADOS in that it has a centralized,
>> > consistent set of metadata for tracking all its data. We can take
>> > advantage of that metadata (or *add* to it!) to make the job of
>> > synchronizing easier. In RADOS, we are stuck with running algorithms
>> > that parallelize, and being very careful to minimize the central
>> > coordination points. That's good for scale, but very bad for ease of
>> > understanding and development.
>> > -Greg
>> > [1]: This is *slightly* trickier than I make it sound to get right, as
>> > the rstats are updated *lazily*. So you may run a sync at 16:00:00 and
>> > miss a file down the tree that was updated at 15:59:10 because that
>> > change hasn't gotten all the way to the root. You'd probably want a
>> > database associated with the sync daemon that tracks the timestamps
>> > you saw at the previous sync. If you wanted to build this into the MDS
>> > system with its own metadata you could look at how forward scrub works
>> > for a model.
>>
>> I think this "smart rsync" should be an appropriate way for our
>> current need. And I think that maybe we can reuse the snapshot
>> mechanism in this "smart sync". When we find that some files have been
>> modified, we make snapshots only for those files that are going to be
>> copied, and apply the diffs between snapshots to the other clusters.
>> In this way, I think we should be able to save the bandwidth
>> (network and disk) used for copying unmodified areas of files. Is this
>> right?
>
> Yes, in that if you teach rsync about cephfs recursive stats, you can just
> as easily run it on a snapshot as on the live data.  One other comment,
> too: as Greg mentioned the cephfs rstats are somewhat lazily propagated to
> the client.  However, I can also imagine, if we go down this path, that we
> might work out a way for the client to request the latest, consistent
> rstats from the MDS.  (I think we may already have this for the snapshot
> view, or be quite close to it--Zheng would know best as he's been
> actively working on this code.)
>
> In any case, I think it would be awesome to try to modify normal rsync
> in a way that makes it notice it is on a CephFS mount, and if so, use the
> rctime to avoid checking subdirectories that have not changed... either
> via a special flag, or automatically based on the fs type returned by
> statfs.
>
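
Just to check that I understand the rctime idea: something like the
following should already be doable from a normal client by reading the
ceph.dir.rctime virtual xattr (only a sketch, and as Greg noted the rctime
is propagated lazily, so last_sync would have to be chosen conservatively):

    import os

    def changed_paths(root, last_sync):
        """Yield paths under root changed since last_sync (a Unix timestamp),
        pruning any subtree whose recursive ctime is not newer than that."""
        rctime = int(os.getxattr(root, "ceph.dir.rctime").split(b".")[0])
        if rctime <= last_sync:
            return                                    # nothing below here changed
        for entry in os.scandir(root):
            if entry.is_dir(follow_symlinks=False):
                yield from changed_paths(entry.path, last_sync)
            elif entry.stat(follow_symlinks=False).st_ctime > last_sync:
                yield entry.path                      # candidate for a per-file rsync
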
>> Finally, I think, although we don't have an urgent need for an "ops
>> replication" mechanism in CephFS for now, we should take precautions.
>> So basically, I think maybe we can implement the final "top-level"
>> cephfs replication in three steps: first, we implement a "smart
>> rsync"; then a replication mechanism with point-in-time consistency at
>> file level; and finally, when we have all the manpower and the
>> resources, a replication mechanism with system-wide point-in-time
>> consistency. Does this sound reasonable to you? Thanks.
>
> I'm not immediately sure what steps 2 and 3 would look like from this
> description, but agree that rsync should be step 1.
>
> I have one other option, though, that may or not be what you had in mind.
> Let's call it the "snapmirror" mode, since it's roughly what the
> netapp feature with that name is doing:
>
> 1- Automatically take snapshot S of the source filesystem.
>
> 2- Do an efficient traversal of the new snapshot S vs the previous one
> (S-1) and send the diff to the remote site.  Once the whole diff is
> transferred, apply it to the remote fs.  (Or, on the remote site, apply
> the changes directly.  If we fail to do the whole thing, roll back to
> snapshot S-1).
>
> 3- take snapshot S on remote fs
>
> 4- remove snapshot S-1 on local and remote fs
>
> 5- wait N minutes, then repeat
>
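
For my own understanding, the loop would roughly be the following (only a
sketch, assuming both filesystems are mounted locally; mkdir/rmdir under a
".snap" directory is how CephFS snapshots are created and removed, and
transfer_and_apply_diff() stands in for the export-diff/import-diff
machinery discussed above):

    import os
    import time

    def snapmirror_loop(src_root, dst_root, interval_sec, transfer_and_apply_diff):
        prev, seq = None, 0
        while True:
            seq += 1
            cur = "mirror-%d" % seq
            os.mkdir(os.path.join(src_root, ".snap", cur))       # 1: snapshot S on the source fs
            try:
                # 2: send the diff S-1 -> S (a full copy when prev is None) and apply it remotely
                transfer_and_apply_diff(src_root, dst_root, prev, cur)
            except Exception:
                os.rmdir(os.path.join(src_root, ".snap", cur))   # failed: keep S-1 and retry later
                raise
            os.mkdir(os.path.join(dst_root, ".snap", cur))       # 3: snapshot S on the remote fs
            if prev is not None:
                os.rmdir(os.path.join(src_root, ".snap", prev))  # 4: remove S-1 on both sides
                os.rmdir(os.path.join(dst_root, ".snap", prev))
            prev = cur
            time.sleep(interval_sec)                             # 5: wait N minutes, then repeat
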
> I think the key here, in the end, is to make the setup/configuration of
> this sort of thing easy to understand, and to provide an easy, transparent
> view of the current sync status.
>
> sage