Re: About RADOS level replication

>>> I think the key question here is what kind of consistency we are
>>> looking for in the replica/backup cluster.  At the block layer, nothing
>>> except point-in-time consistency really works, because there is a local
>>> file system sitting on top with specific expectations about consistency.
>>> For a file system, it really depends on what the user/application is
>>> storing there.  If it's something like a database then the same
>>> consistency requirements apply, but if it's home directories, or a
>>> document archive, or a bunch of rendering assets, or probably most other
>>> random things, then probably something like point-in-time consistency for
>>> individual files is sufficient.
>>>
>>> The other question is about the latency.  A simple strategy might be to
>>> periodically snapshot the entire cephfs file system (say, every hour) and
>>> efficiently sync things across.  Or even every 10 minutes.  If we pushed
>>> this hard enough, we could probably get down to ~1 minute intervals?
>>> Might be worth a look.  On the receiving end we'd need to sort out how to
>>> stage the complete set of changes before applying them.  (I'm not certain
>>> but I think this is roughly how NetApp's SnapMirror feature works.. I
>>> think the lag is usually measured in minutes.)

Hi Sage and Greg. Actually, until now, we haven't seen any
requirement for system-wide point-in-time consistency. I think
point-in-time consistency for individual files should be enough :-)
As for the latency, most of our clients want their synchronization
latency to be within 1 minute.

>>> I've seen enterprise systems that go to extreme lengths to have very
>>> strong volume-wide consistency for their async replication; other systems
>>> are pretty ad hoc and opportunistic (think rsync in a loop).  Both models
>>> seem to have happy users.
>>>
>>> Implementing this at the RADOS pool level means that we are stuck with
>>> providing point-in-time consistency, because things on top (cephfs in this
>>> case) expect it.  If we go up a layer we have more flexibility.
>>>
>>> Also, doing it a layer up means we can consider other models: what about
>>> active/active writes?  If we have a last-writer-wins and/or other conflict
>>> resolution approach you could have active users on both cephfs replicas.
>>> Or what about 1:N replication?
>>>
>>>> at the CephFS layer, but as you and other people said, it would double
>>>> all the I/O, which could lead to performance degradation. So we want
>>>> to find a way to accomplish the replication without severely damaging
>>>> the performance of the main cluster
>
> Can you expand more on this "double all the I/O"? Were you thinking
> about doing full data journaling the way rbd-mirror works?

Yes, we thought that if we implemented the replication at the CephFS
layer, an approach like rbd-mirror's might be needed.

> Given the description here, I think it would be much easier and more
> efficient to implement sync at the CephFS level. The simplest option
> is a "smart rsync" that is aware of CephFS' recursive statistics. You
> then start at the root, look at each file or folder to see if its
> (recursive) timestamp is newer than your last sync[1], and if it is,
> you check out the children. Do an rsync on each individual file, and
> profit!
>
> Now you have some choices to make on the consistency model — as Sage
> says, you may want to take snapshots and do this work on a snapshot so
> files are all consistent with each other, though that does impose more
> work to take and clean up snapshots. Or you may just do it on a
> file-by-file basis, in which case it's even easier to just do this
> from a normal client. (Which is not to say you have to; you could also
> build it inside the MDS or in some special privileged daemon or
> something.)
>
> I find an approach like this attractive for a few reasons.
> First of all, it's less invasive to the core code, which means it
> interacts with fewer parts of the system, is easier to maintain in
> itself, and doesn't impose costs on maintaining other parts of Ceph.
> Second, CephFS is unlike RADOS in that it has a centralized,
> consistent set of metadata for tracking all its data. We can take
> advantage of that metadata (or *add* to it!) to make the job of
> synchronizing easier. In RADOS, we are stuck with running algorithms
> that parallelize, and being very careful to minimize the central
> coordination points. That's good for scale, but very bad for ease of
> understanding and development.
> -Greg
> [1]: This is *slightly* trickier than I make it sound to get right, as
> the rstats are updated *lazily*. So you may run a sync at 16:00:00 and
> miss a file down the tree that was updated at 15:59:10 because that
> change hasn't gotten all the way to the root. You'd probably want a
> database associated with the sync daemon that tracks the timestamps
> you saw at the previous sync. If you wanted to build this into the MDS
> system with its own metadata you could look at how forward scrub works
> for a model.
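
If I understand this right, a very rough sketch of such a "smart rsync" walk
over a mounted CephFS tree might look like the following. This is only a
sketch under assumptions that are not from this thread: the primary tree is
mounted at /mnt/cephfs, the ceph.dir.rctime vxattr is readable, the
timestamps seen at the previous sync are kept in a small local shelve
database (per Greg's footnote), and sync_file() just shells out to rsync for
one file; all paths and names here are hypothetical.

    #!/usr/bin/env python
    # Sketch of the "smart rsync" walk: descend only into directories whose
    # recursive ctime moved past what we recorded at the previous sync.
    import os, shelve, subprocess

    SRC = "/mnt/cephfs"          # primary cluster mount (hypothetical path)
    DST = "backup:/mnt/cephfs"   # rsync target on the backup side (hypothetical)

    def rctime(path):
        # ceph.dir.rctime is "<secs>.<nsecs>"; the seconds part is enough here.
        raw = os.getxattr(path, "ceph.dir.rctime").decode()
        return float(raw.split(".")[0])

    def sync_file(path):
        rel = os.path.relpath(path, SRC)
        subprocess.check_call(["rsync", "-a", path, os.path.join(DST, rel)])

    def walk(path, db):
        last = db.get(path, 0.0)
        if rctime(path) <= last:
            return                    # nothing below here changed since last sync
        for name in os.listdir(path):
            child = os.path.join(path, name)
            if os.path.isdir(child):
                walk(child, db)
            elif os.lstat(child).st_mtime > last:
                sync_file(child)
        db[path] = rctime(path)       # record what we actually saw, not wall-clock

    with shelve.open("/var/lib/cephfs-sync/state") as db:
        walk(SRC, db)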

I think this "smart rsync" should be an appropriate way to meet our
current need. And I think that maybe we can reuse the snapshot
mechanism in this "smart rsync". When we find that some files have
been modified, we make snapshots only for those files that are going
to be copied, and apply the diffs between snapshots to the other
clusters. In this way, I think we should be able to save the
bandwidth (network and disk) that would otherwise be spent copying
the unmodified areas of files. Is this right?
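
As a concrete (if rough) illustration of what I mean: since CephFS snapshots
are taken per directory through the .snap directory, we could snapshot a
file's parent directory and then ship only the blocks that differ between two
snapshots of that file. A sketch, where the paths, the 4 MB block size and
the apply_block() helper are all made up, and new or truncated files are
ignored:

    # Sketch: ship only the regions of a file that differ between two
    # per-directory snapshots ("prev" and "curr") already taken under
    # <dir>/.snap/.  Everything here is illustrative, not a real tool.
    import os

    BLOCK = 4 * 1024 * 1024       # compare in 4 MB chunks (arbitrary choice)

    def apply_block(dst_path, offset, data):
        # assumes the backup file system is also mounted locally (hypothetical)
        with open(dst_path, "r+b") as f:
            f.seek(offset)
            f.write(data)

    def changed_blocks(dirpath, name):
        prev = os.path.join(dirpath, ".snap", "prev", name)
        curr = os.path.join(dirpath, ".snap", "curr", name)
        with open(prev, "rb") as old, open(curr, "rb") as new:
            offset = 0
            while True:
                a, b = old.read(BLOCK), new.read(BLOCK)
                if not b:
                    break
                if a != b:
                    yield offset, b   # only the modified regions get shipped
                offset += BLOCK

    for off, data in changed_blocks("/mnt/cephfs/projects", "big_asset.bin"):
        apply_block("/mnt/backupfs/projects/big_asset.bin", off, data)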

Finally, I think, although we don't have an urgent need for an "ops
replication" mechanism in CephFS for now, we should plan ahead for it.
So basically, I think maybe we can implement the final "top-level"
CephFS replication in three steps: first, we implement a "smart
rsync"; then a replication mechanism with point-in-time consistency at
the file level; and finally, when we have the manpower and the
resources, a replication mechanism with system-wide point-in-time
consistency. Does this sound reasonable to you? Thanks.

On 4 May 2018 at 13:57, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> On Fri, May 4, 2018 at 4:16 AM, Xuehan Xu <xxhdx1985126@xxxxxxxxx> wrote:
>>> Hi Xuehan,
>>>
>>> On Fri, 4 May 2018, Xuehan Xu wrote:
>>>> Hi, sage.
>>>>
>>>> I'm really sorry for bothering you. But we really want to know your
>>>> opinion about our motives for implementing a RADOS level replication,
>>>> so I think I should fully illustrate them to you.
>>>
>>> It is no bother at all--thanks for reaching out!  Do you mind if I add
>>> ceph-devel to the thread?
>>
>> Of course not, +devel:-)
>>
>>>
>>>> The main reason is that we have a need to replicate our data in CephFS
>>>> between different data centers, because some of our clients want their
>>>> data produced in one cluster to be replicated to other clusters in other
>>>> data centers quickly. We did consider implementing such a mechanism
>>>
>>> I think the key question here is what kind of consistency we are
>>> looking for in the replica/backup cluster.  At the block layer, nothing
>>> except point-in-time consistency really works, because there is a local
>>> file system sitting on top with specific expectations about consistency.
>>> For a file system, it really depends on what the user/application is
>>> storing there.  If it's something like a database then the same
>>> consistency requirements apply, but if it's home directories, or a
>>> document archive, or a bunch of rendering assets, or probably most other
>>> random things, then probably something like point-in-time consistency for
>>> individual files is sufficient.
>>>
>>> The other question is about the latency.  A simple strategy might be to
>>> periodically snapshot the entire cephfs file system (say, every hour) and
>>> efficiently sync things across.  Or even every 10 minutes.  If we pushed
>>> this hard enough, we could probably get down to ~1 minute intervals?
>>> Might be worth a look.  On the receiving end we'd need to sort out how to
>>> stage the complete set of changes before applying them.  (I'm not certain
>>> but I think this is roughly how NetApp's SnapMirror feature works.. I
>>> think the lag is usually measured in minutes.)
>>>
>>> I've seen enterprise systems that go to extreme lengths to have very
>>> strong volume-wide consistency for their async replication; other systems
>>> are pretty ad hoc and opportunistic (think rsync in a loop).  Both models
>>> seem to have happy users.
>>>
>>> Implementing this at the RADOS pool level means that we are stuck with
>>> providing point-in-time consistency, because things on top (cephfs in this
>>> case) expect it.  If we go up a layer we have more flexibility.
>>>
>>> Also, doing it a layer up means we can consider other models: what about
>>> active/active writes?  If we have a last-writer-wins and/or other conflict
>>> resolution approach you could have active users on both cephfs replicas.
>>> Or what about 1:N replication?
>>>
>>>> at the CephFS layer, but as you and other people said, it would double
>>>> all the I/O, which could lead to performance degradation. So we want
>>>> to find a way to accomplish the replication without severely damaging
>>>> the performance of the main cluster
>
> Can you expand more on this "double all the I/O"? Were you thinking
> about doing full data journaling the way rbd-mirror works?
>
> Given the description here, I think it would be much easier and more
> efficient to implement sync at the CephFS level. The simplest option
> is a "smart rsync" that is aware of CephFS' recursive statistics. You
> then start at the root, look at each file or folder to see if its
> (recursive) timestamp is newer than your last sync[1], and if it is,
> you check out the children. Do an rsync on each individual file, and
> profit!
>
> Now you have some choices to make on the consistency model — as Sage
> says, you may want to take snapshots and do this work on a snapshot so
> files are all consistent with each other, though that does impose more
> work to take and clean up snapshots. Or you may just do it on a
> file-by-file basis, in which case it's even easier to just do this
> from a normal client. (Which is not to say you have to; you could also
> build it inside the MDS or in some special privileged daemon or
> something.)
>
> I find an approach like this attractive for a few reasons.
> First of all, it's less invasive to the core code, which means it
> interacts with fewer parts of the system, is easier to maintain in
> itself, and doesn't impose costs on maintaining other parts of Ceph.
> Second, CephFS is unlike RADOS in that it has a centralized,
> consistent set of metadata for tracking all its data. We can take
> advantage of that metadata (or *add* to it!) to make the job of
> synchronizing easier. In RADOS, we are stuck with running algorithms
> that parallelize, and being very careful to minimize the central
> coordination points. That's good for scale, but very bad for ease of
> understanding and development.
> -Greg
> [1]: This is *slightly* trickier than I make it sound to get right, as
> the rstats are updated *lazily*. So you may run a sync at 16:00:00 and
> miss a file down the tree that was updated at 15:59:10 because that
> change hasn't gotten all the way to the root. You'd probably want a
> database associated with the sync daemon that tracks the timestamps
> you saw at the previous sync. If you wanted to build this into the MDS
> system with its own metadata you could look at how forward scrub works
> for a model.
>
>>>> , and we found that if we can reuse
>>>> the OSD journal as a cache for the replication, we can probably
>>>> achieve that goal. And also, we noticed that the OSD journal only
>>>> exists in FileStore, but we thought that, if it's worth it, we can
>>>> implement a similar journal in other ObjectStores like BlueStore.
>>>> Although implementing such a journal at the ObjectStore level would
>>>> also double the I/O, we thought it should be better than implementing
>>>> it at a higher level like CephFS, since it avoids doubling the I/O
>>>> processing steps between the ObjectStore level and the CephFS level, and we
>>>> thought those processing steps could be expensive when the underlying
>>>> disk is a fast SSD or NVMe device.
>>>
>>> This is a good point.  Another approach we've thought about a bit is
>>> exposing the CLONE operation and using that to construct an explicit
>>> journal object.  E.g., a write operation would write to object A and then
>>> clone that same data range to the end of a log object as well.  There are
>>> still some additional writes for the description of the IO in the log
>>> file, but in the bulk/streaming write case, we mostly don't write the
>>> data twice.
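
(As an aside: if we had such a log object, the backup side would mostly just
replay it. A very rough sketch of a replayer using the librados Python
bindings is below; the entry layout of a fixed-size <oid, offset, length>
header followed by the cloned data, the pool name, and the conffile path are
all invented here purely for illustration.)

    # Very rough sketch of replaying such a log object on the backup cluster.
    import struct
    import rados

    HDR = struct.Struct("!64sQQ")     # hypothetical header: padded oid, offset, length

    def replay_log(ioctx, log_oid):
        pos, size = 0, ioctx.stat(log_oid)[0]
        while pos < size:
            oid, off, length = HDR.unpack(ioctx.read(log_oid, HDR.size, pos))
            data = ioctx.read(log_oid, length, pos + HDR.size)
            ioctx.write(oid.rstrip(b"\0").decode(), data, off)  # re-apply the write
            pos += HDR.size + length

    cluster = rados.Rados(conffile="/etc/ceph/backup.conf")     # hypothetical conf
    cluster.connect()
    ioctx = cluster.open_ioctx("cephfs_data")                   # hypothetical pool
    try:
        replay_log(ioctx, "replication_log.0")
    finally:
        ioctx.close()
        cluster.shutdown()
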
>>>
>>>> Also, I found that there were some things about our design that I
>>>> didn't make clear during the CDM yesterday.
>>>>
>>>> The first is that, in our design, the main cluster doesn't have to
>>>> wait infinitely if the backup cluster goes down or something like that
>>>> happens. Only the replication needs to be suspended. During this
>>>> suspension, objects that are modified are marked as
>>>> need-full-replication, and when the replication resumes, these objects
>>>> will be fully copied to the backup cluster. It's only when an object
>>>> is undergoing a full copy that the ops targeting it need to wait;
>>>> this is like ops waiting when their target objects are being
>>>> recovered or backfilled.
>>>
>>> I see... This makes sense, although I worry that it would be a bit hard to
>>> reason about and debug because it's operating/coordinating at a couple different
>>> layers of the stack.
>>>
>>>> The second is that, in our online clusters, we configured our OSD
>>>> journal space to be 20GB. So we thought that if we could use 50% of it
>>>> for the replication, it should be able to provide some tolerance for
>>>> jitter in the replication bandwidth.
>>>>
>>>> And with regard to flexibility of the replication, we also considered
>>>> some of that. For example, in our design, the replication of an object
>>>> is triggered by an op with its "need_replicate" field set to true, so
>>>> that the user can decide whether their data needs to be replicated.
>>>>
>>>
>>> Thanks for being open to a discussion, but also don't assume I have the
>>> right answer either :).  For something like this there are lots of
>>> answers, and which is best is a balance of a bunch of different things.
>>>
>>> One way or another, I *am* very interested in getting a multi-site CephFS
>>> replication implementation of some sort in place.  Exactly how we do it
>>> and what the properties are is still a bit of an open question, though, so
>>> this is perfect timing--and I'm eager to hear what requirements you see to
>>> help choose an approach!
>>>
>>> Thanks,
>>> sage
>>>



