> Hi Xuehan,
>
> On Fri, 4 May 2018, Xuehan Xu wrote:
>> Hi, sage.
>>
>> I'm really sorry for bothering you. But we really want to know your
>> opinion about our motives for implementing a RADOS-level replication,
>> so I think I should fully illustrate them to you.
>
> It is no bother at all--thanks for reaching out! Do you mind if I add
> ceph-devel to the thread?

Of course not, +devel :-)

>
>> The main reason is that we need to replicate our data in CephFS
>> between different data centers, because some of our clients want the
>> data produced in one cluster to be replicated to clusters in other
>> data centers quickly. We did consider implementing such a mechanism
>
> I think the key question here is what kind of consistency we are
> looking for in the replica/backup cluster. At the block layer, nothing
> except point-in-time consistency really works, because there is a local
> file system sitting on top with specific expectations about consistency.
> For a file system, it really depends on what the user/application is
> storing there. If it's something like a database then the same
> consistency requirements apply, but if it's home directories, or a
> document archive, or a bunch of rendering assets, or probably most
> other random things, then something like point-in-time consistency for
> individual files is probably sufficient.
>
> The other question is about the latency. A simple strategy might be to
> periodically snapshot the entire cephfs file system (say, every hour)
> and efficiently sync things across. Or even every 10 minutes. If we
> pushed this hard enough, we could probably get down to ~1 minute
> intervals? Might be worth a look. On the receiving end we'd need to
> sort out how to stage the complete set of changes before applying them.
> (I'm not certain, but I think this is roughly how NetApp's SnapMirror
> feature works.. I think the lag is usually measured in minutes.)
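
For concreteness, the snapshot-and-sync idea above could be prototyped
with nothing fancier than the loop below (the mount point, destination
and interval are made up for illustration, and snapshots have to be
enabled on the file system):

#!/usr/bin/env python
# Rough sketch: take a CephFS snapshot every INTERVAL seconds and rsync
# the frozen snapshot (not the live tree) to the backup site, so the
# receiver always sees a point-in-time image of the namespace.
import os
import subprocess
import time

SRC = "/mnt/cephfs"                  # primary cluster mount (assumed)
DST = "backup-site:/mnt/cephfs"      # rsync destination (assumed)
INTERVAL = 600                       # sync period in seconds

def take_snapshot(name):
    # A CephFS snapshot is just a mkdir inside the special .snap directory.
    os.mkdir(os.path.join(SRC, ".snap", name))

def drop_snapshot(name):
    os.rmdir(os.path.join(SRC, ".snap", name))

while True:
    snap = "repl-%d" % int(time.time())
    take_snapshot(snap)
    subprocess.check_call([
        "rsync", "-a", "--delete",
        os.path.join(SRC, ".snap", snap) + "/",
        DST + "/",
    ])
    drop_snapshot(snap)
    time.sleep(INTERVAL)

A smarter sync step would diff consecutive snapshots rather than letting
rsync re-walk the whole tree, but the overall structure is the same.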
> I've seen enterprise systems that go to extreme lengths to have very
> strong volume-wide consistency for their async replication; other
> systems are pretty ad hoc and opportunistic (think rsync in a loop).
> Both models seem to have happy users.
>
> Implementing this at the RADOS pool level means that we are stuck with
> providing point-in-time consistency, because things on top (cephfs in
> this case) expect it. If we go up a layer we have more flexibility.
>
> Also, doing it a layer up means we can consider other models: what
> about active/active writes? If we have a last-writer-wins and/or other
> conflict resolution approach, you could have active users on both
> cephfs replicas. Or what about 1:N replication?
>
>> at the CephFS layer, but as you and other people said, it would double
>> all the I/O, which could lead to performance degradation. So we want
>> to find a way to accomplish the replication without severely damaging
>> the performance of the main cluster, and we found that if we can reuse
>> the OSD journal as a cache for the replication, we can probably
>> achieve that goal. We also noticed that the OSD journal only exists in
>> FileStore, but we thought that, if it's worth it, we could implement a
>> similar journal in other ObjectStores like BlueStore. Although
>> implementing such a journal at the ObjectStore level also doubles the
>> I/O, we thought it should be better than implementing it at a higher
>> level like CephFS, since it avoids doubling the I/O processing steps
>> between the ObjectStore level and the CephFS level, and we thought
>> those processing steps could be expensive when the underlying disk is
>> a fast SSD or NVMe device.
>
> This is a good point. Another approach we've thought about a bit is
> exposing the CLONE operation and using that to construct an explicit
> journal object. E.g., a write operation would write to object A and
> then clone that same data range to the end of a log object as well.
> There are still some additional writes for the description of the IO in
> the log object, but in the bulk/streaming write case we mostly don't
> write the data twice.
>
>> Also, I found that there were some things about our design that I
>> didn't make clear during the CDM yesterday.
>>
>> The first is that, in our design, the main cluster doesn't have to
>> wait indefinitely if the backup cluster goes down or something like
>> that happens. Only the replication needs to be suspended. During this
>> suspension, objects that are modified are marked as
>> need-full-replication, and when the replication resumes, these objects
>> will be fully copied to the backup cluster. It's only while an object
>> is undergoing such a full copy that ops targeting it need to wait,
>> much like ops wait when their target objects are being recovered or
>> backfilled.
>
> I see... This makes sense, although I worry that it would be a bit hard
> to reason about and debug because it's operating/coordinating at a
> couple different layers of the stack.
>
>> The second is that, in our online clusters, we configured our OSD
>> journal space to 20GB. So we thought that if we can use 50% of it for
>> the replication, it should provide some tolerance for jitter in the
>> bandwidth of the replication.
>>
>> And with regard to flexibility of the replication, we also considered
>> some of that. For example, in our design, the replication of an object
>> is triggered by an op with its "need_replicate" field set to true, so
>> that the user can decide whether their data needs the replication.
>
> Thanks for being open to a discussion, but also don't assume I have the
> right answer either :). For something like this I think there are lots
> of answers, and which is best is a balance of a bunch of different
> things.
>
> One way or another, I *am* very interested in getting a multi-site
> CephFS replication implementation of some sort in place. Exactly how we
> do it and what the properties are is still a bit of an open question,
> though, so this is perfect timing--and I'm eager to hear what
> requirements you see to help choose an approach!
>
> Thanks,
> sage
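
For concreteness, below is a very rough sketch of what the clone-based
log object idea could look like from the client side, written against
the Python librados bindings. The pool, object and log names are made
up, and the clone step itself is shown as a plain append, because no
clone-range style op is exposed to clients today -- replacing that
append with a server-side clone is exactly the piece that would need
new plumbing:

#!/usr/bin/env python
# Sketch only: write to the target object, then record the IO in a log
# object that a replication agent could later stream to the backup site.
import json
import struct
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("data")      # example pool name

LOG_OBJ = "repl_log.0"                  # made-up log object name

def replicated_write(oid, data, offset=0):
    # 1. The normal write to the target object.
    ioctx.write(oid, data, offset)
    # 2. A small description of the IO, appended to the log object.
    desc = json.dumps({"oid": oid, "off": offset, "len": len(data)}).encode()
    ioctx.append(LOG_OBJ, struct.pack(">I", len(desc)) + desc)
    # 3. The data itself, appended to the log. With an exposed CLONE op,
    #    this is the append that would instead become a clone of the range
    #    just written to `oid`, done inside the OSD, so the bulk data is
    #    not written (or shipped over the wire) a second time.
    ioctx.append(LOG_OBJ, data)

replicated_write("myobject", b"hello world")
ioctx.close()
cluster.shutdown()

Note that the two updates above are not atomic; keeping the target
object and the log consistent (and co-locating them so a clone is even
possible) is part of what an OSD-side implementation would have to
handle.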