About RADOS level replication

Xuehan Xu <xxhdx1985126@xxxxxxxxx> · Thu, 3 Aug 2017 04:05:01 +0800

Hi, Sage and Joao

I think there was something I didn't make clear just now, sorry.

About the second issue, by "ensuring consistency of cross-object
operations", I mean:

     Say an rbd write X involves two RADOS objects, A and B, and turns
them into A1 and B1 respectively, and a following rbd write Y turns
them into A2 and B2. When these two operations are replicated, the
result on the backup cluster can't be A1, B2 or A2, B1.
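
Put differently, if each object's state is tagged with the write that
produced it, the backup cluster must never expose a pair that mixes the
two writes. A tiny Python sketch of that invariant follows; the version
tags and the function name are made up purely for illustration:

```python
# Version tags: 1 = state after write X, 2 = state after write Y.
# These tags are illustrative only; RADOS does not expose them like this.
def is_consistent(a_version, b_version):
    """True if objects A and B reflect the same rbd write."""
    return a_version == b_version


states = [(a, b) for a in (1, 2) for b in (1, 2)]
allowed = [s for s in states if is_consistent(*s)]
forbidden = [s for s in states if not is_consistent(*s)]
```

Here `forbidden` holds exactly the two mixed pairs, (A1, B2) and
(A2, B1).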

So, just like Joao said, this is all about ORDERING.

My approach is like this: since RADOS can guarantee the order of OPs
coming from the same client and targeting the same object, we have all
"repop"s within the same rbd operation forwarded to the same
intermediate node, and the intermediate node forwards these "repop"s to
the backup cluster on two conditions:

1) all "repop"s within the same rbd operation have arrived at the
   intermediate node;
2) all rbd operations whose ids are less than that of the current rbd
   operation have been sent to the backup cluster (only sent; they need
   not yet be replicated or "ondisk" on the backup cluster).

The id is a monotonically increasing integer that uniquely identifies
an rbd operation; the order of the ids reflects the order in which the
operations were created by the librbd client.

With these two constraints, I think we can ensure the order of rbd
operations. The first condition makes sure that either all "repop"s of
an rbd operation are sent to the backup cluster or none of them are,
which ensures the consistency of the resulting rbd image if the master
cluster crashes when only part of an rbd operation has been forwarded
to the intermediate node. The second condition makes sure that rbd
operations are replicated to the backup cluster in the order in which
they were created by librbd clients. And since RADOS can guarantee the
order of OPs coming from the same client and targeting the same object,
an rbd operation doesn't have to wait until its predecessor is fully
replicated; it is enough that the predecessor has been issued to the
backup cluster. I think this should preserve the throughput of the
inter-cluster replication procedure.
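
To make the two conditions concrete, here is a minimal Python sketch of
the gating logic on the intermediate node. It is only an illustration,
not Ceph code: the names IntermediateNode, on_repop and send are
hypothetical, and it assumes that each "repop" carries the id of its rbd
operation together with the total number of "repop"s in that operation,
that ids start at 1 without gaps, and that send() merely issues a
"repop" to the backup cluster without waiting for it to be "ondisk".

```python
from collections import defaultdict


class IntermediateNode:
    """Hypothetical forwarding gate; not actual Ceph code."""

    def __init__(self, send, first_id=1):
        self.send = send                  # issues one repop to the backup
        self.next_id = first_id           # smallest op id not yet sent
        self.pending = defaultdict(list)  # op id -> repops received
        self.expected = {}                # op id -> repops in that op

    def on_repop(self, op_id, total, repop):
        """Buffer a repop, then forward every op that is now eligible."""
        self.pending[op_id].append(repop)
        self.expected[op_id] = total
        self._flush()

    def _flush(self):
        # Condition 2: only the op with id == next_id may go, so every
        # op with a smaller id has already been sent.
        # Condition 1: it goes only once all of its repops have arrived.
        while (self.next_id in self.expected
               and len(self.pending[self.next_id])
               == self.expected[self.next_id]):
            for repop in self.pending.pop(self.next_id):
                self.send(self.next_id, repop)
            del self.expected[self.next_id]
            self.next_id += 1


sent = []
node = IntermediateNode(lambda op_id, repop: sent.append((op_id, repop)))
node.on_repop(2, 2, "A2")  # op 2 arrives first: held back
node.on_repop(1, 2, "A1")  # op 1 is incomplete: still held
node.on_repop(2, 2, "B2")  # op 2 complete, but op 1 not yet sent
node.on_repop(1, 2, "B1")  # op 1 complete: op 1 goes, then op 2
```

In this run nothing is forwarded until the last call, after which
`sent` contains both "repop"s of operation 1 followed by both "repop"s
of operation 2. A crash before that point leaves the backup cluster
with neither operation applied rather than a partial one.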

The details of this approach are shown in Algorithm 17.

And since we are implementing this at the RADOS level, we shouldn't
directly process "rbd" operations here. So, I think we should introduce
the concept of an "object set" to map onto upper-level concepts such
as an "rbd image".

I don't know if I'm considering this in the right way, and I'm looking
forward to your opinion. Thanks very much:-)