rbd-mirror and DR test

Kamil Madac <kamil.madac@xxxxxxxxx> · Mon, 18 Sep 2023 15:36:17 +0200

One of our customers is currently facing a challenge in testing our
disaster recovery (DR) procedures on a pair of Ceph clusters (Quincy
version 17.2.5).

Our issue revolves around the need to resynchronize data after
conducting a DR procedure test. In small-scale scenarios, this may not
be a significant problem. However, when dealing with terabytes of
data, it becomes a considerable challenge.

In a typical DR procedure, there are two sites, Site A and Site B. The
process involves demoting Site A and promoting Site B, followed by the
reverse operation to ensure data resynchronization. Our case,
however, differs:

- Site A is running and serving production traffic; Site B exists only
for DR purposes.
- Network connectivity between Site A and Site B is deliberately disrupted.
- A "promote" operation is enforced (--force) on Site B, creating a
split-brain situation.
- Data access and modifications are performed on Site B during this state.
- To revert to the original configuration, we must demote Site B, but
the only way to re-establish RBD mirroring is by forcing a full
resynchronization, essentially recopying the entire dataset.
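
For concreteness, the sequence in the list above can be sketched roughly as follows. This is only an illustration of the steps that lead to the full resync, not a fix for it. It assumes image-level mirroring driven through the `rbd mirror image` subcommands (`promote --force`, `demote`, `resync`); the pool and image names, the helper names `dr_test_commands` and `run`, and the dictionary keys are hypothetical and chosen for this example. By default nothing is executed, the commands are only printed.

```python
import shlex
import subprocess


def dr_test_commands(pool, image):
    """Build the rbd commands for the DR test described above.

    All commands are meant to be run against Site B. The pool and image
    names are placeholders supplied by the caller.
    """
    spec = f"{pool}/{image}"
    return {
        # Site B is cut off from Site A, so the promotion has to be forced.
        # This is the step that produces the split-brain state.
        "failover_on_site_b": [
            ["rbd", "mirror", "image", "promote", "--force", spec],
        ],
        # Reverting: demote the Site B image again, then flag it for a
        # resync from the primary. The resync recopies the whole image.
        "failback_on_site_b": [
            ["rbd", "mirror", "image", "demote", spec],
            ["rbd", "mirror", "image", "resync", spec],
        ],
    }


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # "mypool" and "myimage" are hypothetical names.
    steps = dr_test_commands("mypool", "myimage")
    run(steps["failover_on_site_b"])
    run(steps["failback_on_site_b"])
```

The last command in the failback step is the one that becomes expensive with terabytes of data, which is the part we would like to avoid.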

Given these circumstances, we are interested in how to address this
challenge efficiently, especially when dealing with large datasets
(TBs of data). Are there alternative approaches, best practices, or
recommendations that would let us re-establish rbd-mirror without a
full resync from Site A to Site B?

Thank you very much for any advice.

Kamil Madac
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx