One of our customers is having trouble testing disaster recovery (DR) procedures on a pair of Ceph clusters (Quincy, version 17.2.5). The problem is the need to resynchronize data after a DR test. At small scale this is not a significant problem, but with terabytes of data it becomes a considerable challenge.

In a typical DR procedure there are two sites, Site A and Site B. The process involves demoting Site A and promoting Site B, followed by the reverse operation to ensure data resynchronization.

In our case, however:

- Site A is running and serving production traffic; Site B is for DR purposes only.
- Network connectivity between Site A and Site B is deliberately disrupted.
- A "promote" operation is forced (--force) on Site B, creating a split-brain situation.
- Data is accessed and modified on Site B while it is in this state.
- To revert to the original configuration, we must demote Site B, but the only way to re-establish RBD mirroring is to force a full resynchronization, which essentially recopies the entire dataset.

Given these circumstances, we would like to know how to handle this efficiently, especially with large datasets (TBs of data). Are there alternative approaches, best practices, or recommendations that would let us re-establish rbd-mirror without a full resync from Site A to Site B?

Thank you very much for any advice.

Kamil Madac