Re: recovery for node disaster

Robert Sander <r.sander@xxxxxxxxxxxxxxxxxxx> · Mon, 13 Feb 2023 09:33:40 +0100

On 13.02.23 at 06:31, farhad kh wrote:

Is it possible to recover data when two nodes with all physical disks are
lost for any reason?

You have one copy of each object on each node and each node runs a MON.

If two nodes fail then the cluster will cease to function as the 
remaining MON will not be able to gain quorum.
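
As a rough sketch of the arithmetic behind this: the MONs need a strict majority of the members listed in the MON map to form a quorum. The function names below are made up for illustration and are not part of Ceph.

```python
def quorum_size(total_mons: int) -> int:
    """Smallest strict majority of the MONs listed in the MON map."""
    return total_mons // 2 + 1


def has_quorum(total_mons: int, alive_mons: int) -> bool:
    """True if the surviving MONs can still form a majority."""
    return alive_mons >= quorum_size(total_mons)


# Three MONs in the map, two nodes lost: 1 of 3 is not a majority.
print(has_quorum(3, 1))  # False

# Only one MON left in the map: 1 of 1 is a majority.
print(has_quorum(1, 1))  # True
```

This is also why MONs are usually deployed in odd numbers: a fourth MON raises the required majority to three without allowing more failures than three MONs do.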

In this worst case you would need to manually edit the MON map and 
remove the two failed MONs. The remaining MON will then be "lonely" and 
will be able to reach quorum with itself. The cluster will work again.
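
A minimal sketch of what that MON map edit might look like, assuming the procedure from the Ceph documentation for removing monitors from an unhealthy cluster (extract the map, remove the dead MONs with monmaptool, inject the map again). The helper only builds the command lines and does not execute anything. The MON IDs and the /tmp/monmap path are placeholders, and in a containerized deployment the commands would have to be run inside the MON's container.

```python
import shlex


def monmap_removal_commands(surviving_id, failed_ids, monmap_path="/tmp/monmap"):
    """Build the command lines for removing dead MONs from the MON map.

    Nothing is executed here. The surviving MON has to be stopped before
    these commands are run and started again afterwards.
    """
    commands = [
        # Extract the current MON map from the surviving MON's store.
        ["ceph-mon", "-i", surviving_id, "--extract-monmap", monmap_path],
    ]
    for mon_id in failed_ids:
        # Drop one failed MON from the extracted map.
        commands.append(["monmaptool", monmap_path, "--rm", mon_id])
    # Write the edited map back into the surviving MON.
    commands.append(
        ["ceph-mon", "-i", surviving_id, "--inject-monmap", monmap_path]
    )
    return commands


# Hypothetical MON IDs: "a" survived, "b" and "c" are gone.
for argv in monmap_removal_commands("a", ["b", "c"]):
    print(shlex.join(argv))
```

Keeping a copy of the MON's data directory before injecting the edited map is a sensible precaution, since this is the last MON left.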

At this point the data will be available again, but read-only.

This is because fewer than "min_size" copies of each object are available.

The next step would be to add new nodes (and MONs). Reduce min_size for 
each pool to 1 to tell the cluster that it should recover from the 
last remaining copy.

Once recovery has completed, increase min_size to 2 again.
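
The min_size changes could be scripted roughly as follows. This is only a sketch: it prints "ceph osd pool set" command lines instead of running them, and the pool names are placeholders ("ceph osd pool ls" lists the real ones).

```python
import shlex


def min_size_commands(pools, min_size):
    """Build one 'ceph osd pool set' command line per pool (not executed)."""
    if min_size < 1:
        raise ValueError("min_size must be at least 1")
    return [
        ["ceph", "osd", "pool", "set", pool, "min_size", str(min_size)]
        for pool in pools
    ]


# Hypothetical pool names, for illustration only.
pools = ["rbd", "cephfs_data"]

# Before recovery: allow the cluster to work from the single remaining copy.
for argv in min_size_commands(pools, 1):
    print(shlex.join(argv))

# After recovery has finished: restore the safer setting.
for argv in min_size_commands(pools, 2):
    print(shlex.join(argv))
```

It is worth checking with "ceph -s" that all PGs are active+clean before running the second set of commands.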

While recovery runs there is an increased risk of losing data if a disk 
in the remaining node fails.

What is the maximum number of fault tolerance for the cluster?

Such a cluster can withstand the loss of two nodes without data loss, 
provided no disk in the remaining node fails.
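
To make the cases explicit, here is a small model of the situation discussed in this thread. It assumes three nodes with one MON and one copy of each object per node, min_size 2, whole-node failures only and no additional disk failures. The function is illustrative and not part of any Ceph tooling.

```python
def cluster_state(nodes_lost, nodes=3, min_size=2):
    """Describe a cluster with one MON and one object copy per node."""
    if not 0 <= nodes_lost <= nodes:
        raise ValueError("nodes_lost must be between 0 and nodes")
    remaining = nodes - nodes_lost
    return {
        "copies_left": remaining,
        # MONs need a strict majority of the original MON map.
        "mon_quorum": remaining >= nodes // 2 + 1,
        # Enough copies for the pools to work without lowering min_size.
        "min_size_met": remaining >= min_size,
        # At least one copy of each object still exists.
        "data_survives": remaining >= 1,
    }


for lost in range(4):
    print(lost, cluster_state(lost))
```

In this model, losing one node leaves the cluster running, losing two keeps the data but needs the manual steps described above, and losing all three loses the data.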

To increase fault tolerance you need to streamline your processes and 
replace a failed node immediately, before the next one fails. In such 
small clusters each additional failure can lead to data loss.

Best would be to add more nodes.

Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin

http://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Mandatory disclosures per §35a GmbHG:
HRB 220009 B / Berlin-Charlottenburg District Court,
Managing Director: Peer Heinlein -- Registered office: Berlin
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx