Is there any way to obtain the maximum number of node failures in Ceph without data loss?

Hello,

I would like to know the maximum number of node failures for an EC 8+3
pool in a 12-node cluster with 3 OSDs per node.  The size and min_size
of the EC 8+3 pool are configured as 11 and 8 respectively, and the
OSDs of each PG are chosen with host as the failure domain.  When
there are no node failures, the maximum number of tolerable node
failures is 3, right?  After unplugging an OSD (osd.14) in the
cluster, I checked the changes to the PG acting sets, and one of the
results is shown below:

T0:
[15,31,11,34,28,1,8,26,14,19,5]

T1: after unplugging an OSD (osd.14); recovery started
[15,31,11,34,28,1,8,26,NONE,19,5]

T2:
[15,31,11,34,21,1,8,26,19,29,5]

T3:
[15,31,11,34,NONE,1,8,26,NONE,NONE,5]

T4: recovery was done
[15,31,11,34,21,1,8,26,19,29,5]

For this PG, 3 OSD peers changed during recovery
([_,_,_,_,28->21,_,_,_,14->19,19->29,_]).  It seems that only
min_size (8) chunks of the EC 8+3 PG were available during part of the
recovery (T3).  Does that mean no further node failures can be
tolerated between T3 and T4?  Can we calculate the maximum number of
tolerable node failures by examining the acting sets of all PGs?  Is
there a simple way to obtain such information?  Any ideas and feedback
are appreciated, thanks!

- Jerry
