Hi,
1. Two disks would fail, where the two failed disks are not on the
same host? I think Ceph would be able to find a PG placement across
all hosts avoiding the two failed disks, so Ceph would be able to
repair and reach a healthy status after a while?
Yes, if there is enough free disk space and no other OSDs fail during
that time, then Ceph will recover successfully and the PGs will remain
available.
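As a rough sketch of the shard arithmetic behind this (plain Python,
hypothetical helper names, not Ceph API calls): with a 4+2 profile each
PG has 6 shards, one per host, so two failed disks on different hosts
cost any single PG at most 2 shards, and k shards are enough to read
the data.

```python
# Hypothetical sketch of EC shard arithmetic for a k=4, m=2 pool
# on 6 hosts (one shard per host); these are not Ceph API calls.

K, M = 4, 2  # data shards, coding shards


def shards_surviving(failed_shards: int) -> int:
    """Shards a PG still has after losing `failed_shards` of its k+m shards."""
    return K + M - failed_shards


def data_readable(failed_shards: int) -> bool:
    """Data can be reconstructed as long as at least k shards survive."""
    return shards_surviving(failed_shards) >= K


# Two failed disks on different hosts: a PG loses at most 2 shards.
print(data_readable(2))  # True  - still readable, recovery can proceed
print(data_readable(3))  # False - a third concurrent failure loses data
```

While recovery runs, the missing shards are rebuilt onto other OSDs on
the surviving hosts, which is why enough free space matters.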
2. Two complete hosts would fail, say because of broken power
supplies? In this case Ceph would no longer be able to repair the
damage because there are not two more "free" hosts remaining to
satisfy the 4+2 rule (with redundancy at host level). So data would
not be lost, but the cluster might stop delivering data, would be
unable to repair itself, and thus would also be unable to become
healthy again?
Correct, your cluster would remain in a degraded state until you have
6 healthy hosts again. But keep in mind that with EC your pool's
min_size is usually k+1, so in your example the cluster would stop
serving I/O the moment the second host fails: only k = 4 shards would
remain per PG, which is below min_size = 5.
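A minimal sketch of that min_size rule (assuming the usual EC default
min_size = k+1; plain Python, no Ceph calls):

```python
# Sketch: why a 4+2 pool on 6 hosts stops serving I/O after the
# second host failure, assuming the usual EC default min_size = k + 1.

K, M = 4, 2
MIN_SIZE = K + 1  # 5: one shard of slack before I/O pauses


def serves_io(failed_hosts: int) -> bool:
    """With one shard per host, each failed host removes one shard per PG."""
    surviving = K + M - failed_hosts
    return surviving >= MIN_SIZE


print(serves_io(1))  # True  - 5 shards >= min_size 5, I/O continues
print(serves_io(2))  # False - 4 shards <  min_size 5, PGs go inactive
```

So even though m = 2 means no data is lost, client I/O already pauses
at the second host failure.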
Ideally, k+m should be smaller than the number of available hosts so
that your cluster can recover on its own. If you want to be able to
recover from two failed hosts, take that into account when choosing k
and m, i.e. keep k+m at most the number of hosts minus two.
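That sizing rule can be written down as a quick check (a sketch under
the stated assumptions: one shard per host, and no more than m hosts
fail; the inequality k+m <= hosts - f just says the surviving hosts
must still be able to hold all k+m shards):

```python
def can_recover(k: int, m: int, hosts: int, failed_hosts: int) -> bool:
    """True if the surviving hosts can still hold all k+m shards
    (one shard per host), i.e. the pool can rebuild to full redundancy.
    Assumes failed_hosts <= m, otherwise data is already lost."""
    return k + m <= hosts - failed_hosts


# 4+2 on 6 hosts: fits while healthy, but there is no spare host to
# rebuild onto after even one host failure, let alone two.
print(can_recover(4, 2, 6, 0))  # True
print(can_recover(4, 2, 6, 2))  # False
# A 2+2 profile on 6 hosts leaves room to recover from two failed hosts:
print(can_recover(2, 2, 6, 2))  # True
```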
Regards,
Eugen
Quoting Rainer Krienke <krienke@xxxxxxxxxxxxxx>:
Hello,
recently I have been thinking about erasure coding and how to set k+m
in a useful way, also taking into account the number of hosts
available for Ceph. Say I have this setup:
The cluster has 6 hosts and I want to allow two *hosts* to fail
without losing data. So I might choose k+m as 4+2 with redundancy
at host level, but isn't this a little unwise?
What would happen if:
1. Two disks would fail, where the two failed disks are not on the
same host? I think Ceph would be able to find a PG placement across
all hosts avoiding the two failed disks, so Ceph would be able to
repair and reach a healthy status after a while?
2. Two complete hosts would fail, say because of broken power
supplies? In this case Ceph would no longer be able to repair the
damage because there are not two more "free" hosts remaining to
satisfy the 4+2 rule (with redundancy at host level). So data would
not be lost, but the cluster might stop delivering data, would be
unable to repair itself, and thus would also be unable to become
healthy again?
Right or wrong?
Thanks a lot
Rainer
--
Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse 1
56070 Koblenz, Web: http://www.uni-koblenz.de/~krienke, Tel: +49261287 1312
PGP: http://www.uni-koblenz.de/~krienke/mypgp.html, Fax: +49261287 1001312
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx