Re: Erasure code profile

That's a pretty hard question. I don't think it would speed up writes much, because you end up writing the same amount of data, but I think that on a 4+4 setup rebuilding, or serving data while a node is down, will go faster and use fewer resources, because it only has to rebuild smaller chunks of data.
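
To put rough numbers on that (just an illustrative example, using a 4 MB object since that is the default RBD object size):

    2+2: the object is split into 2 data chunks of 2 MB, plus 2 coding chunks of 2 MB
    4+4: the same object becomes 4 data chunks of 1 MB, plus 4 coding chunks of 1 MB

Recovering a lost chunk still reads k chunks' worth of data either way (roughly the object size, with the default jerasure-style Reed-Solomon), but with 4+4 each read is smaller and comes from more OSDs in parallel, which is where I'd expect the recovery speed-up to come from.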

Another question: when you create an EC pool you also create an EC crush rule, so what would happen if you set a failure domain of node but the rule splits the data across OSDs? How do the failure domain and the crush rule interact?

-------- Original message --------
From: Oliver Humpage <oliver@xxxxxxxxxxxxxxx>
Date: 24/10/17 10:32 p.m. (GMT+01:00)
To: Karun Josy <karunjosy1@xxxxxxxxx>
Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
Subject: Re: [ceph-users] Erasure code profile


Consider a cluster of 8 OSD servers with 3 disks on each server. 

If I use a profile setting of k=5, m=3 and ruleset-failure-domain=host;

As far as I understand, it can tolerate the failure of 3 OSDs and 1 host, am I right?

When setting up your pool, you specify a crush map which says what your "failure domain" is. You can think of a failure domain as "what's the largest single thing that could fail and the cluster would still survive?". By default this is a node (a server). Large clusters often use a rack instead. Ceph places your data across the OSDs in your cluster so that if that large single thing (node or rack) fails, your data is still safe and available.
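
For what it's worth, the failure domain is just a parameter on the erasure code profile that the pool is created from, something along these lines (profile/pool names and PG counts are only placeholders; on Luminous the option is called crush-failure-domain, older releases call it ruleset-failure-domain):

    ceph osd erasure-code-profile set ec53 k=5 m=3 crush-failure-domain=host
    ceph osd pool create ecpool 64 64 erasure ec53

Creating the pool then auto-generates a matching crush rule that uses that failure domain.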

If you specify a single OSD (a disk) as your failure domain, then Ceph might end up placing several chunks of the same object on different OSDs in the same node. This is a bad idea, since if that node goes down you'll lose several OSDs at once, and so your data might not survive.

If you have 8 nodes and 5+3 erasure coding, then with the default failure domain of a node your data will be spread across all 8 nodes (data chunks on 5 of them, and parity chunks on the other three). Therefore you could tolerate 3 whole nodes failing. You are right that 5+3 encoding will result in disk usage of 1.6x the data size.
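
If it helps, the rule that gets auto-generated from such a profile looks roughly like this (a sketch from memory, with rule ids and some tunables omitted; they will differ on a real cluster):

    rule ecpool {
            type erasure
            step set_chooseleaf_tries 5
            step take default
            step chooseleaf indep 0 type host
            step emit
    }

The "chooseleaf indep 0 type host" step picks one OSD under each of k+m different hosts, which is why the 8 chunks end up on 8 separate nodes.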

If you were being pathological about minimising disk usage, I think you could in theory set a failure domain of an OSD, then use 8+2 encoding with a crush map that never used more than 2 OSDs in each node for a placement group. Then technically you could tolerate a node failure. I doubt anyone would recommend that though!
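
The crush rule steps for that pathological case would be something like this (an untested sketch, just to show the idea): pick 5 hosts, then 2 OSDs under each, giving the 10 shards needed for 8+2 with never more than 2 on any one host:

    step take default
    step choose indep 5 type host
    step chooseleaf indep 2 type osd
    step emit

With that layout a whole node only takes 2 shards with it, which 8+2 can survive.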

That said, here's a question for others: say a cluster only has 4 nodes (each with many OSDs), would you use 2+2 or 4+4? Either way you use 2x the data space and could lose 2 nodes (assuming a proper crush map), but presumably the 4+4 would be faster and you could lose more OSDs?
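
Spelling out the arithmetic behind that (assuming the crush map spreads the chunks evenly over the 4 nodes):

    2+2: raw space = (2+2)/2 = 2x data, 1 chunk per node, so losing 2 nodes costs 2 chunks = m, still recoverable
    4+4: raw space = (4+4)/4 = 2x data, 2 chunks per node, so losing 2 nodes costs 4 chunks = m, still recoverable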

Oliver.

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
