On 10/04/2019 18.11, Christian Balzer wrote:
> Another thing that crossed my mind aside from failure probabilities
> caused by actual HDDs dying is of course the little detail that most
> Ceph installations will have WAL/DB (journal) on SSDs, the most typical
> ratio being 1:4.
> And given the current thread about compaction killing pure HDD OSDs,
> something you may _have_ to do.
>
> So if you get unlucky and an SSD dies, 4 OSDs are irrecoverably lost,
> unlike a dead node that can be recovered.
> Combine that with the background noise of HDDs failing, and things just
> got quite a bit scarier.

Certainly, your failure domain should be at least host, and that changes
the math (even without considering whole-host failure).

Let's say you have 375 hosts and 4 OSDs per host, with the failure domain
correctly set to host. Same 50000 pool PGs as before.

Now if 3 hosts die:

50000 / (375 choose 3) =~ 0.57% chance of data loss

This is equivalent to having 3 shared SSDs die.

If 3 random OSDs die, each in a different host, the chance of data loss
would be:

0.57% / (4^3) =~ 0.00896%

(There is a 1 in 4 chance per host that you hit the OSD a PG actually
lives on, and you need to hit all 3.)

This is marginally higher than the ~0.00891% with uniformly distributed
PGs, because you've eliminated all sets of 3 OSDs which share a host.

(A quick Python sketch that reproduces these numbers is below.)

--
Hector Martin (hector@xxxxxxxxxxxxxx)
Public Key: https://mrcn.st/pub
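For reference, a minimal sketch of the arithmetic above, in Python. It
assumes the same setup as before (replication 3, 50000 PGs, 375 hosts,
4 OSDs per host) and uses pgs / C(n, 3) as a union-bound approximation,
which slightly overestimates the true probability but is fine at these
magnitudes:

from math import comb

pgs = 50000
hosts = 375
osds_per_host = 4
osds = hosts * osds_per_host  # 1500

# 3 whole hosts (or 3 shared WAL/DB SSDs) die: data is lost if some PG
# has all 3 of its replicas on those hosts.
p_hosts = pgs / comb(hosts, 3)
print(f"3 hosts die:              ~{p_hosts:.2%}")    # ~0.57%

# 3 random OSDs die, each in a different host: per affected host there is
# a 1 in 4 chance the dead OSD is the one holding the PG's replica.
p_osds = p_hosts / osds_per_host**3
print(f"3 OSDs in distinct hosts: ~{p_osds:.5%}")     # ~0.00896%

# For comparison: PGs distributed uniformly over all 1500 OSDs, with no
# host awareness (the earlier figure in the thread).
p_uniform = pgs / comb(osds, 3)
print(f"uniform over 1500 OSDs:   ~{p_uniform:.5%}")  # ~0.00891%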