ceph reliability in large RBD setups

Felix Schüren <felix.schueren@xxxxxxxxxxxxx> · Sat, 07 Dec 2013 16:10:06 +0100

Hi,

I am trying to wrap my head around large RBD-on-RADOS clusters and their
reliability and would love some community feedback.

First, the RADOS-only case. Looking only at node failures, and assuming
an MTTR of 1 day and a node MTBF of 20,000h (~2.3 years), the
reliability of a single object should work out as follows:

MTBF 20,000h == annualized failure rate (AFR) of ~32%. Broken down to a
daily rate, that means there is a ~0.09% chance every day for a single
node to break down (assuming, simplistically, that the daily failure
rate = AFR/365).

The chance of losing all nodes that hold a given object at the same time
is DFR^(number of replicas), where DFR is the daily failure rate, so:
# replicas   prob. of total system failure
1            0.089033220%
2            0.000079269%
3            0.000000071%
4            0.00000000006%
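
The table can be reproduced with a few lines of Python. This is only a
sketch: it assumes node failures are independent and takes the daily
failure rate straight from the 1-replica row; `loss_probability` is an
illustrative name.

```python
dfr = 0.00089033220  # daily failure rate from the 1-replica row, as a fraction

def loss_probability(dfr, replicas):
    """Chance that every node holding one object fails on the same day,
    assuming the node failures are independent of each other."""
    return dfr ** replicas

for replicas in range(1, 5):
    percent = loss_probability(dfr, replicas) * 100
    print(f"{replicas}\t{percent:.11f}%")
```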

(though I think I need to take the number of nodes into account as well:
the more nodes there are, the less likely it becomes that the peer nodes
of a single object will crash simultaneously)

That means that even on hardware with a high chance of failure, my
single objects (when using 3 replicas) should be fine. That is
unsurprising, seeing as this is one of the design goals of RADOS.

Now, let's bring RBD into play. Using sufficiently large disks (assume a
10TB RBD disk size) and the default object size of 4MB, a disk that is
10% full (1TB written) ends up as 1TB/4MB = 250,000 objects. That means
that every Ceph OSD node participating in that disk's RBD pool holds
parts of this disk, so every OSD node failure puts this disk at risk of
losing blocks (and in fact all RBD disks, since pretty much every RBD
disk will have objects on every node). My gut tells me the risk of data
loss is much higher in the RBD case than in the single-object case, but
maybe I am mistaken? Can one of you enlighten me with some probability
calculation magic? It is probably best to start with plain RADOS, then
move into RBD territory. My fear is that really large (3000+ nodes) RBD
clusters will become too risky to run, and I would love for someone to
dispel that fear with math ;)
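
One rough way to put numbers on the RBD case, as a sketch rather than
an answer: Ceph hashes objects into placement groups (PGs), and each PG
lives on one set of replica nodes, so the number of distinct replica
sets that can be lost is bounded by the PG count, not by the 250,000
objects. The code below assumes independent node failures, treats each
node as a single OSD, treats replica sets as independent of each other
(they overlap in practice), and uses a hypothetical sizing of roughly
100 PG replicas per node. It also ignores that recovery gets faster as
the cluster grows.

```python
import math

def p_any_replica_set_lost(dfr, replicas, nodes, pg_count):
    """Rough daily chance that at least one replica set loses every member.

    Assumes independent node failures and treats replica sets as
    independent of each other, so this is only an estimate.
    """
    # There cannot be more distinct replica sets than PGs,
    # nor more than the number of ways to pick `replicas` nodes.
    distinct_sets = min(pg_count, math.comb(nodes, replicas))
    p_one_set = dfr ** replicas
    # 1 - (1 - p)^n, computed via log1p/expm1 to stay accurate for tiny p
    return -math.expm1(distinct_sets * math.log1p(-p_one_set))

dfr = 0.00089033220  # daily node failure rate from the post, as a fraction
replicas = 3

for nodes in (10, 100, 3000):
    pg_count = nodes * 100 // replicas  # hypothetical: ~100 PG replicas per node
    p = p_any_replica_set_lost(dfr, replicas, nodes, pg_count)
    print(f"{nodes:5d} nodes, {pg_count:6d} PGs: {p * 100:.7f}% per day")
```

Under these assumptions the daily risk grows roughly linearly with the
number of PGs, so it is higher than in the single-object case but stays
tied to the PG count rather than to the number of objects or disks.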

Kind regards,

Felix

-- 
Felix Schüren
Senior Infrastructure Architect
Host Europe Group - http://www.hosteuropegroup.com/

Mail:   felix.schueren@xxxxxxxxxxxxxxxxxxx
Tel:    +49 2203 1045 7350
Mobile: +49 162 2323 988
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com