Re: ceph reliability in large RBD setups

Hi Felix,
I've been running similar calculations recently, using this tool from
Inktank to model RADOS reliability under different assumptions:
  https://github.com/ceph/ceph-tools/tree/master/models/reliability

But I've also had similar questions about RBD (or any multi-part file
stored in RADOS): naively, a file or device striped across N objects
would be N times less reliable than a single object, since losing any
one of the N objects damages the whole thing. But I hope there's an
error in that logic.
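A quick sanity check of that naive model in Python (the per-object loss
probability here is made up, purely to show the scaling):

import math

# Naive model: a device striped across N objects is lost if any one
# object is lost. With independent per-object loss probability p:
#   P(device loss) = 1 - (1 - p)^N  ~  N * p   for small p
p = 1e-9                                 # illustrative per-object loss prob
for n in (1, 1_000, 250_000):
    loss = -math.expm1(n * math.log1p(-p))   # stable 1 - (1 - p)^n
    print(f"{n:>7} objects: P(loss) = {loss:.3e}  (N*p = {n * p:.3e})")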
Cheers, Dan

On Sat, Dec 7, 2013 at 4:10 PM, Felix Schüren
<felix.schueren@xxxxxxxxxxxxx> wrote:
> Hi,
>
> I am trying to wrap my head around large RBD-on-RADOS clusters and their
> reliability and would love some community feedback.
>
> Firstly, for the RADOS-only case, reliability for a single object should
> be (only looking at node failures, assuming an MTTR of 1 day and a node
> MTBF of 20,000h (~2.3 years)):
>
> an MTBF of 20,000h corresponds to an annualized failure rate of ~35%
> (AFR = 1 - e^(-8,760h/20,000h)); broken down to a daily figure, that
> means every day there is a ~0.097% chance for a single node to break
> down (assuming simplistically that daily failure rate = AFR/365)
>
> My chance of losing all object-holding nodes at the same time for the
> single object case is
> DFR^(number of replicas), so:
> # rep   prob. of total system failure
> 1       0.097171%
> 2       0.000094422%
> 3       0.000000092%
> 4       0.00000000009%
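>
> A quick Python sketch of the same arithmetic (exponential failure
> model; the table above is just this, rounded):
>
> import math
>
> MTBF_H = 20_000              # assumed node MTBF in hours
> HOURS_PER_YEAR = 8_760
>
> # AFR under an exponential failure model: 1 - e^(-8760/MTBF)
> afr = -math.expm1(-HOURS_PER_YEAR / MTBF_H)    # ~35%
> dfr = afr / 365                                # simplistic daily rate
> print(f"AFR = {afr:.1%}, daily failure rate = {dfr:.4%}")
>
> # chance that all replica-holding nodes fail on the same day
> for replicas in range(1, 5):
>     print(f"{replicas} rep: {dfr ** replicas:.12%}")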
>
> (though I think I need to take the number of nodes into account as
> well - the more nodes, the less likely it becomes that the single
> object's peer nodes will crash simultaneously)
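>
> A rough union-bound sketch of that node-count effect (hypothetical
> numbers): the chance that one *specific* replica set fails together is
> DFR^r no matter how many nodes there are; what grows with the node
> count is the number of distinct replica sets CRUSH actually uses,
> capped by the PG count:
>
> from math import comb
>
> p, r, pgs = 0.001, 3, 4_096    # daily node failure prob, replicas, PGs
> for n in (10, 100, 3_000):
>     # each PG maps to one r-node set; at most C(n, r) distinct sets exist
>     sets_in_use = min(pgs, comb(n, r))
>     # union bound: P(some PG loses all replicas in one day) <= sets * p^r
>     print(f"N={n:>5}: P(any set all-down) <= {sets_in_use * p ** r:.2e}")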
>
> That means even on hardware that has a high chance of failure, my
> single objects (when using 3 replicas) should be fine - unsurprisingly,
> seeing as this is one of the design goals for RADOS.
>
> Now, let's bring RBD into play. Using sufficiently large disks (assume
> a 10TB RBD disk size) and the default object size of 4MB, on a 10%
> filled disk (1TB written) we end up with 1TB/4MB = 250,000 objects.
> That means every Ceph OSD node participating in that disk's RBD pool
> holds parts of this disk, so every OSD node failure puts this disk (and
> in fact all RBD disks, since pretty much every RBD disk will have
> objects on every node) at risk of losing blocks. My gut tells me there
> is a much higher risk of data loss in the RBD case than in the
> single-object case, but maybe I am mistaken? Can one of you enlighten
> me with some probability calculation magic? Probably best to start
> with plain RADOS, then move into RBD territory. My fear is that really
> large (3000+ nodes) RBD clusters will become too risky to run, and I
> would love for someone to dispel my fear with math ;)
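>
> To put a (very naive) number on that gut feeling, treating the 250,000
> objects as independent:
>
> import math
>
> p_obj = 0.001 ** 3                 # per-object daily loss, ~DFR^3 above
> k = (1 * 10**12) // (4 * 10**6)    # 1TB written / 4MB objects = 250,000
> p_dev = -math.expm1(k * math.log1p(-p_obj))    # 1 - (1 - p_obj)^k
> print(f"{k} objects: P(disk damaged, per day) ~ {p_dev:.2e}")
>
> # caveat: objects sharing a replica set are lost together, so assuming
> # independence overstates the risk; the union bound above caps it at
> # roughly (number of replica sets in use) * p^r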
>
> Kind regards,
>
> Felix
>
> --
> Felix Schüren
> Senior Infrastructure Architect
> Host Europe Group - http://www.hosteuropegroup.com/
>
> Mail:   felix.schueren@xxxxxxxxxxxxxxxxxxx
> Tel:    +49 2203 1045 7350
> Mobile: +49 162 2323 988




