Hi Felix,

I've been running similar calculations recently. I've been using this
tool from Inktank to calculate RADOS reliabilities with different
assumptions:

https://github.com/ceph/ceph-tools/tree/master/models/reliability

But I've also had similar questions about RBD (or any multi-part files
stored in RADOS) -- naively, a file/device stored in N objects would be
N times less reliable than a single object. But I hope there's an error
in that logic.

Cheers, Dan

On Sat, Dec 7, 2013 at 4:10 PM, Felix Schüren <felix.schueren@xxxxxxxxxxxxx> wrote:
> Hi,
>
> I am trying to wrap my head around large RBD-on-RADOS clusters and their
> reliability and would love some community feedback.
>
> Firstly, for the RADOS-only case, reliability for a single object should
> be (only looking at node failures, assuming a MTTR of 1 day and a node
> MTBF of 20,000h (~2.3 years)):
>
> MTBF 20,000h == annualized failure rate of ~32%, broken down to a daily
> that means every day there is a ~0,09% chance for a single node to break
> down (assuming simplistically that daily failure rate = AFR/365)
>
> My chance of losing all object-holding nodes at the same time for the
> single object case is
> DFR^(number of replica), so:
> # rep   # prob. of total system failure
> 1       0,089033220%
> 2       0,000079269%
> 3       0,000000071%
> 4       0,00000000006%
>
> (though I think I need to take the number of nodes into question as well
> - the more nodes, the less likely it becomes that the single object peer
> nodes will crash simultaneously)
>
> that means even on hardware that has a high chance of failure, my single
> objects (when using 3 replica) should be fine - unsurprisingly, seeing
> as this is one of the design goals for RADOS.
>
> Now, let's take RBD into play. Using sufficiently large disks (assumed
> 10TB RBD disksize) and the default block size of 4MB, on a 10% filled
> disk (1TB written) we end up with 1TB/4MB = 250,000 objects.
> That means that every ceph OSD node participating in that disk's RBD
> pool has parts of this disk, so every OSD node failure means that this
> disk (and actually, all RBD disks since pretty much all of the RBD disks
> will have objects on every node) is now at risk of having blocks lost -
> my gut tells me there is a much higher risk of data loss for the RBD
> case vs the single object case, but maybe I am mistaken? Can one of you
> enlighten me with some probability calculation magic? Probably best to
> start with plain RADOS, then move into RBD territory. My fear is that
> really large (3000+ nodes) RBD clusters will become too risky to run,
> and I would love for someone to dispel my fear with math ;)
>
> Kind regards,
>
> Felix
>
> --
> Felix Schüren
> Senior Infrastructure Architect
> Host Europe Group - http://www.hosteuropegroup.com/
>
> Mail: felix.schueren@xxxxxxxxxxxxxxxxxxx
> Tel: +49 2203 1045 7350
> Mobile: +49 162 2323 988
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
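
For anyone who wants to play with the numbers in this thread, here is a
rough Python sketch of the arithmetic. It is not a model of Ceph itself.
It takes the daily failure rate from the quoted table as given rather
than re-deriving it from the MTBF, and the function names are made up
for illustration. The multi-object step assumes every object is lost
independently, which is a simplification: RADOS groups objects into
placement groups, and objects in the same group share the same OSDs.

```python
import math

def loss_probability(dfr, replicas):
    """Chance that all replica-holding nodes fail on the same day,
    using the quoted mail's model: DFR ** (number of replicas)."""
    return dfr ** replicas

def any_object_lost(p_single, n_objects):
    """P(at least one of n objects is lost) = 1 - (1 - p)^n.
    ASSUMPTION: objects fail independently of each other.
    log1p/expm1 keep precision when p_single is tiny."""
    return -math.expm1(n_objects * math.log1p(-p_single))

# Daily failure rate from the quoted table (0,089033220%).
dfr = 0.00089033220

# Reproduce the replica table.
for replicas in range(1, 5):
    print(replicas, f"{loss_probability(dfr, replicas) * 100:.11f}%")

# 1TB written in 4MB objects (decimal units, as in the quoted mail).
n_objects = 10**12 // (4 * 10**6)

p3 = loss_probability(dfr, 3)
exact = any_object_lost(p3, n_objects)
naive = n_objects * p3  # the "N times less reliable" estimate

print("objects:", n_objects)
print("exact  :", exact)
print("naive  :", naive)
```

Under the independence assumption the exact value and the naive N * p
estimate are nearly identical as long as N * p is small, so the
"N times less reliable" intuition holds for this simplified model.
Whether it holds for a real cluster depends on how objects are actually
placed.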