Re: Erasure Coding failure domain (again)

Hello,

On Wed, 10 Apr 2019 20:09:58 +0200 Paul Emmerich wrote:

> On Wed, Apr 10, 2019 at 11:12 AM Christian Balzer <chibi@xxxxxxx> wrote:
> >
> >
> > Hello,
> >
> > Another thing that crossed my mind aside from failure probabilities caused
> > by actual HDDs dying is of course the little detail that most Ceph
> > installations will have WAL/DB (journal) on SSDs, the most typical
> > ratio being 1:4.  
> 
> Unfortunately, the ratios seen "in the wild" seem to be a lot higher.
> I've seen 1:100 and 1:60, which is obviously a really bad idea. But
> 1:24 is also quite common.
> 
> 1:12 is quite common: 2 NVMe disks in a 24-bay chassis. I think that's
> perfectly reasonable.
>
Given the numbers Hector provided in the mail just after this one (thanks
for that!), I'd be even less inclined to go that high.
The time for recovering 12 large OSDs (w/o service impact) is going to be
_significant_, increasing the likelihood of something else (2 somethings)
going bang in the meantime.
Cluster size of course plays a role here, too.
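
For a rough feel of how likely that is, here's a back-of-the-envelope sketch
(mine, not from the thread; the 2%/year AFR, the 1500-drive fleet taken from
Hector's example, and the one-week recovery window are all assumptions for
illustration):

import math

afr = 0.02            # annualized failure rate per HDD (assumed)
drives = 1500         # remaining fleet size (assumed, from the 1500-OSD example)
recovery_days = 7     # time to backfill 12 large OSDs (assumed)

# Chance a single drive dies inside the window, then the complement of
# "no other drive fails" across the whole fleet.
p_one = 1 - math.exp(-afr * recovery_days / 365)
p_any = 1 - (1 - p_one) ** drives
print("P(another HDD failure during recovery) = %.0f%%" % (p_any * 100))

With those assumed numbers that comes out around 44%, which is why the length
of the recovery window matters so much.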

The highest ratio I ever considered was this baby with 6 NVMes, and it felt
risky on a pure gut level:
https://www.supermicro.com/products/system/4U/6048/SSG-6048R-E1CR60N.cfm

Christian
 
> 
> Paul
> 
> > And given the current thread about compaction killing pure HDD OSDs,
> > something you may _have_ to do.
> >
> > So if you get unlucky and an SSD dies, 4 OSDs are irrecoverably lost, unlike
> > a dead node that can be recovered.
> > Combine that with the background noise of HDDs failing and things just got
> > quite a bit scarier.
> >
> > And if you have a "crap firmware of the week" situation like experienced
> > with several people here, you're even more like to wind up in trouble very
> > fast.
> >
> > This is of course all something people do (or should) know; I'm more
> > wondering how to model it to correctly assess the risks.
> >
> > Christian
> >
> > On Wed, 3 Apr 2019 10:28:09 +0900 Christian Balzer wrote:
> >  
> > > On Tue, 2 Apr 2019 19:04:28 +0900 Hector Martin wrote:
> > >  
> > > > On 02/04/2019 18.27, Christian Balzer wrote:  
> > > > > I took a quick peek at my test cluster (20 OSDs, 5 hosts) and a replica 2
> > > > > pool with 1024 PGs.  
> > > >
> > > > (20 choose 2) is 190, so you're never going to have more than that many
> > > > unique sets of OSDs.
> > > >  
> > > And this is why one shouldn't send mails when in a rush, w/o fully grokking
> > > the math one was just given.
> > > Thanks for setting me straight.
> > >  
> > > > I just looked at the OSD distribution for a replica 3 pool across 48
> > > > OSDs with 4096 PGs that I have and the result is reasonable. There are
> > > > 3782 unique OSD tuples, out of (48 choose 3) = 17296 options. Since this
> > > > is a random process, due to the birthday paradox, some duplicates are
> > > > expected after only the order of 17296^0.5 = ~131 PGs; at 4096 PGs
> > > > having 3782 unique choices seems to pass the gut feeling test. Too lazy
> > > > to do the math in closed form, but here's a quick simulation:
> > > >  
> > > >  >>> import random
> > > >  >>> len(set(random.randrange(17296) for i in range(4096)))
> > > > 3671
> > > >
> > > > So I'm actually slightly ahead.
> > > >
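As an aside, the closed form skipped above is short enough to write out,
assuming CRUSH behaves like uniform random sampling (which it only roughly
does); a minimal sketch:

N = 17296   # (48 choose 3) possible OSD triples
n = 4096    # PGs in the pool
print(round(N * (1 - (1 - 1 / N) ** n)))   # expected distinct triples: ~3647

so the simulated 3671 and the observed 3782 are both in the right ballpark.
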
> > > > At the numbers in my previous example (1500 OSDs, 50k pool PGs),
> > > > statistically you should get something like ~3 collisions on average, so
> > > > negligible.
> > > >  
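The same uniform-sampling assumption gives a quick sanity check on that "~3
collisions" figure (again a sketch, not from the mail):

from math import comb

N = comb(1500, 3)   # ~5.6e8 possible 3-OSD sets
n = 50_000          # pool PGs
print(round(n * (n - 1) / (2 * N), 1))   # expected colliding pairs: ~2.2
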
> > > Sounds promising.
> > >  
> > > > > Another thing to look at here is of course the critical period and disk
> > > > > failure probabilities; these guys explain the logic behind their
> > > > > calculator, and I'd be delighted if you could have a peek and comment.
> > > > >
> > > > > https://www.memset.com/support/resources/raid-calculator/  
> > > >
> > > > I'll take a look tonight :)
> > > >  
> > > Thanks, a look at the Backblaze disk failure rates (picking the worst
> > > ones) gives a good insight into real-life probabilities, too.
> > > https://www.backblaze.com/blog/hard-drive-stats-for-2018/
> > > If we go with 2%/year, that's an average failure every 12 days.
> > >
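The 12-day figure presumably assumes the 1500-OSD example from earlier; the
mail doesn't state the fleet size, so treat that as an assumption:

drives, afr = 1500, 0.02   # fleet size assumed from the earlier example
print(round(365 / (drives * afr), 1))   # ~12.2 days between failures on average
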
> > > Aside from the actual likelihood of failures, another concern of
> > > course is extended periods of the cluster being unhealthy; with certain
> > > versions there was that "mon map will grow indefinitely" issue, and other
> > > more subtle ones might still lurk.
> > >
> > > Christian  
> > > > --
> > > > Hector Martin (hector@xxxxxxxxxxxxxx)
> > > > Public Key: https://mrcn.st/pub
> > > >  
> > >
> > >
> > > --
> > > Christian Balzer        Network/Systems Engineer
> > > chibi@xxxxxxx         Rakuten Communications
> > > _______________________________________________
> > > ceph-users mailing list
> > > ceph-users@xxxxxxxxxxxxxx
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >  
> >
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi@xxxxxxx           Rakuten Communications
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com  
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Rakuten Communications
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


