Re: Erasure Coding failure domain (again)

On Wed, Apr 10, 2019 at 11:12 AM Christian Balzer <chibi@xxxxxxx> wrote:
>
>
> Hello,
>
> Another thing that crossed my mind, aside from failure probabilities caused
> by actual HDDs dying, is of course the little detail that most Ceph
> installations will have WAL/DB (journal) on SSDs, the most typical
> ratio being 1:4.

Unfortunately, the ratios seen "in the wild" seem to be a lot higher.
I've seen 1:100 and 1:60, which are obviously really bad ideas, but
1:24 is also quite common.

1:12 is widespread too: 2 NVMe disks in a 24-bay chassis. I think that's
perfectly reasonable.


Paul

> And given the current thread about compaction killing pure HDD OSDs,
> putting WAL/DB on SSD is something you may _have_ to do.
>
> So if you get unlucky and an SSD dies, 4 OSDs are irrecoverably lost, unlike
> a dead node that can be recovered.
> Combine that with the background noise of HDDs failing, and things just got
> quite a bit scarier.
>
> And if you have a "crap firmware of the week" situation like the one several
> people here have experienced, you're even more likely to wind up in trouble
> very fast.
>
> This is of course all something people do (or should) know; I'm more
> wondering how to model it to correctly assess risks.
>
> Christian
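
One way to start modelling that, as a minimal sketch rather than a real CRUSH
simulation: treat hosts as the failure domain, let each replica-3 PG pick 3
distinct hosts uniformly with one OSD per host, and let one SSD death take a
whole group of OSDs on a single host with it, plus one unrelated HDD failing
during recovery. Every size, ratio and name below is an illustrative
assumption; it ignores CRUSH weights, EC profiles and recovery dynamics.

import random

# Illustrative cluster shape (made-up numbers): 10 hosts, 12 OSDs each,
# one WAL/DB SSD per 4 OSDs, replica-3 pool with 4096 PGs.
HOSTS, OSDS_PER_HOST, RATIO, PGS, TRIALS = 10, 12, 4, 4096, 200

def make_pgs():
    # crude stand-in for CRUSH: 3 distinct hosts, one random OSD on each
    return [[(h, random.randrange(OSDS_PER_HOST))
             for h in random.sample(range(HOSTS), 3)]
            for _ in range(PGS)]

def trial(pgs):
    # one SSD dies and takes its RATIO OSDs on one host down with it...
    host = random.randrange(HOSTS)
    group = random.randrange(OSDS_PER_HOST // RATIO)
    dead = {(host, group * RATIO + i) for i in range(RATIO)}
    # ...and one HDD on a *different* host fails before recovery finishes
    other = random.choice([h for h in range(HOSTS) if h != host])
    dead.add((other, random.randrange(OSDS_PER_HOST)))
    # count PGs left with only a single surviving copy
    return sum(1 for pg in pgs if len(dead & set(pg)) >= 2)

pgs = make_pgs()
hits = [trial(pgs) for _ in range(TRIALS)]
print("PGs down to one copy: avg %.1f, worst %d" % (sum(hits) / TRIALS, max(hits)))

Extending this with EC profiles, realistic recovery times and per-drive
failure rates would be the next step, but even the toy version makes the
shared-SSD blast radius visible.
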
>
> On Wed, 3 Apr 2019 10:28:09 +0900 Christian Balzer wrote:
>
> > On Tue, 2 Apr 2019 19:04:28 +0900 Hector Martin wrote:
> >
> > > On 02/04/2019 18.27, Christian Balzer wrote:
> > > > I did a quick peek at my test cluster (20 OSDs, 5 hosts) and a replica 2
> > > > pool with 1024 PGs.
> > >
> > > (20 choose 2) is 190, so you're never going to have more than that many
> > > unique sets of OSDs.
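
A quick cross-check in Python (assuming the pool uses the usual host failure
domain with 5 hosts of 4 OSDs each; same-host pairs then never occur, so the
real ceiling is a bit below 190):

import math

total_pairs = math.comb(20, 2)       # 190 possible OSD pairs
same_host   = 5 * math.comb(4, 2)    # 30 pairs excluded if replicas can't share a host
print(total_pairs, total_pairs - same_host)   # 190 160
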
> > >
> > And this is why one shouldn't send mails when in a rush, w/o fully grokking
> > the math one was just given.
> > Thanks for setting me straight.
> >
> > > I just looked at the OSD distribution for a replica 3 pool across 48
> > > OSDs with 4096 PGs that I have and the result is reasonable. There are
> > > 3782 unique OSD tuples, out of (48 choose 3) = 17296 options. Since this
> > > is a random process, due to the birthday paradox, some duplicates are
> > > expected after only the order of 17296^0.5 = ~131 PGs; at 4096 PGs
> > > having 3782 unique choices seems to pass the gut feeling test. Too lazy
> > > to do the math closed form, but here's a quick simulation:
> > >
> > >  >>> import random
> > >  >>> len(set(random.randrange(17296) for i in range(4096)))
> > > 3671
> > >
> > > So I'm actually slightly ahead.
> > >
> > > At the numbers in my previous example (1500 OSDs, 50k pool PGs),
> > > statistically you should get something like ~3 collisions on average, so
> > > negligible.
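
For reference, the closed form is the standard occupancy/birthday
expectation: after n uniform draws from N possible tuples, the expected
number of distinct tuples is N*(1 - (1 - 1/N)^n), with roughly n^2/(2N)
collisions when n is much smaller than N. A quick sketch plugging in the
numbers above (the formula is generic, nothing Ceph-specific):

from math import comb

def expected_unique(N, n):
    # expected number of distinct values after n uniform draws from N
    return N * (1 - (1 - 1 / N) ** n)

# 48 OSDs, replica 3, 4096 PGs: 17296 possible tuples
N = comb(48, 3)
print(round(expected_unique(N, 4096)))   # ~3647 (simulation gave 3671, real pool 3782)

# 1500 OSDs, replica 3, 50k PGs
N = comb(1500, 3)
print(round(50_000 - expected_unique(N, 50_000)))   # ~2 collisions, i.e. negligible
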
> > >
> > Sounds promising.
> >
> > > > Another thing to look at here is of course the critical period and disk
> > > > failure probabilities; these guys explain the logic behind their
> > > > calculator, and I'd be delighted if you could have a peek and comment.
> > > >
> > > > https://www.memset.com/support/resources/raid-calculator/
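
For what it's worth, the usual back-of-the-envelope version of that
"critical period" argument (not necessarily the exact model behind the
memset calculator): with a per-drive annualised failure rate and a recovery
window of T hours, the chance that one of the m other drives holding the
now-singly-replicated data also dies before recovery completes is roughly
m * AFR * T / 8760 when everything is small. A sketch with made-up numbers:

def p_fail_within(afr, hours):
    # chance a single drive fails within `hours`, from an annualised rate
    return 1 - (1 - afr) ** (hours / (24 * 365))

def p_second_failure(afr, recovery_hours, other_drives):
    # chance at least one of `other_drives` dies before recovery finishes
    return 1 - (1 - p_fail_within(afr, recovery_hours)) ** other_drives

# toy numbers: 2%/year AFR, 24h to re-replicate, 19 other drives involved
print("%.3f%%" % (100 * p_second_failure(0.02, 24, 19)))   # ~0.105%

Multiplying that by how often a drive failure opens such a window (e.g. the
"one failure every 12 days" figure below) gives a rough yearly risk for a
replica-2 pool.
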
> > >
> > > I'll take a look tonight :)
> > >
> > Thanks, a look at the Backblaze disk failure rates (picking the worst
> > ones) gives a good insight into real-life probabilities, too:
> > https://www.backblaze.com/blog/hard-drive-stats-for-2018/
> > If we go with 2%/year, that's on average one failure every 12 days.
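
For completeness, the arithmetic behind that figure (assuming the 1500-OSD
scale from the earlier example, which is my reading of the context):

drives, afr = 1500, 0.02           # earlier example size, Backblaze-ish AFR
failures_per_year = drives * afr   # ~30 failures/year across the cluster
print(365 / failures_per_year)     # ~12.2 days between failures on average
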
> >
> > Aside from how likely the actual failure rate is, another concern of
> > course is extended periods of the cluster being unhealthy; with certain
> > versions there was that "mon map will grow indefinitely" issue, and other
> > more subtle ones might still lurk.
> >
> > Christian
> > > --
> > > Hector Martin (hector@xxxxxxxxxxxxxx)
> > > Public Key: https://mrcn.st/pub
> > >
> >
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi@xxxxxxx         Rakuten Communications
> >
>
>
> --
> Christian Balzer        Network/Systems Engineer
> chibi@xxxxxxx           Rakuten Communications
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


