Re: Failure probability with largish deployments

Hello,

On Thu, 19 Dec 2013 12:12:13 +0100 Mariusz Gronczewski wrote:

> On 2013-12-19, at 17:39:54,
> Christian Balzer <chibi@xxxxxxx> wrote:
[snip]
> 
> 
> > So am I completely off my wagon here? 
> > How do people deal with this when potentially deploying hundreds of
> > disks in a single cluster/pool?
> > 
> > I mean, when we get to 600 disks (and that's just one rack full, OK,
> > maybe 2 due to load and other issues ^o^) of those 4U 60-disk storage
> > servers (or 72 disks per 4U if you're happy with killing another drive
> > when replacing a faulty one in that Supermicro contraption), that
> > ratio is down to 1 in 21.6, which is way worse than the 8-disk RAID5 I
> > mentioned up there.
> > 
> 
> That problem will only occur if you really want to have all those 600
> disks in one pool and it so happens that 3 drives in different servers
> unrecoverably die within the same very short time interval, which is
> unlikely.
The likelihood of that is exactly what that calculator computes; more
refined studies and formulas can be found on the web as well.
And as Gregory acknowledged, with large pools that probability becomes
significant.
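
For illustration, here is a quick back-of-the-envelope version of the
kind of math that calculator does (the 4% AFR and 24-hour recovery
window below are my assumptions, not numbers from this thread):

from math import comb

N = 600            # disks in the pool
AFR = 0.04         # assumed annual failure rate per disk
WINDOW_H = 24.0    # assumed time to restore redundancy, in hours

p = AFR * WINDOW_H / (365 * 24)        # per-disk failure chance per window

# Binomial tail: P(at least 3 of N disks fail within the same window)
p3 = 1.0 - sum(comb(N, k) * p**k * (1 - p)**(N - k) for k in range(3))

windows = 365 * 24 / WINDOW_H          # windows per year
print(f"per-window P(>=3 failures): {p3:.2e}")
print(f"rough annual P: {1 - (1 - p3) ** windows:.2e}")

With those numbers you get on the order of a 1-2% chance per year of
seeing 3 overlapping failures, which is not something I'd want to wave
away at scale.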

> But with 60 disks per enclosure, RAIDing them into 4-5 groups probably
> makes more sense than running 60 OSDs, if only from a memory/CPU usage
> standpoint.
> 
Yup, I would do that as well, if I were to deploy such a massive system.

> From my experience disks rarely "just die"; more often they either
> start to develop bad blocks and write errors, or performance degrades
> and they start spewing media errors (which usually means you can
> recover 90%+ of the data if you need to, without using anything more
> than ddrescue). Which means Ceph can access most of the data for
> recovery and only has to recover those few missing blocks.
> 
I would certainly agree that with most disks you can see them becoming
marginal by watching SMART output. Unfortunately that is quite a job to
monitor with many disks, and most of the time the on-disk SMART
algorithm will NOT trigger an impending-failure status at all, or not
with sufficient warning time. And some disks really do just drop dead,
without any warning at all.
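
If one were to automate that monitoring job, something along these
lines would be a starting point (a rough sketch; it assumes
smartmontools 7.0+ for the "-j" JSON output, and the device list and
watched attributes are just placeholders):

import json, subprocess

WATCHED = {5: "Reallocated_Sector_Ct", 197: "Current_Pending_Sector"}

def check(dev):
    out = subprocess.run(["smartctl", "-A", "-j", dev],
                         capture_output=True, text=True).stdout
    table = json.loads(out).get("ata_smart_attributes", {}).get("table", [])
    for attr in table:
        if attr["id"] in WATCHED and attr["raw"]["value"] > 0:
            print(f"{dev}: {attr['name']} raw={attr['raw']['value']}")

for dev in ["/dev/sda", "/dev/sdb"]:   # enumerate your 60 drives here
    check(dev)

But as said, that only catches the disks that degrade gracefully, not
the ones that drop dead.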

> Each pool consists of many PGs, and for a PG to fail all of its disks
> have to be hit, so in the worst case you will most likely just lose
> access to a small part of the data (the PGs that, out of 600 disks,
> happened to be on those 3), not everything that is on the given array.
> 
Yes, this is the one thing I'm still not 100% sure about in my
understanding of Ceph.
In my scenario (VM volumes/images of 50GB to 2TB in size), I would
assume them to be striped in such a way that there is more than a small
impact from a triple failure.
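
To put rough numbers on both points (all assumptions mine: size=3
replication, 32768 PGs, uniform-random placement rather than real CRUSH
failure domains, 4MB RBD objects):

from math import comb

N   = 600        # OSDs in the pool
PGS = 32768      # assumed PG count; real pools vary

# With size=3, each PG lives on one triple of OSDs. Treating placement
# as uniform-random, the chance that 3 dead OSDs happen to hold every
# replica of some PG:
p_lost = 1 - (1 - 1 / comb(N, 3)) ** PGS
print(f"P(some PG entirely on the 3 dead OSDs): {p_lost:.2e}")

# A 2TB RBD image striped into 4MB objects spread over all PGs: one
# lost PG takes a thin slice out of nearly every large image.
objects = 2 * 1024 ** 2 // 4           # 4MB objects in a 2TB image
per_pg = objects / PGS
print(f"{objects} objects/image, ~{per_pg:.0f} of them in any one PG")

So the quoted point holds in terms of bytes lost, but the flip side is
that a single lost PG takes a small bite out of nearly every large
image, which can be quite enough to corrupt the filesystems on them.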

> And again, that's only in case those disks die at exactly the same
> moment, with no time to recover. Even 60 minutes between failures will
> let most of the data replicate. And in the worst case, there is always
> a data recovery service. And backups.
> 
My design goal for all my systems is that backups are for times when
people do stupid things, as in deleting things they shouldn't have. The
actual storage should be reliable enough to survive anything reasonably
expected to happen.
Also, on what kind of storage server would you put backups for 60TB, my
planned initial capacity? Another Ceph cluster? ^o^
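
As a sanity check on the quoted 60-minute figure, here is a rough
estimate of how quickly a cluster this size could restore redundancy
after losing one OSD (all numbers below are my assumptions, and real
recovery is throttled in favour of client I/O, so expect worse):

TB = 10 ** 12
osd_size     = 4 * TB        # bytes to re-replicate (assumption)
peers        = 599           # disks sharing the recovery work
per_disk_bps = 20 * 10 ** 6  # assumed recovery traffic per disk, B/s

seconds = osd_size / (peers * per_disk_bps)
print(f"~{seconds / 60:.1f} minutes to restore redundancy")

A few minutes in theory, so the 60-minute point is plausible; it is the
truly simultaneous failures that the probability math above is about.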

Christian
-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com