On 2013-12-19 at 17:39:54 Christian Balzer <chibi@xxxxxxx> wrote:

> Hello,
>
> In my "Sanity check" thread I postulated yesterday that to get the
> same redundancy and resilience for disk failures (excluding other
> factors) as my proposed setup (2 nodes, 2x 11 3TB HDs RAID6 per node,
> 2 global hotspares, thus 4 OSDs) the "Ceph way" one would need
> something like 6 nodes with 10 3TB HDs each, 3-way replication (to
> protect against dual disk failures) to get similar capacity, and a
> 7th identical node to allow for node failure/maintenance.
>
> That was basically based on me thinking "must not get caught by a dual
> disk failure ever again", as that happened twice to me, once with a
> RAID5 and the expected consequences, once with a RAID10 where I got
> lucky (8 disks total each time).

The thing is, in the default config each copy of the data is on a different physical machine, to allow for maintenance and hardware failures. In that case, losing 3 disks in one node is much better in a 6-node cluster than in a 2-node cluster: the data transfer needed for recovery is only 1/6th of your dataset, and the time to recovery is much shorter, as you need to read only 3 TB of data from the whole cluster, not 3 TB x 9 disks as with RAID6.

The first setup saves you from "3 disks in different machines are dead" at the cost of much of your I/O and a long recovery time. The second setup has the potential to recover much quicker, as it only needs to transfer 3 TB of data per disk failure to get back to a clean state, compared to 3 TB x 9 per RAID disk. The impact of one dead node is also vastly lower.

Basically, the first case is better when the disks drop dead at exactly the same time; the second is better when the disks fail within a few hours of each other.

> So am I completely off my wagon here?
> How do people deal with this when potentially deploying hundreds of
> disks in a single cluster/pool?
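To put rough numbers on the recovery-traffic argument above, here is an illustrative Python sketch using the figures from this thread (3 TB disks, the "3 TB x 9" RAID6 rebuild read volume); it is back-of-the-envelope arithmetic, not output from any Ceph tool:

```python
# Back-of-the-envelope sketch (thread's numbers, not measured) of how
# much data must be read to recover from a single 3 TB disk failure.

DISK_TB = 3

# RAID6 rebuild: the controller reads the surviving members of the
# array to reconstruct the failed disk; the thread counts this as
# 3 TB x 9 for the 11-disk arrays discussed above.
RAID6_SURVIVORS_READ = 9
raid6_read_tb = DISK_TB * RAID6_SURVIVORS_READ

# Ceph: only the ~3 TB that lived on the dead OSD is re-replicated,
# and those reads are spread over every OSD holding another copy, so
# no single disk carries the whole rebuild load.
ceph_read_tb = DISK_TB

print(f"RAID6 rebuild: ~{raid6_read_tb} TB read from one array")
print(f"Ceph recovery: ~{ceph_read_tb} TB read, spread across the cluster")
```

So per failed disk the RAID6 array reads 9x the data, all from the same handful of spindles, which is where the long rebuild times and I/O impact come from.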
> I mean, when we get to 600 disks (and that's just one rack full, OK,
> maybe 2 due to load and other issues ^o^) of those 4U 60 disk storage
> servers (or 72 disks per 4U if you're happy with killing another drive
> when replacing a faulty one in that Supermicro contraption), that
> ratio is down to 1 in 21.6, which is way worse than that 8-disk RAID5
> I mentioned up there.

That problem will only occur if you really want to have all those 600 disks in one pool and it so happens that 3 drives in different servers die unrecoverably within the same very short interval, which is unlikely. That said, with 60 disks per enclosure, RAIDing them into 4-5 groups probably makes more sense than running 60 OSDs, if only from a memory/CPU-usage standpoint.

In my experience disks rarely "just die". Usually a disk either starts developing bad blocks and write errors, or its performance degrades and it starts spewing media errors (which usually means you can recover 90%+ of its data if you need to, with nothing more than ddrescue). That means Ceph can still read most of the data for recovery and only has to re-create the few missing blocks.

Each pool consists of many PGs, and for a PG to fail, every disk holding one of its copies has to be hit. So in the worst case you will most likely just lose access to a small part of the data (the PGs that, out of 600 disks, happened to land on exactly those 3), not everything in the pool. And again, that is only if those disks die at exactly the same moment, with no time to recover; even 60 minutes between failures will let most of the data replicate.

And in the worst case, there is always a data recovery service. And backups.

-- 
Mariusz Gronczewski, Administrator

Efigence Sp. z o. o.
ul. Wołoska 9a, 02-583 Warszawa
T: [+48] 22 380 13 13
F: [+48] 22 380 13 14
E: mariusz.gronczewski@xxxxxxxxxxxx
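PS. The PG overlap argument can be illustrated with a quick simulation. This is a hypothetical sketch (the PG count and uniform random placement are my assumptions; real CRUSH additionally keeps replicas on different hosts, which makes triple-failure overlap even rarer than this estimates):

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

NUM_OSDS = 600     # disks in the hypothetical pool
NUM_PGS = 20_000   # illustrative PG count, not a sizing recommendation
REPLICAS = 3

# Uniform random placement of each PG's replicas across the OSDs;
# CRUSH's host-level separation only reduces the overlap further.
pgs = [frozenset(random.sample(range(NUM_OSDS), REPLICAS))
       for _ in range(NUM_PGS)]

# Kill 3 random disks "at exactly the same moment".
dead = frozenset(random.sample(range(NUM_OSDS), 3))

# A PG is lost only if *all* of its replicas sat on the dead disks.
lost = sum(1 for pg in pgs if pg <= dead)
print(f"PGs with all {REPLICAS} copies on the 3 dead disks: {lost} of {NUM_PGS}")
```

With 3 replicas, a PG is lost only when its replica set is exactly the 3 dead disks, so the expected number of lost PGs is NUM_PGS / C(600,3), roughly 20000 / 35,820,200, about 0.0006 -- essentially zero even before any recovery window is taken into account.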
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com