On 2013-12-19 at 17:39:54 Christian Balzer <chibi@xxxxxxx> wrote:

> Hello,
>
> In my "Sanity check" thread I postulated yesterday that to get the
> same redundancy and resilience for disk failures (excluding other
> factors) as my proposed setup (2 nodes, 2x 11 3TB HDs RAID6 per node,
> 2 global hotspares, thus 4 OSDs) the "Ceph way" one would need
> something like 6 nodes with 10 3TB HDs each, 3-way replication (to
> protect against dual disk failures) to get similar capacity, and a
> 7th identical node to allow for node failure/maintenance.
>
> That was basically based on me thinking "must not get caught by a dual
> disk failure ever again", as that happened twice to me, once with a
> RAID5 and the expected consequences, once with a RAID10 where I got
> lucky (8 disks total each time).

The thing is, in the default config each copy of the data is on a different physical machine, to allow for maintenance and hardware failures. In that case, losing 3 disks in one node is much better in a 6-node cluster than in a 2-node cluster: the data transfer needed for recovery is only 1/6th of your dataset, and the time to recovery is much shorter, as you need to read only 3 TB of data from the whole cluster, not 3 TB x 9 disks as with RAID6.

The first setup saves you from "3 disks in different machines are dead" at the cost of much of your I/O and a long recovery time. The second setup has the potential to recover much quicker, as it only needs to transfer 3 TB of data per disk failure to get back to a clean state, compared to 3 TB x 9 per RAID disk. The impact of one dead node is also vastly lower.

Basically, the first case is better when the disks drop dead at exactly the same time; the second is better when the disks fail within a few hours of each other.

> So am I completely off my wagon here?
> How do people deal with this when potentially deploying hundreds of
> disks in a single cluster/pool?
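To put rough numbers on the recovery-traffic argument above, here is an illustrative Python sketch using the figures from this thread (3 TB disks, the "3 TB x 9" RAID6 rebuild read volume); it is back-of-the-envelope arithmetic, not output from any Ceph tool:

```python
# Back-of-the-envelope sketch (thread's numbers, not measured) of how
# much data must be read to recover from a single 3 TB disk failure.

DISK_TB = 3

# RAID6 rebuild: the controller reads the surviving members of the
# array to reconstruct the failed disk; the thread counts this as
# 3 TB x 9 for the 11-disk arrays discussed above.
RAID6_SURVIVORS_READ = 9
raid6_read_tb = DISK_TB * RAID6_SURVIVORS_READ

# Ceph: only the ~3 TB that lived on the dead OSD is re-replicated,
# and those reads are spread over every OSD holding another copy, so
# no single disk carries the whole rebuild load.
ceph_read_tb = DISK_TB

print(f"RAID6 rebuild: ~{raid6_read_tb} TB read from one array")
print(f"Ceph recovery: ~{ceph_read_tb} TB read, spread across the cluster")
```

So per failed disk the RAID6 array reads 9x the data, all from the same handful of spindles, which is where the long rebuild times and I/O impact come from.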
> I mean, when we get to 600 disks (and that's just one rack full, OK,
> maybe 2 due to load and other issues ^o^) of those 4U 60 disk storage
> servers (or 72 disks per 4U if you're happy with killing another drive
> when replacing a faulty one in that Supermicro contraption), that
> ratio is down to 1 in 21.6, which is way worse than that 8-disk RAID5
> I mentioned up there.

That problem will only occur if you really want to have all those 600 disks in one pool and it so happens that 3 drives in different servers die unrecoverably within the same very short interval, which is unlikely. That said, with 60 disks per enclosure, RAIDing them into 4-5 groups probably makes more sense than running 60 OSDs, if only from a memory/CPU-usage standpoint.

In my experience disks rarely "just die". Usually a disk either starts developing bad blocks and write errors, or its performance degrades and it starts spewing media errors (which usually means you can recover 90%+ of its data if you need to, with nothing more than ddrescue). That means Ceph can still read most of the data for recovery and only has to re-create the few missing blocks.

Each pool consists of many PGs, and for a PG to fail, every disk holding one of its copies has to be hit. So in the worst case you will most likely just lose access to a small part of the data (the PGs that, out of 600 disks, happened to land on exactly those 3), not everything in the pool. And again, that is only if those disks die at exactly the same moment, with no time to recover; even 60 minutes between failures will let most of the data replicate.

And in the worst case, there is always a data recovery service. And backups.

-- 
Mariusz Gronczewski, Administrator

Efigence Sp. z o. o.
ul. Wołoska 9a, 02-583 Warszawa
T: [+48] 22 380 13 13
F: [+48] 22 380 13 14
E: mariusz.gronczewski@xxxxxxxxxxxx
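PS. The PG overlap argument can be illustrated with a quick simulation. This is a hypothetical sketch (the PG count and uniform random placement are my assumptions; real CRUSH additionally keeps replicas on different hosts, which makes triple-failure overlap even rarer than this estimates):

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

NUM_OSDS = 600     # disks in the hypothetical pool
NUM_PGS = 20_000   # illustrative PG count, not a sizing recommendation
REPLICAS = 3

# Uniform random placement of each PG's replicas across the OSDs;
# CRUSH's host-level separation only reduces the overlap further.
pgs = [frozenset(random.sample(range(NUM_OSDS), REPLICAS))
       for _ in range(NUM_PGS)]

# Kill 3 random disks "at exactly the same moment".
dead = frozenset(random.sample(range(NUM_OSDS), 3))

# A PG is lost only if *all* of its replicas sat on the dead disks.
lost = sum(1 for pg in pgs if pg <= dead)
print(f"PGs with all {REPLICAS} copies on the 3 dead disks: {lost} of {NUM_PGS}")
```

With 3 replicas, a PG is lost only when its replica set is exactly the 3 dead disks, so the expected number of lost PGs is NUM_PGS / C(600,3), roughly 20000 / 35,820,200, about 0.0006 -- essentially zero even before any recovery window is taken into account.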
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com