> Yeah, I saw erasure encoding mentioned a little while ago, but that's
> likely not to be around by the time I'm going to deploy things.
> Nevermind that super bleeding edge isn't my style when it comes to
> production systems. ^o^
> And at something like 600 disks, that would still have to be a mighty high
> level of replication to combat failure statistics...

Not sure if I understand correctly, but it looks like the current behaviour is a RAID 01 kind of solution: each failure domain acts as a RAID 0 stripe, and that stripe is mirrored across X replicas.
With a replication count of 3 you could be unlucky and have 3 disks fail at the same time, one in each of three failure domains, that together hold all copies of some PG. If you have enough disks in the cluster, chances are this will happen at some point. (A rough numbers sketch is below my sig.)

It would make sense to also be able to create a RAID 10 kind of solution, where disk 1 in failure domain 1 has the same content as disk 1 in failure domain 2 and in domain 3. The PGs that are on one OSD would then be exactly mirrored to one specific OSD in each other failure domain. This would require more uniform hardware and you lose flexibility, but you win a lot of reliability.

Without knowing anything about the code base, I *think* it should be pretty trivial to change the code to support this, and it would be a very small change compared to erasure coding.

( I looked a bit at the CRUSH map bucket types, but it *seems* that all bucket types will still stripe the PGs across all nodes within a failure domain; see the example rule below. )

Cheers,
Robert van Leeuwen
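
P.S. To put some rough numbers on the "with enough disks it will happen" part, the sketch below just counts how many distinct 3-disk combinations can destroy data under the two placement styles. The figures (600 OSDs, 3 failure domains, ~100 PGs per OSD) are assumptions of mine, not anything taken from Ceph itself.

# Back-of-the-envelope comparison: how many distinct "one disk per
# failure domain" triples are fatal under spread placement (every PG
# picks its own OSD in each domain) versus strict pairing (disk i is
# mirrored to disk i in the other domains).
osds_total = 600
domains = 3
per_domain = osds_total // domains            # 200 OSDs per failure domain
pgs_per_osd = 100                             # assumed, typical ballpark
replicas = 3

# Total PGs in the pool; each PG lives on `replicas` different OSDs.
total_pgs = osds_total * pgs_per_osd // replicas    # 20,000

# Spread placement: every PG is its own independent triple of OSDs, so
# up to one distinct fatal 3-disk set per PG (capped by the number of
# possible cross-domain triples).
fatal_spread = min(total_pgs, per_domain ** 3)

# Paired placement: all PGs on disk i sit on disk i of the other two
# domains, so only the 200 matched triples can lose data.
fatal_paired = per_domain

print("fatal 3-disk sets, spread:", fatal_spread)   # 20000
print("fatal 3-disk sets, paired:", fatal_paired)   # 200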
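
P.P.S. For reference, this is roughly what a default replicated rule looks like in a decompiled CRUSH map (the root name and the choice of rack as failure domain are just my example). The chooseleaf step picks an independent OSD under a different rack for every single PG, which is the stripe-across-all-nodes behaviour I mean above; I don't see a step type that would pin PGs to fixed disk pairs.

rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type rack
        step emit
}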