It really sounds like you're looking for a better RAID system, not a distributed storage system.
I've been using ZFS on FreeBSD for years. The Linux port meets nearly all of your needs, while acting more like a conventional software RAID. Btrfs also has a lot of these features, but I'm not familiar enough with it to advocate for it.

> I feel that Ceph is better than mdraid because:
> 1) When ceph cluster is far from being full, 'rebuilding' will be much faster vs mdraid

ZFS only rebuilds allocated parts of the disk, same as Ceph.

> 2) You can easily change the number of replicas

This is not as straightforward, but it is available. ZFS gives several different RAID-like levels, and it lets you control the number of copies you keep on disk. So you can create something that looks like RAID10 (stripes of mirrors), or a RAID5+. With 6 disks, I'd go RAIDZ-2 (2 parity disks, for ~12TB usable). RAIDZ-2 is [...] than RAID10-like (in my PostgreSQL benchmarks, YMMV), and safer. With 2 parity disks, you'd have to lose 3 disks to lose data. Just keep in mind that ZFS is not RAID, just RAID-like. I still call the volumes a RAID10 or RAID5, but the analogy breaks down below the volume level.

If you have really important data, you can also tell it to keep 2 (or more) copies of each block, regardless of the type of RAID. You set that replica policy per filesystem (the copies property), not per file.

> 3) When multiple disks have bad sectors, I suspect ceph will be much easier to recover data from than from mdraid which will simply never finish rebuilding.

ZFS checksums every block. If you're using RAID10-like, it will recover blocks that failed the checksum from the mirror. If you're using RAID5-like, it will rebuild from parity. Because it has a checksum of every block, it only rebuilds the failed ones. It does have to checksum every block to find the failed ones though. My 10TB volumes take about 12 hours to replace a failed 2TB disk.

> 4) If we need to migrate data over to a different server with no downtime, we just add more OSDs, wait, and then remove the old ones :-)

ZFS snapshot && ZFS send. It's not completely online, but I've moved 5TB to a new server with a 5 minute outage window (pre-copy all the data, shut down, send a final snapshot, flip the clients to the new server).

If you can't tell, I'm a big fan of ZFS. I'm hoping to run my dev Ceph cluster on ZFS soon.

> This will be a single server configuration, the goal is to replace mdraid, hence I tried to use localhost (nothing more will be added to the cluster). Are you saying it will be less fault tolerant than a RAID-10?
>
> Ceph is a distributed object store. If you stay within a single machine, keep using a local RAID solution (hardware or software). Why would you want to make this switch?
>
> I do not think RAID-10 on 6 3TB disks is going to be reliable at all. I have simulated several failures, and it looks like a rebuild will take a lot of time. Funnily, during one of these experiments, another drive failed, and I had lost the entire array. Good luck recovering from that...
>
> I feel that Ceph is better than mdraid because:
> 1) When ceph cluster is far from being full, 'rebuilding' will be much faster vs mdraid
> 2) You can easily change the number of replicas
> 3) When multiple disks have bad sectors, I suspect ceph will be much easier to recover data from than from mdraid which will simply never finish rebuilding.
> 4) If we need to migrate data over to a different server with no downtime, we just add more OSDs, wait, and then remove the old ones :-)
>
> This is my initial observation though, so please correct me if I am wrong.
>
> Dmitry
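
As a rough check on the 6-disk comparison in the reply above (RAID10-like stripes of mirrors vs RAIDZ-2 on 3TB disks), here is a minimal Python sketch of the arithmetic. The function names are made up for illustration, and it assumes equal-sized disks and ignores ZFS metadata and padding overhead, so real usable space will be somewhat lower.

```python
# Back-of-the-envelope layout comparison for 6 x 3TB disks.
# Function names are hypothetical; ZFS metadata/padding overhead is ignored.

def mirror_stripe(disks, size_tb, way=2):
    """RAID10-like pool: data striped across `way`-way mirrors.

    Returns (usable_tb, failures_always_survived).
    """
    if disks % way:
        raise ValueError("disk count must be a multiple of the mirror width")
    usable = disks // way * size_tb
    # Losing every disk of one mirror loses the pool, so only
    # `way - 1` failures are guaranteed to be survivable.
    return usable, way - 1


def raidz(disks, size_tb, parity=2):
    """RAIDZ-like pool with `parity` disks' worth of parity (1 to 3).

    Returns (usable_tb, failures_always_survived).
    """
    if not 1 <= parity <= 3 or disks <= parity:
        raise ValueError("need 1-3 parity disks and more disks than parity")
    usable = (disks - parity) * size_tb
    return usable, parity


layouts = {
    "RAID10-like": mirror_stripe(6, 3),
    "RAIDZ-2": raidz(6, 3),
}
for name, (usable, safe) in layouts.items():
    print(f"{name}: {usable}TB usable, survives any {safe} disk failure(s)")
```

Under these assumptions the RAID10-like layout gives 9TB and is only guaranteed to survive one failure, while RAIDZ-2 gives 12TB and survives any two, which lines up with the "~12TB usable" figure and the "you'd have to lose 3 disks to lose data" remark.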
_______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com