Re: Ceph instead of RAID

It really sounds like you're looking for a better RAID system, not a distributed storage system.

I've been using ZFS on FreeBSD for years.  The Linux port meets nearly all of your needs, while acting more like a conventional software RAID.  Btrfs also has a lot of these features, but I'm not familiar enough with it to advocate for it.


> I feel that Ceph is better than mdraid because:
> 1) When ceph cluster is far from being full, 'rebuilding' will be much faster vs mdraid
ZFS only rebuilds allocated parts of the disk, same as Ceph.
> 2) You can easily change the number of replicas
This is not as straightforward, but it is available.  ZFS gives you several different RAID-like levels, and it lets you control the number of copies you keep on disk.  So you can create something that looks like RAID10 (stripes of mirrors), or RAID5 and up (RAIDZ with 1-3 parity disks).  With 6 disks, I'd go RAIDZ-2 (2 parity disks, for ~12TB usable).  RAIDZ-2 is faster than the RAID10-like layout (in my PostgreSQL benchmarks, YMMV), and safer: with 2 parity disks, you'd have to lose 3 disks to lose data.  Just keep in mind that ZFS is not RAID, just RAID-like.  I still call the volumes RAID10 or RAID5, but the analogy breaks down below the volume level.
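
For illustration, a minimal sketch of both layouts (the pool name "tank" and the da0-da5 device names are placeholders; substitute your own disks, e.g. /dev/sd* on Linux):

    # 6-disk RAIDZ-2: any two disks can fail before you lose data (~12TB usable on 3TB disks)
    zpool create tank raidz2 da0 da1 da2 da3 da4 da5

    # RAID10-like alternative: a stripe of three mirrored pairs (~9TB usable)
    # zpool create tank mirror da0 da1 mirror da2 da3 mirror da4 da5

    # Check the layout and health
    zpool status tank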

If you have really important data, you can also tell it to keep 2 (or more) copies of every block, regardless of the type of RAID.  That replica policy is set per filesystem (dataset), and it applies to data written after it's set.
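
A quick sketch of that knob (the tank/important dataset is made up for the example):

    # Create a filesystem for the critical data and keep 2 copies of every block,
    # on top of whatever redundancy the pool itself already provides
    zfs create tank/important
    zfs set copies=2 tank/important

    # Confirm the setting; it applies to data written from this point on
    zfs get copies tank/important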

> 3) When multiple disks have bad sectors, I suspect ceph will be much easier to recover data from than from
> mdraid which will simply never finish rebuilding.
ZFS checksums every block.  If you're using the RAID10-like layout, it will recover blocks that fail the checksum from the mirror.  If you're using the RAID5-like layout, it will rebuild them from parity.  Because it has a checksum of every block, it only rebuilds the ones that failed, although it does have to checksum every allocated block to find them.  My 10TB volume takes about 12 hours to replace a failed 2TB disk.
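
Roughly what that looks like in practice (pool and device names are hypothetical):

    # Walk every allocated block, verify its checksum, and repair from mirror/parity
    zpool scrub tank

    # Replace a failed disk; only allocated blocks are resilvered onto the new one
    zpool replace tank da3 da6

    # Watch scrub/resilver progress and per-device read/write/checksum error counters
    zpool status -v tank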

> 4) If we need to migrate data over to a different server with no downtime, we just add more OSDs, wait, and
> then remove the old ones :-)
zfs snapshot && zfs send.  It's not completely online, but I've moved 5TB to a new server with a 5-minute outage window (pre-copy all the data, shut down, send a final snapshot, flip the clients to the new server).
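
A sketch of that procedure, assuming a dataset called tank/data and a destination host "newhost" (both made up):

    # 1) Pre-copy the bulk of the data while the old server is still serving clients
    zfs snapshot tank/data@pre
    zfs send tank/data@pre | ssh newhost zfs receive tank/data

    # 2) In the outage window: stop writers, then send only the blocks that changed
    zfs snapshot tank/data@final
    zfs send -i @pre tank/data@final | ssh newhost zfs receive -F tank/data

    # 3) Flip the clients over to newhost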



If you can't tell, I'm a big fan of ZFS.  I'm hoping to run my dev Ceph cluster on ZFS soon.


Craig Lewis
Senior Systems Engineer
Office +1.714.602.1309
Email clewis@xxxxxxxxxxxxxxxxxx

Central Desktop. Work together in ways you never thought possible.

On 8/13/13 00:47, Dmitry Postrigan wrote:
This will be a single server configuration, the goal is to replace mdraid, hence I tried to use localhost
(nothing more will be added to the cluster). Are you saying it will be less fault tolerant than a RAID-10?

Ceph is a distributed object store. If you stay within a single machine,
keep using a local RAID solution (hardware or software).

Why would you want to make this switch?
I do not think RAID-10 on six 3TB disks is going to be reliable at all. I have simulated several failures, and
it looks like a rebuild will take a lot of time. Funnily enough, during one of these experiments another drive
failed and I lost the entire array. Good luck recovering from that...

I feel that Ceph is better than mdraid because:
1) When ceph cluster is far from being full, 'rebuilding' will be much faster vs mdraid
2) You can easily change the number of replicas
3) When multiple disks have bad sectors, I suspect ceph will be much easier to recover data from than from
mdraid which will simply never finish rebuilding.
4) If we need to migrate data over to a different server with no downtime, we just add more OSDs, wait, and
then remove the old ones :-)

This is my initial observation though, so please correct me if I am wrong.

Dmitry

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

