Re: Ceph instead of RAID

It really sounds like you're looking for a better RAID system, not a distributed storage system.

I've been using ZFS on FreeBSD for years.  The Linux port meets nearly all of your needs, while acting more like a conventional software RAID.  Btrfs also has a lot of these features, but I'm not familiar enough with it to advocate for it.


> I feel that Ceph is better than mdraid because:
> 1) When ceph cluster is far from being full, 'rebuilding' will be much faster vs mdraid
ZFS only rebuilds allocated parts of the disk, same as Ceph.
> 2) You can easily change the number of replicas
This is not as straightforward, but it is available.  ZFS gives you several different RAID-like levels, and it lets you control the number of copies you keep on disk.  So you can create something that looks like RAID10 (stripes of mirrors), or RAID5 and up (RAIDZ with 1-3 parity disks).  With 6 disks, I'd go RAIDZ-2 (2 parity disks, for ~12TB usable).  RAIDZ-2 is faster than the RAID10-like layout (in my PostgreSQL benchmarks, YMMV), and safer: with 2 parity disks, you'd have to lose 3 disks to lose data.  Just keep in mind that ZFS is not RAID, just RAID-like.  I still call the volumes RAID10 or RAID5, but the analogy breaks down below the volume level.
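
For illustration, a minimal sketch of both layouts (the pool name "tank" and the da0-da5 device names are placeholders; substitute your own disks, e.g. /dev/sd* on Linux):

    # 6-disk RAIDZ-2: any two disks can fail before you lose data (~12TB usable on 3TB disks)
    zpool create tank raidz2 da0 da1 da2 da3 da4 da5

    # RAID10-like alternative: a stripe of three mirrored pairs (~9TB usable)
    # zpool create tank mirror da0 da1 mirror da2 da3 mirror da4 da5

    # Check the layout and health
    zpool status tank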

If you have really important data, you can also tell it to keep 2 (or more) copies of every block, regardless of the type of RAID.  That replica policy is set per filesystem (dataset), and it applies to data written after it's set.
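
A quick sketch of that knob (the tank/important dataset is made up for the example):

    # Create a filesystem for the critical data and keep 2 copies of every block,
    # on top of whatever redundancy the pool itself already provides
    zfs create tank/important
    zfs set copies=2 tank/important

    # Confirm the setting; it applies to data written from this point on
    zfs get copies tank/important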

> 3) When multiple disks have bad sectors, I suspect ceph will be much easier to recover data from than from
> mdraid which will simply never finish rebuilding.
ZFS checksums every block.  If you're using the RAID10-like layout, it will recover blocks that fail the checksum from the mirror.  If you're using the RAID5-like layout, it will rebuild them from parity.  Because it has a checksum of every block, it only rebuilds the ones that failed, although it does have to checksum every allocated block to find them.  My 10TB volume takes about 12 hours to replace a failed 2TB disk.
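
Roughly what that looks like in practice (pool and device names are hypothetical):

    # Walk every allocated block, verify its checksum, and repair from mirror/parity
    zpool scrub tank

    # Replace a failed disk; only allocated blocks are resilvered onto the new one
    zpool replace tank da3 da6

    # Watch scrub/resilver progress and per-device read/write/checksum error counters
    zpool status -v tank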

> 4) If we need to migrate data over to a different server with no downtime, we just add more OSDs, wait, and
> then remove the old ones :-)
zfs snapshot && zfs send.  It's not completely online, but I've moved 5TB to a new server with a 5-minute outage window (pre-copy all the data, shut down, send a final snapshot, flip the clients to the new server).
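
A sketch of that procedure, assuming a dataset called tank/data and a destination host "newhost" (both made up):

    # 1) Pre-copy the bulk of the data while the old server is still serving clients
    zfs snapshot tank/data@pre
    zfs send tank/data@pre | ssh newhost zfs receive tank/data

    # 2) In the outage window: stop writers, then send only the blocks that changed
    zfs snapshot tank/data@final
    zfs send -i @pre tank/data@final | ssh newhost zfs receive -F tank/data

    # 3) Flip the clients over to newhost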



If you can't tell, I'm a big fan of ZFS.  I'm hoping to run my dev Ceph cluster on ZFS soon.


Craig Lewis
Senior Systems Engineer
Office +1.714.602.1309
Email clewis@xxxxxxxxxxxxxxxxxx

Central Desktop. Work together in ways you never thought possible.

On 8/13/13 00:47, Dmitry Postrigan wrote:
This will be a single server configuration, the goal is to replace mdraid, hence I tried to use localhost
(nothing more will be added to the cluster). Are you saying it will be less fault tolerant than a RAID-10?

Ceph is a distributed object store. If you stay within a single machine,
keep using a local RAID solution (hardware or software).

Why would you want to make this switch?
I do not think RAID-10 on six 3TB disks is going to be reliable at all. I have simulated several failures, and
it looks like a rebuild will take a lot of time. Funnily enough, during one of these experiments another drive
failed and I lost the entire array. Good luck recovering from that...

I feel that Ceph is better than mdraid because:
1) When ceph cluster is far from being full, 'rebuilding' will be much faster vs mdraid
2) You can easily change the number of replicas
3) When multiple disks have bad sectors, I suspect ceph will be much easier to recover data from than from
mdraid which will simply never finish rebuilding.
4) If we need to migrate data over to a different server with no downtime, we just add more OSDs, wait, and
then remove the old ones :-)

This is my initial observation though, so please correct me if I am wrong.

Dmitry

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

