> I am looking at evaluating Ceph for use with large storage nodes (24-36
> SATA disks per node, 3 or 4TB per disk, HBAs, 10G ethernet).
>
> What would be the best practice for deploying this? I can see two main
> options.
>
> (1) Run 24-36 OSDs per node. Configure Ceph to replicate data to one or
> more other nodes. This means that if a disk fails, there will have to be
> an operational process to stop the OSD, unmount and replace the disk,
> mkfs a new filesystem, mount it, and restart the OSD - which could be
> more complicated and error-prone than a RAID swap would be.
>
> (2) Combine the disks using some sort of RAID (or ZFS raidz/raidz2), and
> run one OSD per node. In this case:
> * if I use RAID0 or LVM, then a single disk failure will cause all the
>   data on the node to be lost and rebuilt
> * if I use RAID5/6, then write performance is likely to be poor
> * if I use RAID10, then capacity is reduced by half; with Ceph
>   replication each piece of data will be replicated 4 times (twice on
>   one node, twice on the replica node)
>
> It seems to me that (1) is what Ceph was designed to achieve, maybe with
> 2 or 3 replicas. Is this what's recommended?

There is a middle ground to consider: 12-18 OSDs, each running on a pair
of disks in a RAID1 configuration. This would reduce most disk failures to
a simple disk swap (assuming an intelligent hardware RAID controller).
Obviously you still have a 50% reduction in usable disk space, but you get
the advantage that your filesystem never sees the bad disk and all the
problems that can cause.

James
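For reference, the operational process described in option (1) looks
roughly like the outline below. This is only a sketch, not an endorsed
procedure: the OSD id (12), the device name (/dev/sdm), the CRUSH host
name, and the use of XFS with sysvinit-style "service ceph" scripts are
all placeholder assumptions for illustration.

    # 1. Take the dead OSD out of data placement and stop it.
    ceph osd out 12
    service ceph stop osd.12

    # 2. Remove it from the CRUSH map, auth database and OSD map; unmount.
    ceph osd crush remove osd.12
    ceph auth del osd.12
    ceph osd rm 12
    umount /var/lib/ceph/osd/ceph-12

    # 3. Physically swap the disk, make a fresh filesystem, remount.
    mkfs.xfs -f /dev/sdm
    mount /dev/sdm /var/lib/ceph/osd/ceph-12

    # 4. Recreate the OSD, register its key, put it back into CRUSH, start.
    ceph osd create                 # allocates the lowest free id (12 here)
    ceph-osd -i 12 --mkfs --mkkey
    ceph auth add osd.12 osd 'allow *' mon 'allow rwx' \
        -i /var/lib/ceph/osd/ceph-12/keyring
    ceph osd crush add osd.12 1.0 host=storage-node-1
    service ceph start osd.12

The "2 or 3 replicas" in the question are a per-pool setting (e.g.
"ceph osd pool set <pool> size 3"); Ceph places those replicas across
OSDs/hosts according to the CRUSH rules, independently of how the disks
underneath each OSD are arranged.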
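The RAID1-pair middle ground could look something like the following,
using Linux md software RAID as a stand-in for the hardware controller
mentioned above; again the device names (/dev/sdb, /dev/sdc, /dev/md0) and
the OSD id are placeholders.

    # Mirror two disks and put a single OSD's filesystem on the mirror.
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
    mkfs.xfs /dev/md0
    mkdir -p /var/lib/ceph/osd/ceph-0
    mount /dev/md0 /var/lib/ceph/osd/ceph-0

    # On a disk failure the OSD keeps running; the swap reduces to:
    mdadm /dev/md0 --fail /dev/sdb --remove /dev/sdb
    #   ... physically replace the disk ...
    mdadm /dev/md0 --add /dev/sdb

The OSD's filesystem only ever sees /dev/md0, so the failed member and the
rebuild are handled entirely below it, which is exactly the advantage of
the RAID1 pair noted above.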