Re: Large storage nodes - best practices

In the previous email you're forgetting that RAID1 has a write penalty of 2, since every write is mirrored, and at that point we're really talking about RAID types rather than Ceph. One of Ceph's main advantages is that replication is handled at the cluster level, so you don't need RAID to that degree. I'm sure there's math to back this up, but a larger number of smaller nodes gives you better fail-over behavior than a few large nodes. If you're competing over CPU resources you could use RAID0 to keep the write penalty minimal (never thought I'd suggest RAID0, haha). You may not max out the drive speed because of CPU, but that's the cost of putting Ceph on a machine it wasn't built for. It would be good information to know the limits of what a machine like that can do with Ceph, so please do share if you run some tests.
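For what it's worth, here's a rough back-of-the-envelope sketch in Python of the two points above: the RAID write penalty, and why many small nodes fail over more gracefully than a few big ones. The numbers (150 IOPS per spinner, 432 TB raw) are just illustrative assumptions, not benchmarks.

    # Rough sketch only: RAID write penalty and node-failure rebuild impact.

    def effective_write_iops(disks, iops_per_disk, write_penalty):
        """Aggregate write IOPS of an array given a RAID write penalty."""
        return disks * iops_per_disk / write_penalty

    IOPS = 150  # assumed per 7200rpm SATA disk, illustrative only
    print("RAID0,  24 disks:", effective_write_iops(24, IOPS, 1))  # no penalty
    print("RAID10, 24 disks:", effective_write_iops(24, IOPS, 2))  # mirror writes twice
    print("RAID6,  24 disks:", effective_write_iops(24, IOPS, 6))  # read-modify-write

    # Fail-over: when a whole node dies, Ceph re-replicates everything it held,
    # and the surviving nodes share that rebuild work.
    def rebuild_share(total_tb, nodes):
        per_node = total_tb / nodes            # data lost with one node
        return per_node, per_node / (nodes - 1)  # TB each survivor must absorb

    for nodes in (4, 12, 36):
        lost, per_survivor = rebuild_share(432, nodes)  # 432 TB raw, assumed
        print(f"{nodes} nodes: lose {lost:.0f} TB, "
              f"each survivor re-writes {per_survivor:.1f} TB")

The same raw capacity spread over more, smaller nodes means each node failure loses less data and the rebuild is shared by more peers, so recovery is faster and less disruptive.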

Overall, from my understanding it is generally better to move toward the ideal node size for Ceph and slowly deprecate the larger nodes. Fundamentally, since replication is done at a higher level than the individual spinners, the case for doing RAID underneath falls further behind.


On Mon, Aug 5, 2013 at 5:05 PM, James Harper <james.harper@xxxxxxxxxxxxxxxx> wrote:
> I am looking at evaluating ceph for use with large storage nodes (24-36 SATA
> disks per node, 3 or 4TB per disk, HBAs, 10G ethernet).
>
> What would be the best practice for deploying this? I can see two main
> options.
>
> (1) Run 24-36 osds per node. Configure ceph to replicate data to one or more
> other nodes. This means that if a disk fails, there will have to be an
> operational process to stop the osd, unmount and replace the disk, mkfs a
> new filesystem, mount it, and restart the osd - which could be more
> complicated and error-prone than a RAID swap would be.
>
> (2) Combine the disks using some sort of RAID (or ZFS raidz/raidz2), and run
> one osd per node. In this case:
> * if I use RAID0 or LVM, then a single disk failure will cause all the data on the
> node to be lost and rebuilt
> * if I use RAID5/6, then write performance is likely to be poor
> * if I use RAID10, then capacity is reduced by half; with ceph replication each
> piece of data will be replicated 4 times (twice on one node, twice on the
> replica node)
>
> It seems to me that (1) is what ceph was designed to achieve, maybe with 2
> or 3 replicas. Is this what's recommended?
>

There is a middle ground to consider - 12-18 OSDs, each running on a pair of disks in a RAID1 configuration. This would reduce most disk failures to a simple disk swap (assuming an intelligent hardware RAID controller). Obviously you still have a 50% reduction in disk space, but you have the advantage that your filesystem never sees the bad disk and all the problems that can cause.

James




--
Follow Me: @Scottix
http://about.me/scottix
Scottix@xxxxxxxxx
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
