Re: Large storage nodes - best practices

Brian,

Short answer: Ceph is generally used with multiple OSDs per node; one OSD per storage drive with no RAID is the most common setup. At 24 or 36 drives per chassis, there are several potential bottlenecks to consider.

Mark Nelson, the Ceph performance guy at Inktank, has published several articles you should consider reading. A few of interest are [0], [1], and [2]. The last link is a 5-part series.

There are lots of considerations:

- HBA performance
- Total OSD throughput vs. network throughput (see the rough check after this list)
- SSD throughput vs. OSD throughput
- CPU / RAM overhead for the OSD processes
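
On the throughput point, a back-of-the-envelope check is usually enough to see where a dense node becomes network-bound. The figures below are assumptions for illustration, not measurements:

    # Rough bandwidth sanity check for a dense OSD node (illustrative figures).
    drives = 24                     # one OSD per SATA drive
    drive_mb_s = 100                # assumed sustained MB/s per 7200 rpm SATA drive
    nic_mb_s = 10 * 1000 / 8 * 0.9  # ~10GbE usable payload in MB/s, roughly

    aggregate_disk = drives * drive_mb_s
    # With journals co-located on the same spinners, every write hits the disk
    # twice, so effective write throughput is roughly halved.
    effective_write = aggregate_disk / 2

    print("aggregate disk read throughput: %d MB/s" % aggregate_disk)
    print("effective write throughput (co-located journals): %d MB/s" % effective_write)
    print("usable 10GbE throughput: %d MB/s" % nic_mb_s)
    if min(aggregate_disk, effective_write) > nic_mb_s:
        print("the 10GbE link, not the drives, is the likely bottleneck")

Even optimistic spindle numbers swamp a single 10GbE link, which is one reason a second link or a separate cluster network for replication traffic is often suggested.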

Also, note that there is ongoing work to add erasure coding as an optional backend (as opposed to the current replication scheme). If you prioritize bulk storage over performance, you may be interested in following the progress [3] and [4].

[0]: http://ceph.com/community/ceph-performance-part-1-disk-controller-write-throughput/
[1]: http://ceph.com/community/ceph-performance-part-2-write-throughput-without-ssd-journals/
[2]: http://ceph.com/performance-2/ceph-cuttlefish-vs-bobtail-part-1-introduction-and-rados-bench/
[3]: http://wiki.ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend
[4]: http://www.inktank.com/about-inktank/roadmap/


Cheers,
Mike Dawson


On 8/5/2013 9:50 AM, Brian Candler wrote:
I am looking at evaluating ceph for use with large storage nodes (24-36
SATA disks per node, 3 or 4TB per disk, HBAs, 10G ethernet).

What would be the best practice for deploying this? I can see two main
options.

(1) Run 24-36 osds per node. Configure ceph to replicate data to one or
more other nodes. This means that if a disk fails, there will have to be
an operational process to stop the osd, unmount and replace the disk,
mkfs a new filesystem, mount it, and restart the osd - which could be
more complicated and error-prone than a RAID swap would be.
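
Roughly, I imagine the per-disk dance looking something like the sketch below (OSD id, device, and mount point are placeholders, the commands are only indicative, and the keyring/CRUSH bookkeeping is left out):

    # Sketch of replacing one failed drive behind one OSD (placeholders throughout;
    # exact commands vary by release, auth and crush map steps omitted).
    import subprocess

    osd_id = 12                                   # OSD backed by the failed drive
    dev = "/dev/sdm1"                             # replacement drive's data partition
    mnt = "/var/lib/ceph/osd/ceph-%d" % osd_id    # default OSD data path

    def run(cmd):
        print("+ " + cmd)
        subprocess.check_call(cmd, shell=True)

    run("service ceph stop osd.%d" % osd_id)            # stop the daemon
    run("umount %s" % mnt)                               # release the dead drive
    # ... physically swap the disk and repartition it ...
    run("mkfs.xfs -f %s" % dev)                          # fresh filesystem
    run("mount %s %s" % (dev, mnt))                      # remount at the OSD data path
    run("ceph-osd -i %d --mkfs --mkjournal" % osd_id)    # re-initialise the data dir
    run("service ceph start osd.%d" % osd_id)            # restart; backfill repopulates it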

(2) Combine the disks using some sort of RAID (or ZFS raidz/raidz2), and
run one osd per node. In this case:
* if I use RAID0 or LVM, then a single disk failure will cause all the
data on the node to be lost and rebuilt
* if I use RAID5/6, then write performance is likely to be poor
* if I use RAID10, then capacity is reduced by half; with ceph
replication each piece of data will be replicated 4 times (twice on one
node, twice on the replica node)
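
Worked through for a 36 x 4TB node with 2x Ceph replication (purely illustrative, ignoring filesystem overhead):

    # Rough usable capacity per 36 x 4 TB node with 2x Ceph replication.
    raw_tb = 36 * 4                               # 144 TB raw
    replicas = 2

    option1_per_osd = raw_tb / replicas           # no RAID, one OSD per disk: ~72 TB
    option2_raid10  = raw_tb / 2 / replicas       # RAID10 under one OSD: ~36 TB (4 copies)
    option2_raid6   = (36 - 2) * 4 / replicas     # one RAID6 group under one OSD: ~68 TB

    print(option1_per_osd, option2_raid10, option2_raid6)

So RAID10 underneath Ceph replication keeps only a quarter of the raw space, which is the four-copies point above.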

It seems to me that (1) is what ceph was designed to achieve, maybe with
2 or 3 replicas. Is this what's recommended?

I have seen some postings which imply one osd per node: e.g.
http://www.sebastien-han.fr/blog/2012/08/17/ceph-storage-node-maintenance/
shows three nodes each with one OSD - but maybe this was just a trivial
example for simplicity.

Looking at
http://ceph.com/docs/next/install/hardware-recommendations/
it says " You *may* run multiple OSDs per host" (my emphasis), and goes
on to caution against having more disk bandwidth than network bandwidth.
Ah, but at another point it says " We recommend using a dedicated drive
for the operating system and software, and one drive for each OSD daemon
you run on the host." So I guess that's fairly clear.

Any other options I should be considering?

Regards,

Brian.



_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
