Re: Ceph with high disk densities?

Hi Scott,

On 10/07/2013 11:15 AM, Scott Devoid wrote:
I brought this up within the context of the RAID discussion, but it did
not garner any responses. [1]

In our small test deployments (160 HDDs and OSDs across 20 machines) our
performance is quickly bounded by CPU and memory overhead. These are 2U
machines with 2x 6-core Nehalem CPUs; running 8 OSDs consumed 25% of the
total CPU time. This was a cuttlefish deployment.

You might be interested in trying a more recent release. We've implemented the SSE4 CRC32c instruction for CPUs that support it, which dramatically reduces CPU overhead during large sequential writes. On a 4U box with 24 spinning disks and 8 SSDs (4 bays unused), this brought CPU usage down from something like 80% to around 40% during large sequential writes, if I'm remembering correctly. The choice of underlying filesystem will also affect CPU overhead; BTRFS tends to be a bit more CPU intensive than, say, EXT4.
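For what it's worth, a quick way to confirm that a node's CPUs actually expose SSE4.2 before counting on the hardware CRC32c path is just to check the flags in /proc/cpuinfo:

    grep -m1 -o sse4_2 /proc/cpuinfo    # prints "sse4_2" if the instruction is available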


This seems like a rather high CPU overhead, particularly when we are
looking to hit a density target of 10-15 4TB drives per U within 1.5 years.
Does anyone have suggestions for hitting this requirement? Are there
ways to reduce CPU and memory overhead per OSD?

If nothing else, you can turn off crc32c calculations for the messenger in ceph.conf, and on the client as a mount parameter if you are using cephfs. That will help. For small IO, we've also just started looking at whether we can reduce the amount of memory copying happening inside the OSDs, which could potentially help here too, especially on ARM or other low-power platforms.
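Roughly along these lines (the option names are from memory and have changed between releases, and mon1/paths are just placeholders, so double-check against the docs for your version):

    # ceph.conf: disable messenger crc32c on OSDs and clients
    [global]
        ms nocrc = true

    # kernel cephfs client: skip data crc at mount time with the nocrc option
    mount -t ceph mon1:6789:/ /mnt/ceph -o name=admin,secretfile=/etc/ceph/admin.secret,nocrc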


My one suggestion was to do some form of RAID to join multiple drives
and present them to a single OSD. A 2-drive RAID-0 would halve the per-OSD
overhead while doubling the failure rate and doubling the rebalance
overhead. It is not clear to me whether that is actually better.

If you have 60+ drives per node perhaps. It kind of depends on how much throughput you can push over your network and what your disks and controllers are capable of. Ceph seems to push controllers very hard, sometimes with both small random reads/writes and large sequential writes concurrently. The fastest nodes we've tested have multiple controllers and skip expander backplanes entirely.
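If anyone wants to experiment with the 2-drive RAID-0 idea above, the usual mdadm incantation is roughly the following (device names are placeholders, and this is only meant to illustrate the approach, not recommend it):

    # stripe two spinners into a single block device
    mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb /dev/sdc
    # then create the filesystem and OSD on /dev/md0 exactly as you would on a bare disk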

I suspect that in the future, the best platforms for Ceph on spinning disks will be extremely dense chassis that house multiple nodes, each with a single CPU, a limited number of OSD disks per node (on a dedicated controller with no expander), and possibly some 2.5" bays for journals and system disks on a separate controller. 10GbE would be enough to get reasonable performance out of a node like this. With faster storage or larger nodes, 40GbE or QDR+ IB might be more attractive.


[1]
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-October/004833.html


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

