Hi Scott,

On 10/07/2013 11:15 AM, Scott Devoid wrote:
I brought this up within the context of the RAID discussion, but it did not garner any responses. [1] In our small test deployments (160 HDDs and OSDs across 20 machines) our performance is quickly bounded by CPU and memory overhead. These are 2U machines with 2x 6-core Nehalem CPUs, and running 8 OSDs consumed 25% of the total CPU time. This was a cuttlefish deployment.
You might be interested in trying a more recent release. We've implemented the SSE4 CRC32c instruction for CPUs that support it, which dramatically reduces CPU overhead during large sequential writes. On a 4U box with 24 spinning disks and 8 SSDs (4 bays unused), this brought CPU usage down from something like 80% to around 40% during large sequential writes, if I'm remembering correctly. The choice of underlying filesystem will also affect CPU overhead; BTRFS tends to be a bit more CPU intensive than, say, EXT4.
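If it helps to see why the hardware path is so much cheaper, here is a minimal, self-contained C sketch of CRC32c using the SSE4.2 intrinsic. It is only an illustration, not Ceph's actual implementation; the function name crc32c_hw and the tiny main() are made up for the example, and it assumes a CPU with SSE4.2 and compiling with something like "gcc -msse4.2 crc32c_demo.c".

    /* Sketch: CRC32c via the SSE4.2 crc32 instruction.
     * One instruction folds 8 bytes into the running CRC, which is why
     * the hardware path costs so little compared to a table-driven loop. */
    #include <nmmintrin.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    static uint32_t crc32c_hw(uint32_t crc, const void *buf, size_t len)
    {
        const uint8_t *p = buf;

        /* Consume 8 bytes per instruction while we can. */
        while (len >= 8) {
            uint64_t v;
            memcpy(&v, p, 8);
            crc = (uint32_t)_mm_crc32_u64(crc, v);
            p += 8;
            len -= 8;
        }
        /* Finish the tail one byte at a time. */
        while (len--)
            crc = _mm_crc32_u8(crc, *p++);
        return crc;
    }

    int main(void)
    {
        const char msg[] = "ceph";
        /* Standard CRC32c convention: start from 0xffffffff, invert at the end. */
        uint32_t crc = crc32c_hw(0xffffffffu, msg, sizeof(msg) - 1) ^ 0xffffffffu;
        printf("crc32c = 0x%08x\n", crc);
        return 0;
    }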
This seems like a rather high CPU overhead, particularly when we are looking to hit a density target of 10-15 4TB drives per U within 1.5 years. Does anyone have suggestions for hitting this requirement? Are there ways to reduce CPU and memory overhead per OSD?
If nothing else, you can turn off crc32 calculations for the messenger in ceph.conf, and on the client as a mount parameter if you are using cephfs; that will help (rough sketch below). For small IO, we just started some work to look at whether we can reduce the amount of memory copying happening inside the OSDs, which could potentially help here too, especially on ARM or other low-power platforms.
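As a rough sketch of what that looks like (the option names below are an assumption based on cuttlefish/dumpling-era releases and were renamed later, so please check the documentation for your version):

    # ceph.conf -- turn off messenger crc
    # (later releases split this into ms_crc_data / ms_crc_header)
    [global]
        ms nocrc = true

    # kernel cephfs client -- skip data crc at mount time with the "nocrc" option;
    # the monitor address and secretfile path here are placeholders
    mount -t ceph 192.168.0.1:6789:/ /mnt/ceph -o name=admin,secretfile=/etc/ceph/admin.secret,nocrc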
My one suggestion was to do some form of RAID to join multiple drives and present them to a single OSD. A 2-drive RAID-0 would halve the OSD overhead while doubling the failure rate and doubling the rebalance overhead. It is not clear to me whether that is better or not.
If you have 60+ drives per node, perhaps. It kind of depends on how much throughput you can push over your network and what your disks and controllers are capable of. Ceph seems to push controllers very hard, sometimes with small random reads/writes and large sequential writes happening concurrently. The fastest nodes we've tested have multiple controllers and skip expander backplanes entirely.
I suspect that in the future, the best platforms for Ceph on spinning disks will be extremely dense chassis that house multiple nodes, each with a single CPU, a limited number of OSD disks per node (on a dedicated controller with no expander), and possibly some 2.5" bays for journals and system disks on a separate controller. 10GbE would be enough to get reasonable performance out of a node like this; with faster storage or larger nodes, 40GbE or QDR+ IB might be more attractive.
[1] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-October/004833.html