Hi Sergio,

On 04/07/2016 07:00 AM, Sergio A. de Carvalho Jr. wrote:
Hi all,

I've set up a testing/development Ceph cluster consisting of 5 Dell PowerEdge R720xd servers (256GB RAM, 2x 8-core Xeon E5-2650 @ 2.60 GHz, dual-port 10Gb Ethernet, 2x 900GB + 12x 4TB disks) running CentOS 6.5 and Ceph Hammer 0.94.6. All servers use one 900GB disk for the root partition and the other 13 disks are assigned to OSDs, so we have 5 x 13 = 65 OSDs in total. We also run 1 monitor on every host. Journals are 5GB partitions on each disk (this is something we will obviously need to revisit later).

The purpose of this cluster will be to serve as backend storage for Cinder volumes and Glance images in an OpenStack cloud.

With this setup, I'm getting what I consider "okay" performance:

# rados -p images bench 5 write
Maintaining 16 concurrent writes of 4194304 bytes for up to 5 seconds or 0 objects
Total writes made:      394
Write size:             4194304
Bandwidth (MB/sec):     299.968
Stddev Bandwidth:       127.334
Max bandwidth (MB/sec): 348
Min bandwidth (MB/sec): 0
Average Latency:        0.212524
Stddev Latency:         0.13317
Max latency:            0.828946
Min latency:            0.07073

Does that look acceptable? How much more can I expect to achieve by fine-tuning and perhaps using a more efficient setup?
I'll assume 3x replication for these tests. Under reasonable conditions you should be able to get roughly 70MB/s of raw throughput per standard 7200rpm spinning disk with Ceph using filestore on XFS. For 65 OSDs that's about 4.5GB/s. Divide by 3 for replication and you're at roughly 1.5GB/s. Now add the journal double-write penalty and you're down to about 750MB/s. So I'd say the aggregate throughput you're seeing here (~300MB/s) is lower than what you might ideally see.
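To spell that math out (ballpark figures only):

  65 OSDs x ~70 MB/s                ~= 4550 MB/s raw
  / 3 (replication)                 ~= 1500 MB/s
  / 2 (co-located journal writes)   ~=  750 MB/s expected aggregate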
The first step would probably be to increase the concurrency and see how much that helps.
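Something along these lines (same pool, just a longer run with more writes in flight; the numbers are only illustrative) should give you a better picture of the aggregate:

  rados -p images bench 30 write -t 32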
I do understand the bandwidth above is a product of running 16 concurrent writes, and rather small object sizes (4MB). Bandwidth drops significantly with 64MB objects and 1 thread:

# rados -p images bench 5 write -b 67108864 -t 1
Maintaining 1 concurrent writes of 67108864 bytes for up to 5 seconds or 0 objects
Total writes made:      7
Write size:             67108864
Bandwidth (MB/sec):     71.520
Stddev Bandwidth:       24.1897
Max bandwidth (MB/sec): 64
Min bandwidth (MB/sec): 0
Average Latency:        0.894792
Stddev Latency:         0.0547502
Max latency:            0.99311
Min latency:            0.832765

Is such a drop expected?
Yep! Concurrency is really important for distributed systems and Ceph is no exception. If you only keep 1 write in flight, you can't really expect better than the performance of a single OSD. Ceph writes a full copy of the data to the journal before sending a write acknowledgement to the client, and every replica write also has to be fully written to the journal on the secondary OSDs. These writes happen in parallel, but they add latency, and overall you'll only be as fast as the slowest of those journal writes. In your case the filesystem writes also contend with the journal writes, since the journals are co-located on the same disks.
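You can see this in your own numbers: with one write in flight, bandwidth is just write size divided by per-write latency,

  64 MB / 0.895 s average latency ~= 71.5 MB/s

which is exactly the bandwidth the benchmark reported.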
In this case you're probably only getting 71MB/s because the test is so short. With co-located journals I'd expect a longer-running test to actually come in somewhat lower than this.
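It's worth rerunning the single-threaded test for longer to see where it settles, e.g.:

  rados -p images bench 60 write -b 67108864 -t 1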
Now, what I'm really concerned about is upload times. Uploading a randomly-generated 1GB file takes a bit too long:

# time rados -p images put random_1GB /tmp/random_1GB

real    0m35.328s
user    0m0.560s
sys     0m3.665s

Is this normal? If so, if I set up this cluster as a backend for Glance, does that mean uploading a 1GB image will require 35 seconds (plus whatever time Glance requires to do its own thing)?
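Quick arithmetic on that upload, just to put a number on it: 1024 MB / 35.3 s ~= 29 MB/s, and a plain rados put streams a single object more or less sequentially, so it behaves much like the single-threaded bench above.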
And here's where you are getting less than you'd like. I'd hope for a little faster than 29MB/s, but given how your cluster is set up, 30-40MB/s is probably about right. If you need this use case to be faster, you have a few options:
1) Wait for bluestore to become production-ready. This is the new OSD backend that specifically avoids full-data journal writes for large sequential write IO. Expect per-OSD speed to be around 1.5-2X as fast in this case for spinning-disk-only clusters.
2) Move the journals off the spinning disks. A common way to do this is to buy a couple of very fast, high-write-endurance NVMe drives or SSDs. Some of the newer NVMe drives are fast enough to support journals for 15-20 spinning disks each. Just make sure they have enough write endurance to meet your needs. Assuming no other bottlenecks, this is usually close to a 2X improvement for large write IO (see the example commands after this list).
3) If that's not good enough, you might consider buying a small set of SSDs/NVMes and putting them in a dedicated SSD pool for specific cases like this (a rough sketch of that setup follows the numbers below). Even with a setup like that, you'll likely see higher performance with more concurrency. Here's an example I just ran on a 4-node cluster with a single fast NVMe drive per node:
rados -p cbt-librbdfio bench 30 write -t 1
Bandwidth (MB/sec):     180.205

rados -p cbt-librbdfio bench 30 write -t 16
Bandwidth (MB/sec):     1197.11
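For reference, here's the rough shape of options 2 and 3 on the command line (Hammer-era syntax; the device, bucket, rule, and pool names below are only placeholders, so adapt them to your environment):

  # Option 2: prepare an OSD with its journal on a separate NVMe/SSD device
  # (ceph-disk carves a journal partition out of the second device)
  ceph-disk prepare /dev/sdc /dev/nvme0n1

  # Option 3: a dedicated SSD pool via a separate CRUSH root and rule
  # (you still need to place the SSD-backed OSDs/hosts under the new root)
  ceph osd crush add-bucket ssd root
  ceph osd crush rule create-simple ssd_rule ssd host
  ceph osd pool create fastpool 128 128
  ceph osd pool set fastpool crush_ruleset <ruleset id of ssd_rule>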
Thanks,
Sergio