Hi Sergio,

On 04/07/2016 07:00 AM, Sergio A. de Carvalho Jr. wrote:
Hi all,

I've set up a testing/development Ceph cluster consisting of 5 Dell PowerEdge R720xd servers (256GB RAM, 2x 8-core Xeon E5-2650 @ 2.60 GHz, dual-port 10Gb Ethernet, 2x 900GB + 12x 4TB disks) running CentOS 6.5 and Ceph Hammer 0.94.6. All servers use one 900GB disk for the root partition and the other 13 disks are assigned to OSDs, so we have 5 x 13 = 65 OSDs in total. We also run 1 monitor on every host. Journals are 5GB partitions on each disk (this is something we will obviously need to revisit later).

The purpose of this cluster will be to serve as backend storage for Cinder volumes and Glance images in an OpenStack cloud.

With this setup, I'm getting what I consider "okay" performance:

# rados -p images bench 5 write
Maintaining 16 concurrent writes of 4194304 bytes for up to 5 seconds or 0 objects
Total writes made:      394
Write size:             4194304
Bandwidth (MB/sec):     299.968
Stddev Bandwidth:       127.334
Max bandwidth (MB/sec): 348
Min bandwidth (MB/sec): 0
Average Latency:        0.212524
Stddev Latency:         0.13317
Max latency:            0.828946
Min latency:            0.07073

Does that look acceptable? How much more can I expect to achieve by fine-tuning and perhaps using a more efficient setup?
I'll assume 3x replication for these tests. Under reasonable conditions you should be able to get roughly 70MB/s of raw throughput per standard 7200rpm spinning disk with Ceph using filestore on XFS. For 65 OSDs that's about 4.5GB/s. Divide by 3 for replication and you're at roughly 1.5GB/s. Now add the journal double-write penalty and you're down to about 750MB/s. So I'd say the aggregate throughput you're seeing here (~300MB/s) is lower than what you might ideally see.
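To spell that math out (ballpark figures only):

  65 OSDs x ~70 MB/s                ~= 4550 MB/s raw
  / 3 (replication)                 ~= 1500 MB/s
  / 2 (co-located journal writes)   ~=  750 MB/s expected aggregate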
The first step would probably be to increase the concurrency and see how much that helps.
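Something along these lines (same pool, just a longer run with more writes in flight; the numbers are only illustrative) should give you a better picture of the aggregate:

  rados -p images bench 30 write -t 32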
I do understand the bandwidth above is a product of running 16 concurrent writes, and rather small object sizes (4MB). Bandwidth drops significantly with 64MB objects and 1 thread:

# rados -p images bench 5 write -b 67108864 -t 1
Maintaining 1 concurrent writes of 67108864 bytes for up to 5 seconds or 0 objects
Total writes made:      7
Write size:             67108864
Bandwidth (MB/sec):     71.520
Stddev Bandwidth:       24.1897
Max bandwidth (MB/sec): 64
Min bandwidth (MB/sec): 0
Average Latency:        0.894792
Stddev Latency:         0.0547502
Max latency:            0.99311
Min latency:            0.832765

Is such a drop expected?
Yep! Concurrency is really important for distributed systems and Ceph is no exception. If you only keep 1 write in flight, you can't really expect better than the performance of a single OSD. Ceph writes a full copy of the data to the journal before sending a write acknowledgement to the client, and every replica write also has to be fully written to the journal on the secondary OSDs. These writes happen in parallel, but they add latency, and overall you'll only be as fast as the slowest of those journal writes. In your case the filesystem writes also contend with the journal writes, since the journals are co-located on the same disks.
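You can see this in your own numbers: with one write in flight, bandwidth is just write size divided by per-write latency,

  64 MB / 0.895 s average latency ~= 71.5 MB/s

which is exactly the bandwidth the benchmark reported.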
In this case you're probably only getting 71MB/s because the test is so short. With co-located journals I'd expect a longer-running test to actually come in somewhat lower than this.
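It's worth rerunning the single-threaded test for longer to see where it settles, e.g.:

  rados -p images bench 60 write -b 67108864 -t 1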
Now, what I'm really concerned about is upload times. Uploading a randomly-generated 1GB file takes a bit too long:

# time rados -p images put random_1GB /tmp/random_1GB

real    0m35.328s
user    0m0.560s
sys     0m3.665s

Is this normal? If so, if I set up this cluster as a backend for Glance, does that mean uploading a 1GB image will require 35 seconds (plus whatever time Glance requires to do its own thing)?
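Quick arithmetic on that upload, just to put a number on it: 1024 MB / 35.3 s ~= 29 MB/s, and a plain rados put streams a single object more or less sequentially, so it behaves much like the single-threaded bench above.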
And here's where you are getting less than you'd like. I'd hope for a little faster than 29MB/s, but given how your cluster is set up, 30-40MB/s is probably about right. If you need this use case to be faster, you have a few options:
1) Wait for bluestore to become production-ready. This is the new OSD backend that specifically avoids full-data journal writes for large sequential write IO. Expect per-OSD speed to be around 1.5-2X as fast in this case for spinning-disk-only clusters.
2) Move the journals off the spinning disks. A common way to do this is to buy a couple of very fast, high-write-endurance NVMe drives or SSDs. Some of the newer NVMe drives are fast enough to support journals for 15-20 spinning disks each. Just make sure they have enough write endurance to meet your needs. Assuming no other bottlenecks, this is usually close to a 2X improvement for large write IO (see the example commands after this list).
3) If that's not good enough, you might consider buying a small set of SSDs/NVMes and putting them in a dedicated SSD pool for specific cases like this (a rough sketch of that setup follows the numbers below). Even with a setup like that, you'll likely see higher performance with more concurrency. Here's an example I just ran on a 4-node cluster with a single fast NVMe drive per node:
rados -p cbt-librbdfio bench 30 write -t 1
Bandwidth (MB/sec):     180.205

rados -p cbt-librbdfio bench 30 write -t 16
Bandwidth (MB/sec):     1197.11
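For reference, here's the rough shape of options 2 and 3 on the command line (Hammer-era syntax; the device, bucket, rule, and pool names below are only placeholders, so adapt them to your environment):

  # Option 2: prepare an OSD with its journal on a separate NVMe/SSD device
  # (ceph-disk carves a journal partition out of the second device)
  ceph-disk prepare /dev/sdc /dev/nvme0n1

  # Option 3: a dedicated SSD pool via a separate CRUSH root and rule
  # (you still need to place the SSD-backed OSDs/hosts under the new root)
  ceph osd crush add-bucket ssd root
  ceph osd crush rule create-simple ssd_rule ssd host
  ceph osd pool create fastpool 128 128
  ceph osd pool set fastpool crush_ruleset <ruleset id of ssd_rule>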
Thanks,
Sergio