Hi Sergio, yes, I think you have largely answered your own points. The main thing is to avoid excessive seeks on the HDDs: it would help to separate journal and data, but since HDDs are heavily penalized by seek and rotational latency, it would not help to put multiple journals on a single HDD. With SSDs, of course, random access is not an issue, and we have found that a SATA SSD used as a journal can support around five HDDs; with PCIe NVMe devices such as the Intel P3700 we can sustain much higher ratios. I would also take a look at iostat to see just how busy the disks are; I would estimate that utilization is high.
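Something as simple as the following, run on one of the OSD nodes while your rados bench is going, will show it (extended device stats, 5-second samples, three reports):

# iostat -x -d 5 3

If %util on the data disks sits near 100% and await climbs, the spindles are the bottleneck rather than the network or CPUs.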
In our testing here we only co-locate journals and data on high-density servers (72-bay), where there are enough devices to share the workload within an OSD server; in that case we do see idle time on the disks themselves. From our observations, co-locating on lower-density servers causes extremely high utilization. Even if you could only borrow some SSDs for a short duration, at least you would know for sure how much you would gain. The type of SSD is of course also very important, and this list has had a number of good discussions on endurance and suitability as a journal device.

From: Sergio A. de Carvalho Jr. [mailto:scarvalhojr@xxxxxxxxx]
Thanks, Alan. Unfortunately, we currently don't have much flexibility in the hardware we can get, so adding SSDs might not be possible in the near future. What is the best practice here: allocating, for each OSD, one disk just for data and one disk just for the journal? Since the journals are rather small (in our setup a 5GB partition is created on every disk), wouldn't that be a bit of a waste of disk space? I was wondering if it would make sense to give each OSD one full 4TB disk and use one of the 900GB disks for all the journals (12 journals in this case). Would that cause even more contention, since different OSDs would then be trying to write their journals to the same disk?
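(To be concrete, I imagine that layout would be prepared with something like the commands below, where /dev/sdd stands for one of the 4TB data disks and /dev/sdb for the shared 900GB journal disk; the device names are just placeholders, and as far as I understand ceph-disk would carve a new journal partition, sized by osd_journal_size, out of /dev/sdb for each OSD:)

# ceph-disk prepare /dev/sdd /dev/sdb
# ceph-disk activate /dev/sdd1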
On Thu, Apr 7, 2016 at 4:13 PM, Alan Johnson <alanj@xxxxxxxxxxxxxx> wrote:
I would strongly reconsider your journaling setup (you do mention that you will revisit this): we have found that co-locating journals does impact performance, and separating them onto flash is usually a good idea. I'm also not sure of your networking setup, which can also have a significant impact.
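A quick point-to-point check with iperf between two of the OSD nodes would rule that in or out (the hostname below is just a placeholder):

# iperf3 -s                          (on one node)
# iperf3 -c osd-node-2 -P 4 -t 30    (on another)

With dual-port 10GbE you would expect something close to line rate; anything well short of that points at the network rather than the disks.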
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Sergio A. de Carvalho Jr.

Hi all, I've set up a testing/development Ceph cluster consisting of 5 Dell PowerEdge R720xd servers (256GB RAM, 2x 8-core Xeon E5-2650 @ 2.60 GHz, dual-port 10Gb Ethernet, 2x 900GB + 12x 4TB disks) running CentOS 6.5 and Ceph Hammer 0.94.6. All servers use one 900GB disk for the root partition, and the other 13 disks are assigned to OSDs, so we have 5 x 13 = 65 OSDs in total. We also run 1 monitor on every host. Journals are 5GB partitions on each disk (this is something we obviously will need to revisit later).
The purpose of this cluster will be to serve as backend storage for Cinder volumes and Glance images in an OpenStack cloud. With this setup, I'm getting what I consider an "okay" performance:

# rados -p images bench 5 write
Maintaining 16 concurrent writes of 4194304 bytes for up to 5 seconds or 0 objects
Total writes made:      394
Write size:             4194304
Bandwidth (MB/sec):     299.968
Stddev Bandwidth:       127.334
Max bandwidth (MB/sec): 348
Min bandwidth (MB/sec): 0
Average Latency:        0.212524
Stddev Latency:         0.13317
Max latency:            0.828946
Min latency:            0.0707341

Does that look acceptable? How much more can I expect to achieve by fine-tuning and perhaps using a more efficient setup? I do understand the bandwidth above is a product of running 16 concurrent writes of rather small objects (4MB). Bandwidth drops significantly with 64MB objects and a single thread:

# rados -p images bench 5 write -b 67108864 -t 1
Maintaining 1 concurrent writes of 67108864 bytes for up to 5 seconds or 0 objects
Total writes made:      7
Write size:             67108864
Bandwidth (MB/sec):     71.520
Stddev Bandwidth:       24.1897
Max bandwidth (MB/sec): 64
Min bandwidth (MB/sec): 0
Average Latency:        0.894792
Stddev Latency:         0.0547502
Max latency:            0.99311
Min latency:            0.832765

Is such a drop expected? Now, what I'm really concerned about is upload times. Uploading a randomly generated 1GB file takes a bit too long:

# time rados -p images put random_1GB /tmp/random_1GB
real    0m35.328s
user    0m0.560s
sys     0m3.665s

Is this normal? If so, if I set up this cluster as a backend for Glance, does that mean uploading a 1GB image will require 35 seconds (plus whatever time Glance requires to do its own thing)?
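(I also realize the 5-second runs above are quite noisy, given that min bandwidth hit 0, so would a longer run along these lines be more representative? The 60-second duration and the --no-cleanup write followed by a seq read pass are just my guess at how to get a steadier number:)

# rados -p images bench 60 write -t 16 --no-cleanup
# rados -p images bench 60 seq -t 16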
Thanks, Sergio
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com