Hi guys,
I need your help figuring out performance issues on my Ceph cluster.
I've read pretty much every thread on the net about this topic, but I
still haven't managed to get acceptable performance.
At my company, we are planning to replace the NAS of our existing
virtualization infrastructure with a Ceph cluster, in order to improve
the platform's overall performance, scalability and security. Our
current NAS handles about 50k IOPS.
For this we bought:
2 x NFS servers: 2 x Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz, 32 GB
RAM, 2 x 10Gbps network interfaces (bonding)
3 x MON servers: 1 x Intel(R) Xeon(R) CPU E5-2603 v3 @ 1.60GHz, 16 GB
RAM, 2 x 10Gbps network interfaces (bonding)
2 x MDS servers: 2 x Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz, 32 GB
RAM, 2 x 10Gbps network interfaces (bonding)
2 x OSD servers (cache): 2 x Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz,
256 GB RAM, 2 x SSD INTEL SSDSC2BX200G4 (200 GB) for journal, 6 x SSD
INTEL SSDSC2BX016T4R (1.4 TB) for data, 2 x 10Gbps network interfaces
(bonding)
4 x OSD servers (storage): 2 x Intel(R) Xeon(R) CPU E5-2630 v3 @
2.40GHz, 256 GB RAM, 4 x SSD TOSHIBA PX02SMF020 (200 GB) for journal,
18 x HGST Ultrastar HUC101818CS4204 (1.8 TB) for data, 2 x 10Gbps
network interfaces (bonding)
In total, that makes 84 OSDs (2 x 6 SSDs + 4 x 18 HDDs).
I created two pools with 4096 PGs each, one called rbd-cold-storage and
the other rbd-hot-storage. As you may guess, rbd-cold-storage is backed
by the 4 OSD servers with spinning disks and rbd-hot-storage by the 2
OSD servers with SSDs.
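For the record, the pools and the CRUSH split were set up roughly like
this (the crush_ruleset ids below are placeholders for the SSD/HDD
rules defined in our crushmap):
$ ceph osd pool create rbd-cold-storage 4096 4096
$ ceph osd pool create rbd-hot-storage 4096 4096
$ ceph osd pool set rbd-cold-storage crush_ruleset 1
$ ceph osd pool set rbd-hot-storage crush_ruleset 2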
On rbd-cold-storage, I created an RBD image which is mapped on the NFS
server.
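Roughly like this (the image name and size are just examples):
$ rbd create rbd-cold-storage/nfs-backing --size 10485760
$ rbd map rbd-cold-storage/nfs-backing    # shows up as /dev/rbd0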
I benchmarked each of the SSDs we have: each one can handle 40k IOPS.
As my replication factor is 2, the theoretical performance of the
cluster is (2 servers x 6 cache SSDs x 40k) / 2 replicas = 240k IOPS.
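For reference, the per-SSD figure comes from running fio directly
against the raw device, along these lines (the device name is an
example):
$ fio --name=ssd-bench --filename=/dev/sdb --direct=1 \
      --rw=randwrite --bs=4k --iodepth=32 --numjobs=1 --runtime=60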
I'm currently benchmarking the cluster with fio from one of the NFS
servers. Here is my fio job file:
[global]
ioengine=libaio
iodepth=32
runtime=300
direct=1
filename=/dev/rbd0
group_reporting=1
gtod_reduce=1
randrepeat=1
size=4G
numjobs=1
[4k-rand-write]
new_group
bs=4k
rw=randwrite
stonewall
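It is launched straight against the mapped device, e.g. (the job file
name is just an example):
$ fio 4k-rand-write.fio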
The problem is that I can't get more than 15k IOPS for writes. In my
monitoring, I can see that each of the cache OSD SSDs does no more than
2.5k IOPS, which lines up: 6 x 2.5k = 15k IOPS. I don't expect to reach
the theoretical value, but reaching 100k IOPS would be perfect.
My cluster is running on Debian Jessie with the Ceph Hammer v0.94.5
Debian packages (compiled with the --with-jemalloc option; I also tried
without it).
Here is my ceph.conf:
[global]
fsid = 5046f766-670f-4705-adcc-290f434c8a83
# basic settings
mon initial members = a01cepmon001,a01cepmon002,a01cepmon003
mon host = 10.10.69.254,10.10.69.253,10.10.69.252
mon osd allow primary affinity = true
# network settings
public network = 10.10.69.128/25
cluster network = 10.10.69.0/25
# auth settings
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
# default pools settings
osd pool default size = 2
osd pool default min size = 1
osd pool default pg num = 8192
osd pool default pgp num = 8192
osd crush chooseleaf type = 1
# debug settings
debug lockdep = 0/0
debug context = 0/0
debug crush = 0/0
debug buffer = 0/0
debug timer = 0/0
debug journaler = 0/0
debug osd = 0/0
debug optracker = 0/0
debug objclass = 0/0
debug filestore = 0/0
debug journal = 0/0
debug ms = 0/0
debug monc = 0/0
debug tp = 0/0
debug auth = 0/0
debug finisher = 0/0
debug heartbeatmap = 0/0
debug perfcounter = 0/0
debug asok = 0/0
debug throttle = 0/0
throttler perf counter = false
osd enable op tracker = false
## OSD settings
[osd]
# OSD FS settings
osd mkfs type = xfs
osd mkfs options xfs = -f -i size=2048
osd mount options xfs = rw,noatime,logbsize=256k,delaylog
# OSD journal settings
osd journal block align = true
osd journal aio = true
osd journal dio = true
# Performance tuning
filestore xattr use omap = true
filestore merge threshold = 40
filestore split multiple = 8
filestore max sync interval = 10
filestore queue max ops = 100000
filestore queue max bytes = 1GiB
filestore op threads = 20
filestore journal writeahead = true
filestore fd cache size = 10240
osd op threads = 8
Disabling throttling doesn't change anything.
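For what it's worth, the throttles were toggled at runtime with
injectargs, along these lines (the value is just an example):
$ ceph tell osd.* injectargs '--filestore_queue_max_ops 100000'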
So, after everything I've read, I'd like to know: has anyone managed to
fix this kind of problem since those months-old threads? Any ideas or
thoughts on how to improve this?
Thanks.
Rémi