Hi guys,
I need your help figuring out performance issues on my Ceph cluster.
I've read pretty much every thread on the net about this topic, but I
still haven't managed to get acceptable performance.
At my company, we are planning to replace the NAS of our existing
virtualization infrastructure with a Ceph cluster, in order to improve
the platform's overall performance, scalability and security. Our
current NAS handles about 50k IOPS.
For this we bought:
2 x NFS servers: 2 x Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz, 32 GB
RAM, 2 x 10Gbps network interfaces (bonding)
3 x MON servers: 1 x Intel(R) Xeon(R) CPU E5-2603 v3 @ 1.60GHz, 16 GB
RAM, 2 x 10Gbps network interfaces (bonding)
2 x MDS servers: 2 x Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz, 32 GB
RAM, 2 x 10Gbps network interfaces (bonding)
2 x OSD servers (cache): 2 x Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz,
256 GB RAM, 2 x SSD INTEL SSDSC2BX200G4 (200 GB) for journal, 6 x SSD
INTEL SSDSC2BX016T4R (1.4 TB) for data, 2 x 10Gbps network interfaces
(bonding)
4 x OSD servers (storage): 2 x Intel(R) Xeon(R) CPU E5-2630 v3 @
2.40GHz, 256 GB RAM, 4 x SSD TOSHIBA PX02SMF020 (200 GB) for journal,
18 x HGST Ultrastar HUC101818CS4204 (1.8 TB) for data, 2 x 10Gbps
network interfaces (bonding)
In total, that makes 84 OSDs (2 x 6 SSDs + 4 x 18 HDDs).
I created two pools with 4096 PGs each, one called rbd-cold-storage and
the other rbd-hot-storage. As you may guess, rbd-cold-storage is backed
by the 4 OSD servers with spinning disks and rbd-hot-storage by the 2
OSD servers with SSDs.
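For the record, the pools and the CRUSH split were set up roughly like
this (the crush_ruleset ids below are placeholders for the SSD/HDD
rules defined in our crushmap):
$ ceph osd pool create rbd-cold-storage 4096 4096
$ ceph osd pool create rbd-hot-storage 4096 4096
$ ceph osd pool set rbd-cold-storage crush_ruleset 1
$ ceph osd pool set rbd-hot-storage crush_ruleset 2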
On rbd-cold-storage, I created an RBD image which is mapped on the NFS
server.
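Roughly like this (the image name and size are just examples):
$ rbd create rbd-cold-storage/nfs-backing --size 10485760
$ rbd map rbd-cold-storage/nfs-backing    # shows up as /dev/rbd0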
I benchmarked each of the SSDs we have: each one can handle 40k IOPS.
As my replication factor is 2, the theoretical performance of the
cluster is (2 servers x 6 cache SSDs x 40k) / 2 replicas = 240k IOPS.
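For reference, the per-SSD figure comes from running fio directly
against the raw device, along these lines (the device name is an
example):
$ fio --name=ssd-bench --filename=/dev/sdb --direct=1 \
      --rw=randwrite --bs=4k --iodepth=32 --numjobs=1 --runtime=60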
I'm currently benchmarking the cluster with fio from one of the NFS
servers. Here is my fio job file:
[global]
ioengine=libaio
iodepth=32
runtime=300
direct=1
filename=/dev/rbd0
group_reporting=1
gtod_reduce=1
randrepeat=1
size=4G
numjobs=1
[4k-rand-write]
new_group
bs=4k
rw=randwrite
stonewall
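It is launched straight against the mapped device, e.g. (the job file
name is just an example):
$ fio 4k-rand-write.fio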
The problem is that I can't get more than 15k IOPS for writes. In my
monitoring, I can see that each of the cache OSD SSDs does no more than
2.5k IOPS, which lines up: 6 x 2.5k = 15k IOPS. I don't expect to reach
the theoretical value, but reaching 100k IOPS would be perfect.
My cluster is running on Debian Jessie with the Ceph Hammer v0.94.5
Debian packages (compiled with the --with-jemalloc option; I also tried
without it).
Here is my ceph.conf:
[global]
fsid = 5046f766-670f-4705-adcc-290f434c8a83
# basic settings
mon initial members = a01cepmon001,a01cepmon002,a01cepmon003
mon host = 10.10.69.254,10.10.69.253,10.10.69.252
mon osd allow primary affinity = true
# network settings
public network = 10.10.69.128/25
cluster network = 10.10.69.0/25
# auth settings
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
# default pools settings
osd pool default size = 2
osd pool default min size = 1
osd pool default pg num = 8192
osd pool default pgp num = 8192
osd crush chooseleaf type = 1
# debug settings
debug lockdep = 0/0
debug context = 0/0
debug crush = 0/0
debug buffer = 0/0
debug timer = 0/0
debug journaler = 0/0
debug osd = 0/0
debug optracker = 0/0
debug objclass = 0/0
debug filestore = 0/0
debug journal = 0/0
debug ms = 0/0
debug monc = 0/0
debug tp = 0/0
debug auth = 0/0
debug finisher = 0/0
debug heartbeatmap = 0/0
debug perfcounter = 0/0
debug asok = 0/0
debug throttle = 0/0
throttler perf counter = false
osd enable op tracker = false
## OSD settings
[osd]
# OSD FS settings
osd mkfs type = xfs
osd mkfs options xfs = -f -i size=2048
osd mount options xfs = rw,noatime,logbsize=256k,delaylog
# OSD journal settings
osd journal block align = true
osd journal aio = true
osd journal dio = true
# Performance tuning
filestore xattr use omap = true
filestore merge threshold = 40
filestore split multiple = 8
filestore max sync interval = 10
filestore queue max ops = 100000
filestore queue max bytes = 1GiB
filestore op threads = 20
filestore journal writeahead = true
filestore fd cache size = 10240
osd op threads = 8
Disabling throttling doesn't change anything.
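For what it's worth, the throttles were toggled at runtime with
injectargs, along these lines (the value is just an example):
$ ceph tell osd.* injectargs '--filestore_queue_max_ops 100000'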
So, after everything I've read, I'd like to know: has anyone managed to
fix this kind of problem since those months-old threads? Any ideas or
thoughts on how to improve this?
Thanks.
Rémi