Hello Robert,
Sorry for the late answer, and thanks for your reply. I updated to Infernalis and applied all your recommendations, but it doesn't change anything, with or without cache tiering :-/
I also compared XFS to EXT4 and BTRFS, but that doesn't make a difference either.
The fio command from Sebastien Han's post tells me my disks can actually do 100k IOPS, so it's really frustrating :-S
Rémi
On 2015-11-07 15:59, Robert LeBlanc wrote:
You most likely did the wrong test to get baseline IOPS for Ceph or for your SSDs. Ceph is really hard on SSDs: it does direct sync writes, which drives handle very differently, even between models of the same brand. Start with
http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
for your baseline numbers, and just realize that Hammer still can't use all of those IOPS.
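For reference, the journal test from that post is along these lines (the device name is a placeholder; check the post for the exact invocation, and note it writes directly to the raw device):

fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 \
    --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test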
I was able to gain 50% in SSD IOPS by: disabling transparent huge pages, LD_PRELOADing jemalloc (it uses a little more RAM, but your config should be OK), enabling numad, tuning irqbalance, setting vfs_cache_pressure to 500, greatly increasing the network buffers, and disabling TCP slow start. We are also using EXT4, which I've found is a bit faster, but it has recently been reported that someone is hitting deadlocks/crashes with it. We are having an XFS log issue on one of our clusters that causes an OSD or two to fail every week.
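Roughly, those host-level knobs map to something like the following; the exact sysctl names, values and paths below are illustrative, not necessarily what we run:

# disable transparent huge pages; reclaim dentry/inode cache more aggressively
echo never > /sys/kernel/mm/transparent_hugepage/enabled
sysctl -w vm.vfs_cache_pressure=500
# assumed sysctls for "bigger network buffers" and "no TCP slow start"
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.ipv4.tcp_slow_start_after_idle=0
# jemalloc via LD_PRELOAD for the OSD daemons (library path is distro-dependent,
# normally set in the daemons' environment rather than on the command line);
# numad and irqbalance are managed as services per your distro
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1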
When I tested the same workload in an SSD cache tier, the performance was only 50% of what I was able to achieve on the pure SSD tier (I'm guessing overhead of the cache tier), and this was with the entire test set in the SSD tier, so there was no spindle activity.
Short answer: you will need a lot more SSDs to hit your target with Hammer. Or, if you can wait for Jewel, you may be able to get by with needing only a little bit more.
Robert LeBlanc
Sent from a mobile device, please excuse any typos.
On Nov 7, 2015 1:24 AM, "Rémi BUISSON" <remi-buisson@xxxxxxxxx> wrote:
Hi guys,
I need your help to figure out performance issues on my Ceph cluster. I've read pretty much every thread on the net concerning this topic, but I haven't managed to get acceptable performance.
In my company, we are planning to replace the NAS of our existing virtualization infrastructure with a Ceph cluster, in order to improve the overall platform's performance, scalability and security. The current NAS handles about 50k IOPS.
For this we bought:
2 x NFS servers: 2 x Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz, 32 GB RAM, 2 x 10Gbps network interfaces (bonding)
3 x MON servers: 1 x Intel(R) Xeon(R) CPU E5-2603 v3 @ 1.60GHz, 16 GB RAM, 2 x 10Gbps network interfaces (bonding)
2 x MDS servers: 2 x Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz, 32 GB RAM, 2 x 10Gbps network interfaces (bonding)
2 x OSD servers (cache): 2 x Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz, 256 GB RAM, 2 x SSD INTEL SSDSC2BX200G4 (200 GB) for journal, 6 x SSD INTEL SSDSC2BX016T4R (1.4 TB) for data, 2 x 10Gbps network interfaces (bonding)
4 x OSD servers (storage): 2 x Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz, 256 GB RAM, 4 x SSD TOSHIBA PX02SMF020 (200 GB) for journal, 18 x HGST Ultrastar HUC101818CS4204 (1.8 TB) for data, 2 x 10Gbps network interfaces (bonding)
The total is 84 OSDs.
I created two pools with 4096 PGs each, one called rbd-cold-storage and the other rbd-hot-storage. As you may guess, rbd-cold-storage is made up of the 4 OSD servers with spinning disks, and rbd-hot-storage of the 2 OSD servers with SSDs.
On rbd-cold-storage, I created an RBD device which is mapped on the NFS server.
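Roughly, the setup was along these lines (the image name and size below are from memory, and the CRUSH rules that pin each pool to the SSD or spinner servers are omitted):

ceph osd pool create rbd-cold-storage 4096 4096
ceph osd pool create rbd-hot-storage 4096 4096
# when cache tiering is enabled:
ceph osd tier add rbd-cold-storage rbd-hot-storage
ceph osd tier cache-mode rbd-hot-storage writeback
ceph osd tier set-overlay rbd-cold-storage rbd-hot-storage
# the RBD image exported through the NFS server (size in MB on Hammer)
rbd create rbd-cold-storage/nfs-vol --size 1048576
rbd map rbd-cold-storage/nfs-vol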
I benched each of the SSDs we have, and each can handle 40k IOPS. As my replication factor is 2, the theoretical performance of the cluster is (2 servers x 6 SSDs (OSD cache) x 40k) / 2 = 240k IOPS.
I'm currently benching the cluster with the fio tool from one of the NFS servers. Here is my fio job file:
[global]
ioengine=libaio
iodepth=32
runtime=300
direct=1
filename=/dev/rbd0
group_reporting=1
gtod_reduce=1
randrepeat=1
size=4G
numjobs=1
[4k-rand-write]
new_group
bs=4k
rw=randwrite
stonewall
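I run it against the mapped device with something like the following (the job file name is just what I call it locally):

fio rbd-bench.fio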
The problem is that I can't get more than 15k IOPS for writes. In my monitoring engine, I can see that each of the OSD (cache) SSDs is doing no more than 2.5k IOPS, which seems to correspond to 6 x 2.5k = 15k IOPS. I don't expect to reach the theoretical value, but reaching 100k IOPS would be perfect.
My cluster is running on Debian Jessie with the Ceph Hammer v0.94.5 Debian package (compiled with the --with-jemalloc option; I also tried without). Here is my ceph.conf:
[global]
fsid = 5046f766-670f-4705-adcc-290f434c8a83
# basic settings
mon initial members = a01cepmon001,a01cepmon002,a01cepmon003
mon host = 10.10.69.254,10.10.69.253,10.10.69.252
mon osd allow primary affinity = true
# network settings
public network = 10.10.69.128/25
cluster network = 10.10.69.0/25
# auth settings
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
# default pools settings
osd pool default size = 2
osd pool default min size = 1
osd pool default pg num = 8192
osd pool default pgp num = 8192
osd crush chooseleaf type = 1
# debug settings
debug lockdep = 0/0
debug context = 0/0
debug crush = 0/0
debug buffer = 0/0
debug timer = 0/0
debug journaler = 0/0
debug osd = 0/0
debug optracker = 0/0
debug objclass = 0/0
debug filestore = 0/0
debug journal = 0/0
debug ms = 0/0
debug monc = 0/0
debug tp = 0/0
debug auth = 0/0
debug finisher = 0/0
debug heartbeatmap = 0/0
debug perfcounter = 0/0
debug asok = 0/0
debug throttle = 0/0
throttler perf counter = false
osd enable op tracker = false
## OSD settings
[osd]
# OSD FS settings
osd mkfs type = xfs
osd mkfs options xfs = -f -i size=2048
osd mount options xfs = rw,noatime,logbsize=256k,delaylog
# OSD journal settings
osd journal block align = true
osd journal aio = true
osd journal dio = true
# Performance tuning
filestore xattr use omap = true
filestore merge threshold = 40
filestore split multiple = 8
filestore max sync interval = 10
filestore queue max ops = 100000
filestore queue max bytes = 1GiB
filestore op threads = 20
filestore journal writeahead = true
filestore fd cache size = 10240
osd op threads = 8
Disabling throttling doesn't change anything.
So, after everything I've read, I would like to know whether, since those threads from a few months ago, anyone has managed to fix this kind of problem? Any ideas or thoughts on how to improve this?
Thanks.
Rémi
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com