Re: Possible improvements for a slow write speed (excluding independent SSD journals)

On 4/20/2015 11:01 AM, Christian Balzer wrote:
Hello,

On Mon, 20 Apr 2015 10:30:41 -0400 J-P Methot wrote:

Hi,

This is similar to another thread running right now, but since our
current setup is completely different from the one described in the
other thread, I thought it may be better to start a new one.

We are running Ceph Firefly 0.80.8 (soon to be upgraded to 0.80.9). We
have 6 OSD hosts with 16 OSDs each (so a total of 96 OSDs). Each OSD is
a Samsung SSD 840 EVO on which I can reach write speeds of roughly 400
MB/sec, plugged in as JBOD on a controller that can theoretically
transfer at 6 Gb/sec. All of that is linked to OpenStack compute nodes
over two bonded 10 Gbps links (so a maximum transfer rate of 20 Gbps).

I sure as hell hope you're not planning to write all that much to this
cluster.
But then again, you're worried about write speed, so I guess you do.
Those _consumer_ SSDs will be dropping like flies; there are a number
of threads about them here.

They also might be of the kind that doesn't play well with O_DSYNC; I
can't recall for sure right now, check the archives.
Consumer SSDs universally tend to slow down quite a bit when not
TRIMmed and/or subjected to prolonged writes, like those generated by a
benchmark.
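
A common way to check for the O_DSYNC problem is fio's synchronous 4k
write test against the bare device. A rough sketch, assuming a spare
drive (the device name is a placeholder, and the test overwrites data,
so never point it at an OSD in service):

  fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
      --numjobs=1 --iodepth=1 --runtime=60 --time_based \
      --group_reporting --name=journal-test

Drives that mishandle O_DSYNC tend to collapse to a few hundred IOPS
here; SSDs suited for journal duty sustain far more.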

I see, yes, it looks like these SSDs are not the best for the job. We
will not change them for now, but if they start failing, we will
replace them with better ones.

When I run rados bench from the compute nodes, I reach the network cap
in read speed. However, write speeds are vastly inferior, reaching
about 920 MB/sec. If I have 4 compute nodes running the write benchmark
at the same time, I see the number plummet to 350 MB/sec. For our
planned usage, we find that rather slow, considering we will run a high
number of virtual machines on this cluster.
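
For reference, an invocation along these lines ("testpool" is a
placeholder; rados bench defaults to 16 concurrent operations on 4 MB
objects):

  # write phase; --no-cleanup keeps the objects so reads have data
  rados bench -p testpool 60 write --no-cleanup
  # sequential read phase over the objects written above
  rados bench -p testpool 60 seq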

If you'd researched the archives about SSD-based clusters, you'd be
aware that things are likely going to be CPU-bound, especially with
Firefly or older.
Verify that with atop or the like: how busy are your CPUs and SSDs on
those nodes when running these benchmarks?
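
Something like this on an OSD node while a benchmark runs will show
both at a glance, assuming atop (or sysstat, for iostat) is installed:

  atop 2         # CPU and disk busy percentages, 2-second samples
  iostat -x 2    # alternative: extended per-device utilization

If the ceph-osd processes pin their cores while the SSDs sit well below
their limits, you're CPU-bound.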

Also, default rados bench parameters?

You'll get a much more realistic result with fio, especially with fio from
inside VMs.

I did fio tests from inside VMs in the past, and they were pretty much
in line with the results I get from rados bench. I am using the default
rados bench parameters.
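
The fio runs inside the VMs were roughly of this shape (a sketch; the
exact parameters here are assumptions, not the original job file):

  fio --name=vm-randwrite --ioengine=libaio --direct=1 \
      --rw=randwrite --bs=4k --size=4G --numjobs=4 --iodepth=32 \
      --runtime=60 --time_based --group_reporting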

The load numbers on the OSD nodes aren't that high: I'm seeing a load
of 3 on a six-core E5-2620, which is low for such a CPU. Regardless, I
am currently upgrading to Hammer in the hope of at least achieving some
progress.

Of course, the first thing to do would be to move the journals to
faster drives. However, these are SSDs we're talking about; we don't
really have access to faster drives. I must find a way to get better
write speeds, so I am looking for suggestions on how to make this
faster.

I have also thought of some options myself, like:
- Upgrading to the latest stable Hammer version (would that really give
me a big performance increase?)
If you're CPU-bound right now, yes.
Don't expect miracles, though.
From what I've seen so far (no test cluster of the SSD kind myself), it
seems 50% faster, maybe more.

- CRUSH map modifications? (This is a long shot, but I'm still using
the default CRUSH map; maybe there's a change there I could make to
improve performance.)

You have uniform OSDs, so not really.

Any suggestions as to anything else I can tweak would be strongly
appreciated.

For reference, here's part of my ceph.conf:

[global]
auth_service_required = cephx
filestore_xattr_use_omap = true
auth_client_required = cephx
auth_cluster_required = cephx
osd pool default size = 3


osd pg bits = 12
osd pgp bits = 12
osd pool default pg num = 800
osd pool default pgp num = 800

How many pools do you have?
For your cluster, a total of 8192 PGs would be about right.
PG counts should be powers of 2 if possible, so with 8-10 pools (of
equal expected data size) a value of 1024 per pool would be better.

Regards,

Christian
Running 10 pools right now, but 2 are for test purposes. Once the two
test pools are removed, I'm at around 8000 PGs, which is about what you
suggested. I can increase the PG count to 1024 per pool, but I'm not
convinced a change of ~200 PGs per pool will show a drastic improvement
in speed.
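
If we do raise them, the change per pool would presumably be (pool name
is a placeholder; pg_num goes first, pgp_num after, and the rebalance
will move data around):

  ceph osd pool set <pool> pg_num 1024
  ceph osd pool set <pool> pgp_num 1024
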
[client]
rbd cache = true
rbd cache writethrough until flush = true

[osd]
filestore_fd_cache_size = 1000000
filestore_omap_header_cache_size = 1000000
filestore_fd_cache_random = true
filestore_queue_max_ops = 5000
journal_queue_max_ops = 1000000
max_open_files = 1000000
osd journal size = 10000
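
As an aside, the values a running OSD actually uses can be
double-checked over the admin socket (assuming the default socket
path):

  ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show \
      | grep filestore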




--
======================
Jean-Philippe Méthot
Administrateur système / System administrator
GloboTech Communications
Phone: 1-514-907-0050
Toll Free: 1-(888)-GTCOMM1
Fax: 1-(514)-907-0750
jpmethot@xxxxxxxxxx
http://www.gtcomm.net

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




