Re: Possible improvements for a slow write speed (excluding independent SSD journals)

Christian Balzer <chibi@xxxxxxx> · Tue, 21 Apr 2015 00:01:28 +0900

Hello,

On Mon, 20 Apr 2015 10:30:41 -0400 J-P Methot wrote:

> Hi,
> 
> This is similar to another thread running right now, but since our 
> current setup is completely different from the one described in the 
> other thread, I thought it may be better to start a new one.
> 
> We are running Ceph Firefly 0.80.8 (soon to be upgraded to 0.80.9). We 
> have 6 OSD hosts with 16 OSD each (so a total of 96 OSDs). Each OSD is a 
> Samsung SSD 840 EVO on which I can reach write speeds of roughly 400 
> MB/sec, plugged in jbod on a controller that can theoretically transfer 
> at 6gb/sec. All of that is linked to openstack compute nodes on two 
> bonded 10gbps links (so a max transfer rate of 20 gbps).
> 

I sure as hell hope you're not planning to write all that much to this
cluster. 
But then again you're worried about write speed, so I guess you are.
Those _consumer_ SSDs will be dropping like flies; there are a number of
threads about them here.

They might also be of the kind that doesn't play well with O_DSYNC. I can't
recall for sure right now, so check the archives.

Consumer SSDs universally tend to slow down quite a bit when not TRIM'ed
and/or subjected to prolonged writes, like those generated by a benchmark.

> When I run rados bench from the compute nodes, I reach the network cap 
> in read speed. However, write speeds are vastly inferior, reaching about 
> 920 MB/sec. If I have 4 compute nodes running the write benchmark at the 
> same time, I can see the number plummet to 350 MB/sec . For our planned 
> usage, we find it to be rather slow, considering we will run a high 
> number of virtual machines in there.
> 
If you'd researched the archives about SSD based clusters, you'd be aware
that things are likely going to be CPU bound, especially with Firefly or
older.
Verify that with atop or the like: how busy are your CPUs and SSDs on
those nodes when running these benchmarks?
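
As a rough back-of-the-envelope check of why raw sequential SSD throughput
is probably not the limit, here is a small sketch. It assumes the journals
live on the same SSDs as the data (so every replica is written twice) and
that writes spread evenly over all OSDs; the function name and the journal
factor of 2 are my own illustrative assumptions, not measurements.

```python
def per_osd_write_mb_s(client_mb_s, replicas, osds, journal_factor=2):
    """Approximate MB/s each OSD device has to absorb.

    journal_factor=2 assumes a co-located journal: each replica is
    written once to the journal and once to the filestore.
    """
    return client_mb_s * replicas * journal_factor / osds


# Numbers from the original post: 920 MB/s from rados bench,
# "osd pool default size = 3", 6 hosts x 16 OSDs = 96 OSDs.
load = per_osd_write_mb_s(client_mb_s=920, replicas=3, osds=96)
print(f"approx. {load:.1f} MB/s per SSD")
```

Under those assumptions each SSD sees well under the roughly 400 MB/sec
you measured on a single drive, which is why CPU usage (or poor sync write
behaviour of the SSDs) is the more likely suspect.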

Also, are you using the default rados bench parameters?

You'll get a much more realistic result with fio, especially with fio from
inside VMs.

> Of course, the first thing to do would be to transfer the journal on 
> faster drives. However, these are SSDs we're talking about. We don't 
> really have access to faster drives. I must find a way to get better 
> write speeds. Thus, I am looking for suggestions as to how to make it 
> faster.
> 
> I have also thought of options myself like:
> -Upgrading to the latest stable hammer version (would that really give 
> me a big performance increase?)

If you're CPU bound right now, yes.
Don't expect miracles, though.
From what I've seen so far (no test cluster of the SSD kind myself) it
seems 50% faster, maybe more.

> -Crush map modifications? (this is a long shot, but I'm still using the 
> default crush map, maybe there's a change there I could make to improve 
> performances)
> 
You have uniform OSDs, so not really.

> Any suggestions as to anything else I can tweak would be strongly 
> appreciated.
> 
> For reference, here's part of my ceph.conf:
> 
> [global]
> auth_service_required = cephx
> filestore_xattr_use_omap = true
> auth_client_required = cephx
> auth_cluster_required = cephx
> osd pool default size = 3
> 
> 
> osd pg bits = 12
> osd pgp bits = 12
> osd pool default pg num = 800
> osd pool default pgp num = 800
>
How many pools do you have?
For your cluster a total of 8192 would be about right.
PG counts should be powers of 2 if possible, so with 8-10 pools (of equal
expected data size) a value of 1024 per pool would be better.
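
To make the arithmetic explicit, a minimal sketch using the numbers from
this thread (96 OSDs, size 3, 8 pools of 1024 PGs each). The helper names
are made up for illustration only.

```python
def is_power_of_two(n):
    """True if n is a positive power of two."""
    return n > 0 and (n & (n - 1)) == 0


def pg_copies_per_osd(total_pgs, replicas, osds):
    """Average number of PG replicas each OSD ends up hosting."""
    return total_pgs * replicas / osds


OSDS = 96       # 6 hosts x 16 OSDs
REPLICAS = 3    # osd pool default size = 3

print(is_power_of_two(800))    # the configured default pg num
print(is_power_of_two(1024))   # the suggested per-pool value

total_pgs = 8 * 1024           # 8 pools of 1024 PGs = 8192 in total
print(pg_copies_per_osd(total_pgs, REPLICAS, OSDS))
```

With 8 pools that works out to 8192 PGs in total and 256 PG copies per
OSD; with more or fewer pools the per-pool value needs adjusting.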

Regards,

Christian

> [client]
> rbd cache = true
> rbd cache writethrough until flush = true
> 
> [osd]
> filestore_fd_cache_size = 1000000
> filestore_omap_header_cache_size = 1000000
> filestore_fd_cache_random = true
> filestore_queue_max_ops = 5000
> journal_queue_max_ops = 1000000
> max_open_files = 1000000
> osd journal size = 10000
> 

-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
http://www.gol.com/