With my S3500 drives in my test cluster, the latest master branch gave me an almost 2x increase in performance compared to just a month or two ago. There look to be some really nice things coming in Jewel around SSD performance. My drives are now 80-85% busy doing about 10-12K IOPS when doing 4K fio to libRBD.
On Feb 24, 2016 8:10 PM, "Christian Balzer" <chibi@xxxxxxx> wrote:
Hello,
For posterity and of course to ask some questions, here are my experiences
with a pure SSD pool.
SW: Debian Jessie, Ceph Hammer 0.94.5.
HW:
2 nodes (thus replication of 2), each with:
2x E5-2623 CPUs
64GB RAM
4x DC S3610 800GB SSDs
Infiniband (IPoIB) network
Ceph: no tuning or significant/relevant config changes, OSD FS is Ext4,
Ceph journal is inline (journal file).
Performance:
A test run with "rados -p cache bench 30 write -t 32" (4MB blocks) gives
me about 620MB/s. The storage nodes are I/O bound (all SSDs are 100% busy
according to atop), and this meshes nicely with the speeds I saw when
testing the individual SSDs with fio before involving Ceph.
To elaborate on that, an individual SSD of that type can do about 500MB/s
sequential writes, so ideally you would see 1GB/s writes with Ceph
(500 * 8 / 2 (replication) / 2 (journal on same disk)).
However my experience tells me that other activities (FS journals, leveldb
PG updates, etc) impact things as well.
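The back-of-the-envelope estimate above can be written out as a small sketch. The function name and parameters are illustrative only (they are not part of Ceph or rados); the figures are the ones quoted in this post.

```python
def ideal_write_mb_s(per_ssd_mb_s, ssd_count, replication, journal_penalty):
    """Best-case client write throughput in MB/s.

    Every client write is stored `replication` times, and with the journal
    on the same disk each copy is written twice (journal_penalty = 2).
    """
    return per_ssd_mb_s * ssd_count / replication / journal_penalty


# Figures from the post: 8 SSDs at ~500MB/s each, replication 2, inline journal.
ideal = ideal_write_mb_s(500, 8, 2, 2)
measured = 620  # MB/s reported by rados bench with 4MB blocks

print(f"ideal: {ideal:.0f} MB/s, measured: {measured} MB/s, "
      f"efficiency: {measured / ideal:.0%}")
```

With these inputs the measured 620MB/s comes out at roughly 62% of the ideal 1GB/s, the remainder presumably going to the other activities mentioned above.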
A test run with "rados -p cache bench 30 write -t 32 -b 4096" (4KB
blocks) gives me about 7200 IOPS, with the SSDs about 40% busy.
All OSD processes are using about 2 cores and the OS another 2, but that
leaves about 6 cores unused (MHz on all cores scales to max during the
test run).
Closer inspection with all CPUs being displayed in atop shows that no
single core is fully used; they all average around 40% and even the
busiest ones (handling IRQs) still have ample capacity available.
I'm wondering if this is an indication of insufficient parallelism or if it's
latency of sorts.
I'm aware of the many tuning settings for SSD based OSDs, however I was
expecting to run into a CPU wall first and foremost.
Write amplification:
10 second rados bench with 4MB blocks, 6348MB written in total.
NAND writes per SSD: 118 * 32MB = 3776MB.
30208MB total written to all SSDs.
Amplification: 4.76
Very close to what you would expect with a replication of 2 and journal on
same disk.
10 second rados bench with 4KB blocks, 219MB written in total.
NAND writes per SSD: 41 * 32MB = 1312MB.
10496MB total written to all SSDs.
Amplification: 48!!!
Le ouch.
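For anyone wanting to repeat the arithmetic on their own counters, here is a minimal sketch of the calculation used for both runs. It assumes, as in the numbers above, that the NAND write counter advances in 32MB units and that the per-SSD delta is the same on all 8 SSDs; the function and constant names are made up for illustration.

```python
NAND_UNIT_MB = 32  # counter granularity used in the figures above
SSD_COUNT = 8      # 2 nodes with 4 SSDs each


def write_amplification(client_mb, units_per_ssd,
                        ssd_count=SSD_COUNT, unit_mb=NAND_UNIT_MB):
    """Ratio of data written to NAND across all SSDs to data written by the client."""
    nand_mb = units_per_ssd * unit_mb * ssd_count
    return nand_mb / client_mb


# 10 second run with 4MB blocks: 6348MB written, counter delta of 118 per SSD.
print(f"4MB blocks: {write_amplification(6348, 118):.2f}")

# 10 second run with 4KB blocks: 219MB written, counter delta of 41 per SSD.
print(f"4KB blocks: {write_amplification(219, 41):.1f}")
```

This gives roughly 4.76 for the 4MB run and 47.9 for the 4KB run, in line with the figures quoted above. A baseline of 4 is what replication of 2 plus journal on the same disk would give on their own.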
In my use case with rbd cache on all VMs I expect writes to be rather
large for the most part and not like this extreme example.
But as I wrote the last time I did this kind of testing, this is an area
where caveat emptor most definitely applies when planning and buying SSDs.
And where the Ceph code could probably do with some attention.
Regards,
Christian
--
Christian Balzer Network/Systems Engineer
chibi@xxxxxxx Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com