Re: Observations with a SSD based pool under Hammer

Robert LeBlanc <robert@xxxxxxxxxxxxx> · Wed, 24 Feb 2016 23:01:43 -0700

With my S3500 drives in my test cluster, the latest master branch gave me an almost 2x increase in performance compared to just a month or two ago. There look to be some really nice things coming in Jewel around SSD performance. My drives are now 80-85% busy doing about 10-12K IOPS when running 4K fio against libRBD.
Sent from a mobile device, please excuse any typos.
On Feb 24, 2016 8:10 PM, "Christian Balzer" <chibi@xxxxxxx> wrote:

Hello,

For posterity and of course to ask some questions, here are my experiences
with a pure SSD pool.

SW: Debian Jessie, Ceph Hammer 0.94.5.

HW:
2 nodes (thus replication of 2), each with:
2x E5-2623 CPUs
64GB RAM
4x DC S3610 800GB SSDs
Infiniband (IPoIB) network

Ceph: no tuning or significant/relevant config changes, OSD FS is Ext4,
Ceph journal is inline (journal file).

Performance:

A test run with "rados -p cache  bench 30 write -t 32" (4MB blocks) gives
me about 620MB/s, the storage nodes are I/O bound (all SSDs are 100% busy
according to atop) and this meshes nicely with the speeds I saw when
testing the individual SSDs with fio before involving Ceph.

To elaborate on that, an individual SSD of that type can do about 500MB/s
sequential writes, so ideally you would see 1GB/s writes with Ceph
(500*8/2 (replication)/2 (journal on same disk)).
However my experience tells me that other activities (FS journals, leveldb
PG updates, etc) impact things as well.
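
As a rough sketch of the estimate above, the arithmetic can be written out as follows. The function and constant names are made up for illustration. The model assumes that replication and the co-located journal are the only overheads, which the paragraph above already says is not the whole story.

```python
def ideal_write_mb_s(per_ssd_mb_s, num_ssds, replication, journal_factor):
    """Aggregate client write bandwidth if replication and journaling
    were the only overheads (a deliberately simplistic model)."""
    return per_ssd_mb_s * num_ssds / replication / journal_factor


PER_SSD_MB_S = 500   # sequential write speed of one SSD, from the post
NUM_SSDS = 8         # 2 nodes x 4 SSDs
REPLICATION = 2      # every object is stored on both nodes
JOURNAL_FACTOR = 2   # journal on the same disk, so each write lands twice
MEASURED_MB_S = 620  # rados bench result with 4MB blocks

ideal = ideal_write_mb_s(PER_SSD_MB_S, NUM_SSDS, REPLICATION, JOURNAL_FACTOR)
efficiency = MEASURED_MB_S / ideal

print(f"ideal: {ideal:.0f} MB/s, measured: {MEASURED_MB_S} MB/s "
      f"({efficiency:.0%} of ideal)")
```

With these figures the measured 620MB/s comes out at roughly 62% of the idealized 1GB/s.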

A test run with "rados -p cache  bench 30 write -t 32 -b 4096" (4KB
blocks) gives me about 7200 IOPS, the SSDs are about 40% busy.

All OSD processes are using about 2 cores and the OS another 2, but that
leaves about 6 cores unused (MHz on all cores scales to max during the
test run).

Closer inspection with all CPUs being displayed in atop shows that no
single core is fully used, they all average around 40% and even the
busiest ones (handling IRQs) still have ample capacity available.

I'm wondering if this is an indication of insufficient parallelism or if it's
latency of sorts.

I'm aware of the many tuning settings for SSD based OSDs, however I was
expecting to run into a CPU wall first and foremost.

Write amplification:

10 second rados bench with 4MB blocks, 6348MB written in total.
nand-writes per SSD: 118*32MB = 3776MB.
30208MB total written to all SSDs.
Amplification: 4.75
Very close to what you would expect with a replication of 2 and journal on
same disk.

10 second rados bench with 4KB blocks, 219MB written in total.
nand-writes per SSD: 41*32MB = 1312MB.
10496MB total written to all SSDs.
Amplification: 48!!!
Le ouch.
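
For anyone wanting to repeat the calculation, here is a minimal sketch of how the two amplification figures fall out of the numbers above. The helper name is hypothetical. It assumes the nand-writes counter ticks in 32MB units and that all 8 SSDs saw the same increase, since only a single per-SSD figure is given.

```python
NAND_UNIT_MB = 32  # size of one nand-writes counter unit, as used in the post
NUM_SSDS = 8       # 2 nodes x 4 SSDs


def write_amplification(client_mb, nand_units_per_ssd,
                        num_ssds=NUM_SSDS, unit_mb=NAND_UNIT_MB):
    """Return (total MB written to NAND on all SSDs, amplification factor).

    Assumes every SSD reported the same counter increase.
    """
    total_nand_mb = nand_units_per_ssd * unit_mb * num_ssds
    return total_nand_mb, total_nand_mb / client_mb


# Baseline from replication of 2 and the journal sharing the data disk.
EXPECTED_BASELINE = 2 * 2

runs = (
    ("4MB blocks", 6348, 118),  # label, MB written by rados bench, counter delta
    ("4KB blocks", 219, 41),
)

for label, client_mb, units in runs:
    total_mb, factor = write_amplification(client_mb, units)
    print(f"{label}: {total_mb}MB to NAND, amplification {factor:.2f} "
          f"(baseline {EXPECTED_BASELINE})")
```

The 4MB run works out to roughly 4.76 (quoted as 4.75 above) against a baseline of 4, while the 4KB run lands at about 48.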

In my use case with rbd cache on all VMs I expect writes to be rather
large for the most part and not like this extreme example.
But as I wrote the last time I did this kind of testing, this is an area
where caveat emptor most definitely applies when planning and buying SSDs.
And where the Ceph code could probably do with some attention.

Regards,

Christian
--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
