Observations with a SSD based pool under Hammer

Hello, 

For posterity and of course to ask some questions, here are my experiences
with a pure SSD pool.

SW: Debian Jessie, Ceph Hammer 0.94.5.

HW:
2 nodes (thus replication of 2), each with:
2x E5-2623 CPUs
64GB RAM
4x DC S3610 800GB SSDs
Infiniband (IPoIB) network

Ceph: no tuning or significant/relevant config changes, OSD FS is Ext4,
Ceph journal is inline (journal file).

Performance:
A test run with "rados -p cache bench 30 write -t 32" (4MB blocks) gives
me about 620MB/s. The storage nodes are I/O bound (all SSDs are 100% busy
according to atop), which meshes nicely with the speeds I saw when
testing the individual SSDs with fio before involving Ceph.

To elaborate on that, an individual SSD of that type can do about 500MB/s
sequential writes, so ideally you would see 1GB/s writes with Ceph
(500*8/2 (replication)/2 (journal on same disk)).
However my experience tells me that other activities (FS journals, leveldb
PG updates, etc) impact things as well.
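
As a rough sketch of that arithmetic (only the numbers come from this post; the function and variable names are made up for illustration, and the model ignores FS journal and leveldb overhead entirely):

```python
def ideal_write_ceiling(per_ssd_mb_s, ssd_count, replicas, journal_writes):
    """Best-case aggregate client write rate in MB/s.

    Every client byte is stored `replicas` times, and with an inline
    journal each copy is written `journal_writes` times to the same SSD.
    """
    return per_ssd_mb_s * ssd_count / replicas / journal_writes


# Figures from the post: 8 SSDs at ~500MB/s, replication 2, inline journal.
ceiling = ideal_write_ceiling(500, 8, 2, 2)
measured = 620  # MB/s reported by rados bench with 4MB blocks
efficiency = measured / ceiling

print(f"ideal ceiling: {ceiling:.0f} MB/s")
print(f"measured:      {measured} MB/s ({efficiency:.0%} of ideal)")
```

The gap between the two numbers is what the other activities mentioned above would have to account for.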

A test run with "rados -p cache  bench 30 write -t 32 -b 4096" (4KB
blocks) gives me about 7200 IOPS, the SSDs are about 40% busy.
All OSD processes are using about 2 cores and the OS another 2, but that
leaves about 6 cores unused (MHz on all cores scales to max during the
test run). 
Closer inspection with all CPUs being displayed in atop shows that no
single core is fully used, they all average around 40% and even the
busiest ones (handling IRQs) still have ample capacity available.
I'm wondering if this is an indication of insufficient parallelism or if
it's latency of some sort.
I'm aware of the many tuning settings for SSD based OSDs, however I was
expecting to run into a CPU wall first and foremost.
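
One way to look at the parallelism versus latency question is Little's law (ops in flight = IOPS * mean latency). The sketch below assumes the 32 threads from "-t 32" are kept fully busy for the whole run, which may not hold exactly; the variable names are just for illustration.

```python
# Figures from the 4KB rados bench run described above.
concurrent_ops = 32   # -t 32, assumed to be fully in flight at all times
iops = 7200           # reported by rados bench
block_bytes = 4096    # -b 4096

# Little's law: mean latency = ops in flight / completion rate.
mean_latency_ms = concurrent_ops / iops * 1000

# Client-side bandwidth this IOPS figure corresponds to.
throughput_mib_s = iops * block_bytes / 2**20

print(f"mean latency per write: {mean_latency_ms:.2f} ms")
print(f"client throughput:      {throughput_mib_s:.1f} MiB/s")
```

Under that assumption each 4KB write spends several milliseconds in the stack, so raising -t and watching whether IOPS scale would help tell the two explanations apart.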


Write amplification:
10 second rados bench with 4MB blocks, 6348MB written in total.
NAND writes per SSD: 118*32MB = 3776MB.
30208MB total written to all 8 SSDs.
Amplification: 4.76

Reasonably close to the factor of 4 you would expect with a replication of
2 and journal on same disk.
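
For anyone wanting to repeat the calculation, here is a minimal sketch. It assumes, as in the figures above, that the per-SSD NAND write counter advances in units of 32MB and that all 8 SSDs saw the same delta; the function name is made up for illustration.

```python
UNIT_MB = 32  # size of one NAND-write counter unit, as used in the post


def write_amplification(client_mb, units_per_ssd, ssd_count, unit_mb=UNIT_MB):
    """Ratio of data written to NAND across all SSDs to data written by the client."""
    nand_mb = units_per_ssd * unit_mb * ssd_count
    return nand_mb / client_mb


# Baseline expectation: 2 replicas, each written twice (journal + data).
expected = 2 * 2

# 4MB run: 6348MB from the client, counter delta of 118 on each of 8 SSDs.
measured_4mb = write_amplification(6348, 118, 8)

print(f"expected: {expected}x, measured: {measured_4mb:.2f}x")
```

The same function can be fed the client total and counter delta of any other run.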


10 second rados bench with 4KB blocks, 219MB written in total.
NAND writes per SSD: 41*32MB = 1312MB.
10496MB total written to all 8 SSDs.
Amplification: 48!!!

Le ouch. 
In my use case with rbd cache on all VMs I expect writes to be rather
large for the most part and not like this extreme example. 
But as I wrote the last time I did this kind of testing, this is an area
where caveat emptor most definitely applies when planning and buying SSDs.
And where the Ceph code could probably do with some attention.
 
Regards,

Christian
-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


