Re: Observations with a SSD based pool under Hammer

On Mon, 29 Feb 2016 02:15:28 -0500 (EST) Shinobu Kinjo wrote:

> Christian,
> 
> > Ceph: no tuning or significant/relevant config changes, OSD FS is Ext4,
> > Ceph journal is inline (journal file).
> 
> Quick question. Is there any reason you selected Ext4?
> 
https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg08619.html

XFS has historically been slower for me whenever I tested it.
Ceph now has several XFS-specific optimizations (in the latest versions,
not when I started).
However, there have also been (near-lethal) XFS bugs exposed by Ceph.

Lastly, XFS seems to fragment faster than Ext4, definitely when used as an
OSD FS.
My badly overloaded old production cluster with 800,000 files/objects per
OSD has an e4defrag score of 11 (up to 30 is fine) after running for nearly
2 years.
My newer Ext4 OSDs are formatted so that they have LARGE blocks, so the
chance of fragmentation is even lower.
I managed to severely fragment my XFS based test cluster with far less
(and purely synthetic) usage.
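
To illustrate the kind of thing I mean (a sketch only; the exact mkfs
options depend on your object sizes, and the OSD path below is just an
example):

  # fragmentation score; 0-30 means no defrag needed
  e4defrag -c /var/lib/ceph/osd/ceph-0

  # larger allocation units at mkfs time, e.g. bigalloc clusters plus a
  # large-file inode ratio
  mkfs.ext4 -T largefile -O bigalloc -C 64k /dev/sdX1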

Now, the SSD-based OSDs could have been formatted with XFS, I suppose, as
the last point doesn't apply to them, but I like consistency in my setups.

Christian

> Cheers,
> Shinobu
> 
> ----- Original Message -----
> From: "Christian Balzer" <chibi@xxxxxxx>
> To: ceph-users@xxxxxxxxxxxxxx
> Sent: Thursday, February 25, 2016 12:10:41 PM
> Subject:  Observations with a SSD based pool under Hammer
> 
> 
> Hello, 
> 
> For posterity and of course to ask some questions, here are my
> experiences with a pure SSD pool.
> 
> SW: Debian Jessie, Ceph Hammer 0.94.5.
> 
> HW:
> 2 nodes (thus replication of 2), each with:
> 2x E5-2623 CPUs
> 64GB RAM
> 4x DC S3610 800GB SSDs
> Infiniband (IPoIB) network
> 
> Ceph: no tuning or significant/relevant config changes, OSD FS is Ext4,
> Ceph journal is inline (journal file).
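> 
> ("Inline" here simply means the stock journal file on each OSD's own Ext4
> filesystem, i.e. roughly the defaults:
> 
>   [osd]
>       # stock journal file location and size (in MB)
>       osd journal = /var/lib/ceph/osd/ceph-$id/journal
>       osd journal size = 5120
> 
> rather than a separate journal partition or device.)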
> 
> Performance:
> A test run with "rados -p cache  bench 30 write -t 32" (4MB blocks) gives
> me about 620MB/s; the storage nodes are I/O bound (all SSDs are 100% busy
> according to atop), and this meshes nicely with the speeds I saw when
> testing the individual SSDs with fio before involving Ceph.
> 
> To elaborate on that, an individual SSD of that type can do about 500MB/s
> of sequential writes, so ideally you would see 1GB/s writes with Ceph
> (500 * 8 / 2 (replication) / 2 (journal on same disk)).
> However my experience tells me that other activities (FS journals,
> leveldb PG updates, etc) impact things as well.
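> 
> Spelled out, that ceiling works out as follows (assuming all 8 SSDs
> contribute equally and the inline journal doubles every write):
> 
>   8 SSDs * 500MB/s sequential    = 4000MB/s raw
>   / 2 (replication)              = 2000MB/s
>   / 2 (journal on the same SSD)  = 1000MB/s theoretical best case
>   observed ~620MB/s              = roughly 60% of that ceiling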
> 
> A test run with "rados -p cache  bench 30 write -t 32 -b 4096" (4KB
> blocks) gives me about 7200 IOPS; the SSDs are about 40% busy.
> All OSD processes are using about 2 cores and the OS another 2, but that
> leaves about 6 cores unused (the MHz on all cores scales to max during
> the test run).
> Closer inspection with all CPUs displayed in atop shows that no single
> core is fully used; they all average around 40%, and even the busiest
> ones (handling IRQs) still have ample capacity available.
> I'm wondering if this is an indication of insufficient parallelism or if
> it's latency of some sort.
> I'm aware of the many tuning settings for SSD based OSDs; however, I was
> expecting to run into a CPU wall first and foremost.
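> 
> For reference, these are the sharding knobs that usually come up in this
> context; the values below are just the defaults as I recall them, so
> treat this as a sketch rather than a recommendation:
> 
>   [osd]
>       # default: 5 work queues per OSD
>       osd op num shards = 5
>       # default: 2 worker threads per queue
>       osd op num threads per shard = 2
> 
> Raising these is supposed to improve parallelism on fast OSDs, but the
> numbers above don't say whether that is actually the bottleneck here.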
> 
> 
> Write amplification:
> 10 second rados bench with 4MB blocks, 6348MB written in total.
> NAND writes per SSD: 118 * 32MB = 3776MB.
> 30208MB total written to all SSDs.
> Amplification: 4.75
> 
> Very close to what you would expect with a replication of 2 and journal
> on same disk.
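> 
> To make the arithmetic explicit (NAND writes taken from the per-SSD NAND
> write counters, which these Intels report in 32MB units):
> 
>   118 counter increments * 32MB    = 3776MB to NAND per SSD
>   3776MB * 8 SSDs                  = 30208MB to NAND in total
>   30208MB / 6348MB client writes   = ~4.75x
>   2 (replication) * 2 (journal)    = 4x expected floor; the remainder is
>                                      FS journal, leveldb and other metadata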
> 
> 
> 10 second rados bench with 4KB blocks, 219MB written in total.
> NAND writes per SSD: 41 * 32MB = 1312MB.
> 10496MB total written to all SSDs.
> Amplification: 48!!!
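> 
> Put differently (219MB at 4KB per op is roughly 56000 client writes):
> 
>   41 counter increments * 32MB * 8 SSDs  = 10496MB to NAND
>   10496MB / 219MB client writes          = ~48x
>   10496MB / ~56000 ops                   = ~190KB of flash writes per
>                                            4KB client write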
> 
> Le ouch. 
> In my use case, with rbd cache enabled on all VMs, I expect writes to be
> rather large for the most part, not like this extreme example.
> But as I wrote the last time I did this kind of testing, this is an area
> where caveat emptor most definitely applies when planning and buying
> SSDs. And one where the Ceph code could probably do with some attention.
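> 
> For completeness, the client side bits that let the rbd cache coalesce
> small writes into larger ones (the values shown are the stock defaults as
> far as I know, listed only for illustration, not as tuning advice):
> 
>   [client]
>       rbd cache = true
>       rbd cache writethrough until flush = true
>       # 32MB and 24MB respectively
>       rbd cache size = 33554432
>       rbd cache max dirty = 25165824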
>  
> Regards,
> 
> Christian


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


