On Mon, 29 Feb 2016 02:15:28 -0500 (EST) Shinobu Kinjo wrote:

> Christian,
>
> > Ceph: no tuning or significant/relevant config changes, OSD FS is Ext4,
> > Ceph journal is inline (journal file).
>
> Quick question. Is there any reason you selected Ext4?
> https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg08619.html

XFS has historically always been slower for me, whenever I tested it.
Now with Ceph there are several optimizations for XFS (in the latest
versions, not when I started). However, there also were (near-lethal)
XFS bugs exposed by Ceph.

Lastly, XFS seems to fragment faster than Ext4, definitely when used as
an OSD FS. My badly overloaded old production cluster with 800000
files/objects per OSD has an e4defrag score of 11 (up to 30 is fine)
after running for nearly 2 years. My newer Ext4 OSDs are formatted so
that they have LARGE blocks, so the chance of fragmentation is even
lower. I managed to severely fragment my XFS based test cluster with far
less, purely synthetic usage.

Now the SSD based OSDs could have been formatted with XFS, I suppose, as
the last point doesn't apply to them, but I like consistency in my
setups.

Christian

> Cheers,
> Shinobu
>
> ----- Original Message -----
> From: "Christian Balzer" <chibi@xxxxxxx>
> To: ceph-users@xxxxxxxxxxxxxx
> Sent: Thursday, February 25, 2016 12:10:41 PM
> Subject: Observations with a SSD based pool under Hammer
>
> Hello,
>
> For posterity and of course to ask some questions, here are my
> experiences with a pure SSD pool.
>
> SW: Debian Jessie, Ceph Hammer 0.94.5.
>
> HW:
> 2 nodes (thus replication of 2), each with:
> 2x E5-2623 CPUs
> 64GB RAM
> 4x DC S3610 800GB SSDs
> Infiniband (IPoIB) network
>
> Ceph: no tuning or significant/relevant config changes, OSD FS is Ext4,
> Ceph journal is inline (journal file).
>
> Performance:
> A test run with "rados -p cache bench 30 write -t 32" (4MB blocks) gives
> me about 620MB/s; the storage nodes are I/O bound (all SSDs are 100% busy
> according to atop) and this meshes nicely with the speeds I saw when
> testing the individual SSDs with fio before involving Ceph.
>
> To elaborate on that, an individual SSD of that type can do about 500MB/s
> of sequential writes, so ideally you would see 1GB/s of writes with Ceph
> (500 * 8 / 2 (replication) / 2 (journal on same disk)).
> However, my experience tells me that other activities (FS journals,
> leveldb PG updates, etc.) impact things as well.
>
> A test run with "rados -p cache bench 30 write -t 32 -b 4096" (4KB
> blocks) gives me about 7200 IOPS; the SSDs are about 40% busy.
> All OSD processes are using about 2 cores and the OS another 2, but that
> leaves about 6 cores unused (MHz on all cores scales to max during the
> test run).
> Closer inspection with all CPUs being displayed in atop shows that no
> single core is fully used; they all average around 40% and even the
> busiest ones (handling IRQs) still have ample capacity available.
> I'm wondering if this is an indication of insufficient parallelism or if
> it's latency of sorts.
> I'm aware of the many tuning settings for SSD based OSDs, however I was
> expecting to run into a CPU wall first and foremost.
>
> Write amplification:
> 10 second rados bench with 4MB blocks, 6348MB written in total.
> nand-writes per SSD: 118*32MB = 3776MB.
> 30208MB total written to all SSDs.
> Amplification: 4.75
>
> Very close to what you would expect with a replication of 2 and the
> journal on the same disk.

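As a quick sanity check, the arithmetic above can be written out as a
small Python sketch. This is only an illustration: it assumes the
per-SSD nand-write figures come from the drives' SMART wear counters
(reported in 32MB units, as in the numbers above), and the helper names
are made up for this example.

# Back-of-the-envelope helpers for the figures above (illustrative only).
# Assumptions: 8 SSDs in total (2 nodes x 4), replication of 2, journal
# inline on the same disk, nand-write counters read in 32MB units.

NUM_SSDS = 8
REPLICATION = 2
JOURNAL_FACTOR = 2  # journal and data share the same SSD


def ideal_write_bandwidth(per_ssd_mb_s=500):
    """Best-case client bandwidth: raw SSD write speed across all disks,
    divided by replication and by the inline-journal double write."""
    return per_ssd_mb_s * NUM_SSDS / float(REPLICATION * JOURNAL_FACTOR)


def write_amplification(client_mb, nand_units_per_ssd, unit_mb=32):
    """Total nand writes across all SSDs relative to what rados bench
    reported as written by the client."""
    total_nand_mb = nand_units_per_ssd * unit_mb * NUM_SSDS
    return total_nand_mb / float(client_mb)


print(ideal_write_bandwidth())          # 1000.0 MB/s, the 1GB/s above
print(write_amplification(6348, 118))   # roughly 4.75, the 4MB-block run

Plugging the 4KB-block run quoted next (219MB written, 41 nand-write
units per SSD) into the same helper gives the amplification of roughly
48 shown below.
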
>
> 10 second rados bench with 4KB blocks, 219MB written in total.
> nand-writes per SSD: 41*32MB = 1312MB.
> 10496MB total written to all SSDs.
> Amplification: 48!!!
>
> Le ouch.
> In my use case, with rbd cache on all VMs, I expect writes to be rather
> large for the most part and not like this extreme example.
> But as I wrote the last time I did this kind of testing, this is an area
> where caveat emptor most definitely applies when planning and buying
> SSDs. And where the Ceph code could probably do with some attention.
>
> Regards,
>
> Christian

-- 
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com