Re: Observations with a SSD based pool under Hammer

On 02/29/2016 02:37 AM, Christian Balzer wrote:
On Mon, 29 Feb 2016 02:15:28 -0500 (EST) Shinobu Kinjo wrote:

Christian,

Ceph: no tuning or significant/relevant config changes, OSD FS is Ext4,
Ceph journal is inline (journal file).

Quick question. Is there any reason you selected Ext4?

https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg08619.html

XFS has historically always been slower for me whenever I tested it.
Ceph now has several optimizations for XFS (in the latest versions, not
when I started).
However, there were also (near-lethal) XFS bugs exposed by Ceph.

Lastly, XFS seems to fragment faster than Ext4, at least when used as an OSD
FS.
My badly overloaded old production cluster, with 800,000 files/objects per
OSD, has an e4defrag score of 11 (anything up to 30 is fine) after running
for nearly 2 years.
My newer Ext4 OSDs are formatted so that they have LARGE blocks, making
fragmentation even less likely.
In contrast, I managed to severely fragment my XFS based test cluster with
far less, purely synthetic, usage.
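For reference, the fragmentation numbers above are the kind of thing the
filesystems' own tools report; a minimal sketch (the OSD path and device
names are placeholders, not taken from this thread):

    # Ext4: report a fragmentation score for an OSD's data directory
    e4defrag -c /var/lib/ceph/osd/ceph-0
    # XFS: report the fragmentation factor of the underlying device (read-only)
    xfs_db -r -c frag /dev/sdX1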

You may see a lot less fragmentation with filestore_xfs_extsize set to true;
we did back when we were testing it for Hammer. The problem is that on one
of the test clusters inside RH it was causing a sequential write throughput
regression vs Firefly. It doesn't really make sense why that would be, but
after bisecting and narrowing it down to the commit that enabled it, it was
pretty clearly the cause.
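For anyone wanting to try that, it is an OSD-side option; a minimal
ceph.conf sketch, assuming the stock option name:

    [osd]
        # ask the filestore to set the XFS extsize hint on object files,
        # which is what cuts down fragmentation on XFS-backed OSDs
        filestore xfs extsize = true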

Mark


Now the SSD based OSDs could have been formatted with XFS, I suppose, as
the last point doesn't apply to them, but I like consistency in my setups.


Christian

Cheers,
Shinobu

----- Original Message -----
From: "Christian Balzer" <chibi@xxxxxxx>
To: ceph-users@xxxxxxxxxxxxxx
Sent: Thursday, February 25, 2016 12:10:41 PM
Subject:  Observations with a SSD based pool under Hammer


Hello,

For posterity and of course to ask some questions, here are my
experiences with a pure SSD pool.

SW: Debian Jessie, Ceph Hammer 0.94.5.

HW:
2 nodes (thus a replication of 2), each with:
2x E5-2623 CPUs
64GB RAM
4x DC S3610 800GB SSDs
Infiniband (IPoIB) network

Ceph: no tuning or significant/relevant config changes, OSD FS is Ext4,
Ceph journal is inline (journal file).

Performance:
A test run with "rados -p cache bench 30 write -t 32" (4MB blocks) gives
me about 620MB/s; the storage nodes are I/O bound (all SSDs are 100% busy
according to atop), and this meshes nicely with the speeds I saw when
testing the individual SSDs with fio before involving Ceph.
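For reference, a minimal fio sketch of the kind of single-SSD baseline test
meant above (device name and parameters are illustrative assumptions, not
the exact invocation used here):

    # sequential 4MB writes for 30s, straight to the raw device (destroys data!)
    fio --name=ssd-seq-write --filename=/dev/sdX --direct=1 \
        --ioengine=libaio --rw=write --bs=4M --iodepth=32 \
        --runtime=30 --time_based --group_reporting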

To elaborate on that, an individual SSD of that type can do about 500MB/s
sequential writes, so ideally you would see 1GB/s writes with Ceph
(500MB/s * 8 SSDs / 2 for replication / 2 for the journal on the same disk).
However, my experience tells me that other activities (FS journals,
leveldb PG updates, etc.) impact things as well.
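Spelling that expectation out with the numbers above:

    500 MB/s per SSD * 8 SSDs     = 4000 MB/s raw
    / 2 (replication)             = 2000 MB/s
    / 2 (journal on the same SSD) = 1000 MB/s theoretical ceiling
    vs. ~620 MB/s actually observed, i.e. roughly 60% of that ceiling.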

A test run with "rados -p cache bench 30 write -t 32 -b 4096" (4KB
blocks) gives me about 7200 IOPS; the SSDs are about 40% busy.
All OSD processes are using about 2 cores and the OS another 2, but that
leaves about 6 cores unused (the frequency on all cores scales to max
during the test run).
Closer inspection, with all CPUs displayed in atop, shows that no single
core is fully used; they all average around 40%, and even the busiest ones
(handling IRQs) still have ample capacity available.
I'm wondering whether this is an indication of insufficient parallelism or
of latency of some sort.
I'm aware of the many tuning settings for SSD based OSDs; however, I was
expecting to run into a CPU wall first and foremost.
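For context, these are the kind of Hammer-era knobs usually meant by
"tuning settings for SSD based OSDs"; a minimal ceph.conf sketch with
purely illustrative values (not settings used or recommended in this
thread):

    [osd]
        # more worker threads to keep fast flash busy
        osd op threads = 8
        filestore op threads = 8
        # let the filestore/journal batch more work between syncs
        filestore max sync interval = 10
        journal max write entries = 1000
        journal max write bytes = 1048576000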


Write amplification:
10 second rados bench with 4MB blocks, 6348MB written in total.
NAND writes per SSD: 118 * 32MB = 3776MB.
3776MB * 8 SSDs = 30208MB total written to all SSDs.
Amplification: 4.75

Very close to what you would expect with a replication of 2 and journal
on same disk.
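The per-SSD NAND write counters used here are the kind of thing the drives
expose via SMART; a minimal sketch of reading them (attribute names and
numbers vary by model and smartmontools version, so treat the details as an
assumption rather than the exact method used above):

    # dump the vendor SMART attributes of one SSD
    smartctl -A /dev/sda
    # the write counter used above is evidently reported in 32MB units,
    # hence the "* 32MB" in the per-SSD totals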


10 second rados bench with 4KB blocks, 219MB written in total.
NAND writes per SSD: 41 * 32MB = 1312MB.
1312MB * 8 SSDs = 10496MB total written to all SSDs.
Amplification: 48!!!

Le ouch.
In my use case, with rbd cache enabled on all VMs, I expect writes to be
rather large for the most part, not like this extreme example.
But as I wrote the last time I did this kind of testing, this is an area
where caveat emptor most definitely applies when planning and buying
SSDs, and one where the Ceph code could probably do with some attention.
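For context, "rbd cache" here is the client-side RBD cache; a minimal
ceph.conf sketch of enabling it (stock option names, illustrative setup):

    [client]
        rbd cache = true
        # stay in writethrough mode until the guest sends its first flush,
        # then switch to writeback
        rbd cache writethrough until flush = true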

Regards,

Christian


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


