Re: SSD test results with Plextor M6 Pro, HyperX Fury, Kingston V300, ADATA SP90

Hello,

On Thu, 18 Jun 2015 17:48:12 +0200 Jelle de Jong wrote:

> Hello everybody,
> 
> I thought I would share the benchmarks from these four SSDs I tested
> (see attachment)
> 

None of these are DC-level SSDs of course, though the HyperX at least
supposedly can handle 2.5 DWPD.
Alas, that info is only in the PDF, not in the web page specifications, and
that PDF also says "not for servers, no siree".
Which can mean a lot of things, the worst being something like going
_very_ slow when doing housekeeping or the like.

> I do still have some questions:
> 
> #1     *    Data Set Management TRIM supported (limit 1 block)
>     vs
>        *    Data Set Management TRIM supported (limit 8 blocks)
> and how this affects Ceph, and also how I can test if TRIM is actually
> working and not corrupting data.
> 

I would not deploy any SSDs that actually require TRIM to maintain their
speed or TBW endurance. 
And I wouldn't want Ceph to do TRIMs due to the corruption issues you are
already aware of.
And last but not least, TRIM makes little to no sense with Ceph journals.
These are raw partitions, so Ceph itself would need to issue the TRIM
commands, and since they are constantly being overwritten, trimming them
would surely be detrimental to performance.
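
If you do want to verify TRIM behaviour, do it on a scratch drive, not one
holding live data. A rough sketch (device, mount point and LBA below are
placeholders, and it assumes the drive returns zeroes after TRIM):

  # What the drive advertises; this is where the "limit N blocks" line
  # from your mail comes from:
  hdparm -I /dev/sdX | grep -i trim

  # Write a file, note its first LBA, delete it, trim the filesystem,
  # then read that sector back raw. All zeroes means the discard
  # actually happened.
  dd if=/dev/urandom of=/mnt/scratch/trimtest bs=1M count=1 oflag=direct
  sync
  hdparm --fibmap /mnt/scratch/trimtest      # note the begin_LBA
  rm /mnt/scratch/trimtest && sync
  fstrim -v /mnt/scratch
  hdparm --read-sector <begin_LBA> /dev/sdX  # zeroes => TRIM worked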

> #2 are there other things I should test to compare SSDs for Ceph
> journals?
> 
TBW/$. I couldn't find the endurance data for the Plextor at all.
I have a cluster with journal SSDs that see an average of 2MB/s of writes,
so in 5 years that comes to 315TB, just shy of the 354TB the 128GB HyperX
promises.
First rule of engineering: overspec by at least 100%, so the 240GB model
would be a fit, if one were to use such drives in the first place.
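
For reference, the arithmetic behind that 315TB figure (decimal TB):

  # 2MB/s around the clock for 5 years:
  echo '2 * 3600 * 24 * 365 * 5 / 1000000' | bc
  # -> 315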

> #3 are the power loss security mechanisms on SSD relevant in Ceph when
> configured in a way that a full node can fully die and that a power loss
> of all nodes at the same time should not be possible (or has an extremely
> low probability)
> 
A full node death is often something you can recover from much faster than
a dead OSD (usually no data loss, just reboot it), and if Ceph is configured
correctly (mon_osd_down_out_subtree_limit = host) it comes back with very
little impact.
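
For the record, that is just this in ceph.conf (a sketch):

  [mon]
  # Don't automatically mark OSDs "out" when a whole host goes down,
  # so a rebooted node rejoins without triggering a full rebalance.
  mon osd down out subtree limit = host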

If your journals are hosed because of a power loss, all the associated
OSDs are dead until you either recreate the journal (if possible) or in
the worst case (OSD HDD also hosed) the entire OSD.

That said, I personally consider total power loss scenarios in the DCs we
use to be very, very unlikely as well. Others here will strongly disagree
with that, based on their experience.
Ultimately, that doesn't stop folks from accidentally powering off or
unplugging servers.
And I have seen SSDs w/o power loss protection getting hosed in such
scenarios while ones with it had no issues.

> #4 how to benchmark the OSD (disk + SSD journal) combination so I can
> compare them.
> 
There are plenty of examples in the archives, from rados bench to fio with
the rbd ioengine to running fio in a VM (for most people the most realistic
test). Block size will of course have a dramatic impact on throughput,
IOPS and CPU utilization.
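
To give you a starting point, something along these lines (pool and image
names are just examples, and the RBD image has to exist first):

  # rados bench against a throwaway pool, 4M and 4K object writes:
  rados bench -p bench-pool 60 write -b 4194304 -t 16
  rados bench -p bench-pool 60 write -b 4096 -t 16

  # fio with the rbd ioengine:
  fio --name=rbd-4k --ioengine=rbd --clientname=admin --pool=bench-pool \
      --rbdname=bench-img --rw=randwrite --bs=4k --iodepth=32 \
      --runtime=60 --time_based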

The fio and dd tests you did are an indication of the capabilities of
those SSDs; those numbers, however, don't translate directly to Ceph.

Also, once your SSDs are fast enough to ACK things in a timely fashion,
your HDDs will become the bottleneck with persistent loads.

For example, in my cluster with 2 journals per SSD (DC S3700 100GB), a fio
run with 4K blocks will quickly get the CPUs sweating, the HDDs to 100%
utilization and the SSDs to about 10%.
However, with 4M blocks the CPUs are nearly bored, the HDDs are of course
at about 100% and the SSDs go up to about 40% (they are approaching their
throughput/bandwidth limit of 200MB/s, not their IOPS limit). With rados
bench I can push the SSDs to 70%, which is one of the reasons I postulate
that HDDs (of the 7.2K RPM SATA persuasion) won't be doing much over
80MB/s in the best case scenario when being used as OSDs.
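
To watch that kind of utilization yourself during a run, the %util column
of iostat (or atop) is the number to look at:

  # extended device stats every 5 seconds, %util is the last column:
  iostat -x 5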

Regards,

Christian

> I got some other benchmark questions, but I will make a separate mail
> for them.
> 
> Kind regards,
> 
> Jelle de Jong


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


