Hi,

comments below.

> On 01 Sep 2015, at 18:08, Jelle de Jong <jelledejong@xxxxxxxxxxxxx> wrote:
>
> Hi Jan,
>
> I am building two new clusters for testing. I have been reading your
> messages on the mailing list for a while now and want to thank you
> for your support.
>
> I can redo all the numbers, but is your question to run all the
> tests again with [hdparm -W 1 /dev/sdc]? Please tell me what else
> you would like to see tested, and with which commands.
>

Probably not necessary - I figured most of it out already.

> My experience was that enabling disk cache causes about a 45%
> performance drop, iops=25690 vs iops=46185
>

This very much depends on the controller - low-end "smart" HBAs (like
some RAID controllers) are limited in their clock speed, so while
they can handle relatively large throughput, they also introduce
latency, and that translates into fewer IOPS. Toggling the write
cache under a synchronous workload typically changes the IOPS by a
factor of about 2, which is almost exactly the difference you are
seeing. See my comment below on power loss protection.

> I am going to test DT01ACA300 vs WD1003FBYZ disks with SV300S37A
> SSDs in my other two three-node ceph clusters.
>
> What is your advice on making hdparm and possible scheduler (noop)
> changes persistent (cmd in rc.local or special udev rules,
> examples?)
>

We do that with Puppet, which runs every few minutes. Use whatever
tool you have. The correct way is via udev on hotplug, since that
eliminates any window where the setting is wrong, but the details are
slightly distribution-specific (a rough sketch is further down in
this mail). You can also pass "elevator=noop" on the kernel command
line, which makes noop the default for all devices, and then re-set
the scheduler only for your non-OSD drives, which are not likely to
be hotplugged... not a 100% solution if a drive is replaced, though.

See more comments below.

> Kind regards,
>
> Jelle de Jong
>
>
> On 23/06/15 12:41, Jan Schermer wrote:
>> Those are interesting numbers - can you rerun the test with write
>> cache enabled this time? I wonder how much your drop will be…
>>
>> thanks
>>
>> Jan
>>
>>> On 18 Jun 2015, at 17:48, Jelle de Jong <jelledejong@xxxxxxxxxxxxx> wrote:
>>>
>>> Hello everybody,
>>>
>>> I thought I would share the benchmarks from these four SSDs I
>>> tested (see attachment)
>>>
>>> I do still have some questions:
>>>
>>> #1 * Data Set Management TRIM supported (limit 1 block)
>>> vs
>>> * Data Set Management TRIM supported (limit 8 blocks)
>>> and how this affects Ceph, and also how can I test that TRIM is
>>> actually working and not corrupting data.

This by itself means nothing; it just says how many blocks can be
TRIMmed in one op. The trimming itself can be fast or slow, and I
have not seen a clear correlation between these TRIM parameters and
the speed of TRIM itself.

Ceph doesn't trim anything; that is a job for the filesystem. I
recommend _disabling_ filesystem trimming (the "discard" mount
option) because it causes a big overhead on writes. It is much better
to schedule a daily/weekly fstrim cron job, and even then only
discard large blocks (fstrim -m 131072 or similar - see the sketch
below). fstrim can in some cases cause the SSD to pause I/O for a
significant amount of time, so test how it behaves after filling and
erasing the drive. However, TRIM is not really that necessary with
modern SSDs, unless you want to squeeze out more endurance than the
drive is rated for, and then it should be combined with
under-provisioning or simply partitioning only part of the drive
(both after a one-time TRIM!).
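The cron job can be as simple as this - a minimal sketch, assuming
the OSD filesystems are mounted under /var/lib/ceph/osd/ (the script
name, log path and mountpoints are placeholders; keep whatever -m
value suits your drives):

  #!/bin/sh
  # /etc/cron.weekly/fstrim-osd (illustrative path)
  # Discard only free extents of at least 131072 bytes so the run
  # stays short; -v logs how much was actually trimmed.
  for fs in /var/lib/ceph/osd/*; do
      fstrim -m 131072 -v "$fs" >> /var/log/fstrim.log 2>&1
  done

Make it executable and do one run by hand first, so you see how long
the drives stall while trimming.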
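And the udev variant for the persistence question above - again just
a sketch, not tested on your distribution. The rule file name and the
model string are placeholders (check yours with
"cat /sys/block/sdc/device/model"), and use -W 0 or -W 1 depending on
which cache setting you settled on. Note that each udev rule has to
stay on a single line:

  # /etc/udev/rules.d/99-osd-tuning.rules (illustrative path)
  # On add/change: set the noop elevator and the write cache policy for
  # the journal SSDs, matched by model so other drives are untouched.
  ACTION=="add|change", KERNEL=="sd?", ATTRS{model}=="KINGSTON SV300S37A*", ATTR{queue/scheduler}="noop", RUN+="/sbin/hdparm -W 1 /dev/%k"

Reload with "udevadm control --reload-rules" and check what it would
do with "udevadm test /sys/block/sdc" before trusting it.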
>>> #2 are there other things I should test to compare SSDs for Ceph
>>> journals

Test whether the drive's performance is consistent. Some drives fill
the "cache" part of the NAND (sometimes SLC or eMLC) and then the
throughput drops significantly. Some drives perform garbage
collection that causes periodic spikes/drops in performance, and some
drives simply slow down when many blocks are dirty... For example, I
have been running a fio job on an Intel S3610 almost non-stop for the
past week, and the performance increased from 17K IOPS to 21K IOPS
:-) Samsung drives also sped up over time.

If you have a database server that performs many transactions and it
is possible to put the SSD in there, do that - I can't think of a
better test; you will know how the drive behaves after a few weeks.
You can google various fio jobs simulating workloads, or you can
write your own scripts - fio is very powerful, but it is still only a
synthetic test. (A journal-style fio job is sketched at the end of
this mail.)

>>> #3 are the power loss security mechanisms on SSDs relevant in Ceph
>>> when configured in a way that a full node can fully die and a
>>> power loss of all nodes at the same time should not be possible
>>> (or has an extremely low probability)

Depends :-)

When the system asks the SSD to flush data, the SSD should always
flush it to the platter/NAND. The standard doesn't allow exceptions*,
even for devices with non-volatile cache. In practice, controllers
and SANs with non-volatile cache ignore flushes unless specifically
told not to, and that is what gives them the performance we expect.
Some SSDs and many consumer HDDs ignore flushes even though they only
have volatile cache - this can and will cause data loss when the
server goes down. In this respect, SSDs with power loss protection
will probably ignore flushes as well, but I haven't been able to
reliably test that. I don't think we need to worry about it, really.

Supposedly, when the controller/SAN/SSD has a non-volatile cache, we
can disable barriers and get a nice speed boost for the filesystem -
I'm not sure that is the case, though. Disabling barriers not only
stops the OS from sending flushes (which would be ignored anyway), it
also stops the OS from flushing the page cache and dirty blocks to
the disk (cache) at all, and it can cause writes to be reordered. If
the power goes down, the SSD cache is intact, but some writes may not
even have reached it. The safest well-performing combination is thus
leaving barriers/flushes on and having a non-volatile cache on the
drives.

TL;DR - don't buy SSDs without power loss protection for servers.

*someone correct me if I'm wrong, I haven't looked at that for a
while :)

>>> #4 how to benchmark the OSD (disk + ssd-journal) combination so I
>>> can compare them.

Someone else should answer that :-) rados bench, fio in a VM, fio on
an RBD device... (some starting points are at the very end of this
mail)

>>>
>>> I got some other benchmark questions, but I will make a separate
>>> mail for them.
>>>
>>> Kind regards,
>>>
>>> Jelle de Jong
>>> <setup-ceph01-ceph-ssd-benchmark.txt>
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users@xxxxxxxxxxxxxx
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
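PS: the fio job I had in mind under #2 - for a journal the number
that matters is small synchronous writes at queue depth 1. A minimal
sketch, assuming /dev/sdc is an *empty* SSD you can destroy (it
writes to the raw device!):

  # O_DIRECT, one 4k synchronous write in flight - journal-like load.
  # time_based plus the IOPS log let you watch for GC dips over hours.
  fio --name=journal-test --filename=/dev/sdc --direct=1 --sync=1 \
      --rw=write --bs=4k --numjobs=1 --iodepth=1 \
      --runtime=3600 --time_based \
      --write_iops_log=journal-test --log_avg_msec=1000

Let it run for hours (or days) and graph the log file - consistent
drives stay flat, the bad ones saw-tooth.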
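And for #4, two starting points - the pool and image names are made
up, and the rbd ioengine needs fio built with rbd support:

  # Raw cluster speed: write for 60s, keep the objects, read them back
  rados bench -p testpool 60 write -t 16 --no-cleanup
  rados bench -p testpool 60 seq -t 16

  # Through the RBD layer, closer to what a VM would see
  fio --name=rbd-test --ioengine=rbd --clientname=admin \
      --pool=testpool --rbdname=testimg \
      --rw=randwrite --bs=4k --iodepth=32 --runtime=60 --time_based

Neither is perfect, but run consistently on both clusters they at
least give you comparable numbers.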