Write endurance is kinda bullshit. We have Crucial 960GB drives storing data and we've only managed to take 2% off the drives' life in a year, with hundreds of TB written weekly. This stuff is way more durable than anyone gives it credit for.

----- Original Message -----
From: "Lionel Bouton" <lionel+ceph@xxxxxxxxxxx>
To: "Andrei Mikhailovsky" <andrei@xxxxxxxxxx>, "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
Sent: Tuesday, December 22, 2015 11:04:26 AM
Subject: Re: Intel S3710 400GB and Samsung PM863 480GB fio results

On 22/12/2015 13:43, Andrei Mikhailovsky wrote:
> Hello guys,
>
> Was wondering if anyone has done testing on the Samsung PM863 120GB version to see how it performs? IMHO the 480GB version seems like a waste for the journal, as you only need a small disk to fit 3-4 OSD journals. Unless you get a far greater durability.

The problem is endurance. If we use the 480GB model for 3 OSDs each on the cluster we might build, we expect 3 years (with some margin for error, but not including any write amplification at the SSD level) before the SSDs fail. In our context a 120GB model might not even last a year (its endurance is 1/4 that of the 480GB model).

This is why the SM863 models will probably be more suitable if you have access to them: you can use smaller drives which cost less and still get more endurance (you'll have to check the performance though; smaller models usually have lower IOPS and bandwidth).

> I am planning to replace my current journal SSDs over the next month or so and would like to find out if there is a good alternative to Intel's 3700/3500 series.

The 3700 series is a safe bet (the 100GB model is rated for ~1.8 PBW). The 3500 models probably don't have enough endurance for many Ceph clusters to be cost effective: the 120GB model is only rated for 70 TBW, and you have to account for both client writes and rebalance events.

I'm uneasy with SSDs expected to fail within the life of the system they are in: you can get a cascade effect where an SSD failure brings down several OSDs, triggering a rebalance which might make SSDs installed at the same time fail too. In the best scenario you will reach your min_size (>=2) and block writes, which would prevent further SSD failures until you move the journals to fresh SSDs. If min_size = 1 you might actually lose data.

If you expect to replace your current journal SSDs, I would make a staggered deployment over several months to a year so they don't all fail at the same time in case of an unforeseen problem. In addition, this lets you evaluate the performance and behavior of a new SSD model with your hardware (there have been reports of performance problems with some combinations of RAID controllers and SSD models/firmware versions) without impacting your cluster's overall performance too much.

When using SSDs for journals you have to monitor both:
* the wear leveling (or something equivalent) of each SSD - SMART data may not be available if you use a RAID controller, but you can usually get the total amount of data written,
* the client writes on the whole cluster.

Then check periodically how much lifespan is left for each of your SSDs based on its current state, the average write speed, the estimated write amplification (both from the pool's size parameter and the SSD model's inherent write amplification) and the amount of data you expect rebalance events to move.
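To make that concrete, here is a very rough back-of-the-envelope sketch of the kind of computation I mean, in Python. Every number in it is an illustrative placeholder, not data from a real cluster: plug in your own TBW rating, measured client write rate, pool size, number of journal SSDs and amplification estimate.

    # Rough journal-SSD lifespan estimate. All values below are illustrative
    # placeholders; replace them with your own ratings and measurements.
    tbw_rating_tb        = 70.0   # endurance rating of the SSD model (TB written)
    written_so_far_tb    = 5.0    # data already written to this SSD (SMART / RAID controller stats)
    client_writes_tb_day = 0.5    # average client writes on the whole cluster, per day
    rebalance_tb_year    = 20.0   # journal writes you expect from rebalance events, per year
    pool_size            = 3      # replication factor: each client write is journaled "size" times
    journal_ssds         = 6      # number of journal SSDs sharing the load
    ssd_write_amp        = 1.5    # estimated write amplification inherent to the SSD model

    # average TB written per day on one journal SSD
    per_ssd_tb_day = ((client_writes_tb_day * pool_size + rebalance_tb_year / 365.0)
                      / journal_ssds) * ssd_write_amp

    remaining_tb = tbw_rating_tb - written_so_far_tb
    print("expected remaining lifespan: %.1f years" % (remaining_tb / per_ssd_tb_day / 365.0))

Re-running this with the real write totals from SMART (or from the RAID controller) every few months is what I mean by checking periodically.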
Ideally you should make this computation before choosing the SSD models, but several variables are not always easy to predict and will probably change during the life of your cluster.

Lionel

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com