I would also add that journal activity is write intensive, so a small part of the drive gets excessive writes if the journal and data are co-located on an SSD. The same applies when a single SSD holds the journals for many HDDs.

-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Wido den Hollander
Sent: Tuesday, December 22, 2015 11:46 AM
To: ceph-users@xxxxxxxxxxxxxx
Subject: Re: Intel S3710 400GB and Samsung PM863 480GB fio results

On 12/22/2015 05:36 PM, Tyler Bishop wrote:
> Write endurance is kinda bullshit.
>
> We have Crucial 960GB drives storing data and we've only managed to take 2% off the drives' life over a year, with hundreds of TB written weekly.
>
> Stuff is way more durable than anyone gives it credit.
>

No, that is absolutely not true. I've seen multiple SSDs fail in Ceph clusters. Small Samsung 850 Pro SSDs wore out within 4 months in heavy write-intensive Ceph clusters.

> ----- Original Message -----
> From: "Lionel Bouton" <lionel+ceph@xxxxxxxxxxx>
> To: "Andrei Mikhailovsky" <andrei@xxxxxxxxxx>, "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
> Sent: Tuesday, December 22, 2015 11:04:26 AM
> Subject: Re: Intel S3710 400GB and Samsung PM863 480GB fio results
>
> On 22/12/2015 13:43, Andrei Mikhailovsky wrote:
>> Hello guys,
>>
>> Was wondering if anyone has done testing on the Samsung PM863 120GB version to see how it performs? IMHO the 480GB version seems like a waste for the journal, as you only need a small disk to fit 3-4 OSD journals, unless you get far greater durability.
>
> The problem is endurance. If we use the 480GB model for 3 OSDs each on the cluster we might build, we expect 3 years (with some margin for error, but not including any write amplification at the SSD level) before the SSDs fail.
> In our context a 120GB model might not even last a year (its endurance is a quarter that of the 480GB model). This is why the SM863 models will probably be more suitable if you have access to them: you can use smaller ones which cost less and offer more endurance (check the performance though; smaller models usually have lower IOPS and bandwidth).
>
>> I am planning to replace my current journal SSDs over the next month or so and would like to find out if there is a good alternative to Intel's 3700/3500 series.
>
> The 3700 is a safe bet (the 100GB model is rated for ~1.8 PBW). The 3500 models probably don't have enough endurance for many Ceph clusters to be cost effective: the 120GB model is only rated for 70 TBW, and you have to account for both client writes and rebalance events.
> I'm uneasy with SSDs expected to fail within the life of the system they are in: you can get a cascade effect where one SSD failure brings down several OSDs, triggering a rebalance which may make SSDs installed at the same time fail too. In the best scenario you will reach your min_size (>=2) and writes will be blocked, which prevents further SSD failures until you move the journals to fresh SSDs. If min_size = 1 you might actually lose data.
>
> If you expect to replace your current journal SSDs, I would stagger the deployment over several months to a year so that they don't all fail at the same time if an unforeseen problem turns up. In addition, this lets you evaluate the performance and behavior of a new SSD model with your hardware (there have been reports of performance problems with some combinations of RAID controllers and SSD models/firmware versions) without impacting your cluster's overall performance too much.
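To put rough numbers on the endurance Lionel describes, here is a back-of-the-envelope sketch in Python. The TBW ratings are the ones quoted above (~1800 TBW for the S3700 100GB, 70 TBW for the S3500 120GB); the cluster write rate, OSD count, journals per SSD and fudge factors are made-up placeholders you would replace with your own figures.

# Back-of-the-envelope lifetime estimate for a journal SSD.
def journal_ssd_lifetime_years(tbw_rating_tb,
                               cluster_client_write_mb_s,
                               num_osds,
                               journals_per_ssd,
                               pool_size,
                               ssd_write_amplification=1.0,
                               rebalance_factor=1.2):
    # Each client write is stored 'pool_size' times and every copy passes
    # through its OSD's journal, so the journals collectively absorb
    # client rate * pool_size.  Take one OSD's share, multiply by the
    # journals sharing this SSD, and inflate by SSD-level write
    # amplification plus a factor for rebalance traffic.
    per_osd_mb_s = cluster_client_write_mb_s * pool_size / float(num_osds)
    ssd_mb_s = (per_osd_mb_s * journals_per_ssd
                * ssd_write_amplification * rebalance_factor)
    tb_per_year = ssd_mb_s * 3600 * 24 * 365 / 1e6
    return tbw_rating_tb / tb_per_year

# TBW figures quoted in this thread; the cluster figures are placeholders.
for name, tbw in (("Intel S3700 100GB", 1800), ("Intel S3500 120GB", 70)):
    years = journal_ssd_lifetime_years(tbw,
                                       cluster_client_write_mb_s=100,
                                       num_osds=36,
                                       journals_per_ssd=4,
                                       pool_size=3)
    print("%s: ~%.1f years" % (name, years))

With those (made-up) inputs the 70 TBW drive is gone in a matter of weeks while the ~1800 TBW drive lasts on the order of a year, which is exactly the gap Lionel is pointing at.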
> When using SSDs for journals you have to monitor both:
> * the wear leveling indicator (or something equivalent) of each SSD; SMART data may not be available if you use a RAID controller, but you can usually still get the total amount of data written,
> * the client writes on the whole cluster.
> Then check periodically how much lifespan each of your SSDs has left, based on its current state, the average write speed, the estimated write amplification (both from the pool's size parameter and from the SSD model's inherent write amplification) and the amount of data you expect rebalance events to move.
> Ideally you should make this computation before choosing the SSD models, but several variables are not easy to predict and will probably change during the life of your cluster.
>
> Lionel

--
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
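As a starting point for the monitoring Lionel suggests, here is a small sketch that pulls the total host writes from SMART and extrapolates how much rated endurance is left. It assumes smartctl is installed, that the drive reports a Total_LBAs_Written attribute in 512-byte sectors (attribute names and units vary by vendor, so check your model), and that you know the drive's TBW rating and how long it has been in service; behind a RAID controller you may need smartctl's -d option or the controller's own CLI instead.

import re
import subprocess

def total_tb_written(device):
    # Parse 'smartctl -A' output and read the raw value of the
    # Total_LBAs_Written attribute (the last column of its line).
    out = subprocess.check_output(["smartctl", "-A", device]).decode("utf-8", "replace")
    m = re.search(r"Total_LBAs_Written.*?(\d+)\s*$", out, re.MULTILINE)
    if not m:
        raise RuntimeError("no Total_LBAs_Written attribute on %s" % device)
    return int(m.group(1)) * 512 / 1e12   # 512-byte LBAs -> TB

def days_left(device, tbw_rating_tb, days_in_service):
    # Extrapolate linearly from the average write rate seen so far.
    written_tb = total_tb_written(device)
    tb_per_day = written_tb / float(days_in_service)
    return (tbw_rating_tb - written_tb) / tb_per_day

# Example: a journal SSD rated for 70 TBW, in service for 90 days.
print("~%.0f days of rated endurance left" % days_left("/dev/sdb", 70, 90))

This only looks at one drive and ignores future rebalance events, so treat the output as an optimistic upper bound; the useful part is feeding it into whatever monitoring you already run and alerting well before it reaches zero.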