Write endurance is kinda bullshit. We have Crucial 960GB drives storing data and we've only managed to take 2% off the drives' life in a year, with hundreds of TB written weekly. This stuff is way more durable than anyone gives it credit for.

----- Original Message -----
From: "Lionel Bouton" <lionel+ceph@xxxxxxxxxxx>
To: "Andrei Mikhailovsky" <andrei@xxxxxxxxxx>, "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
Sent: Tuesday, December 22, 2015 11:04:26 AM
Subject: Re: Intel S3710 400GB and Samsung PM863 480GB fio results

On 22/12/2015 13:43, Andrei Mikhailovsky wrote:
> Hello guys,
>
> Was wondering if anyone has done testing on the Samsung PM863 120GB version to see how it performs? IMHO the 480GB version seems like a waste for the journal, as you only need a small disk to fit 3-4 OSD journals. Unless you get a far greater durability.

The problem is endurance. If we use the 480GB model for 3 OSDs each on the cluster we might build, we expect 3 years (with some margin for error, but not including any write amplification at the SSD level) before the SSDs fail. In our context a 120GB model might not even last a year (its endurance is 1/4 that of the 480GB model).

This is why the SM863 models will probably be more suitable if you have access to them: you can use smaller drives which cost less and still get more endurance (you'll have to check the performance though; smaller models usually have lower IOPS and bandwidth).

> I am planning to replace my current journal SSDs over the next month or so and would like to find out if there is a good alternative to Intel's 3700/3500 series.

The 3700 series is a safe bet (the 100GB model is rated for ~1.8 PBW). The 3500 models probably don't have enough endurance for many Ceph clusters to be cost effective: the 120GB model is only rated for 70 TBW, and you have to account for both client writes and rebalance events.

I'm uneasy with SSDs expected to fail within the life of the system they are in: you can get a cascade effect where an SSD failure brings down several OSDs, triggering a rebalance which might make SSDs installed at the same time fail too. In the best scenario you will reach your min_size (>=2) and block writes, which would prevent further SSD failures until you move the journals to fresh SSDs. If min_size = 1 you might actually lose data.

If you expect to replace your current journal SSDs, I would make a staggered deployment over several months to a year so they don't all fail at the same time in case of an unforeseen problem. In addition, this lets you evaluate the performance and behavior of a new SSD model with your hardware (there have been reports of performance problems with some combinations of RAID controllers and SSD models/firmware versions) without impacting your cluster's overall performance too much.

When using SSDs for journals you have to monitor both:
* the wear leveling (or something equivalent) of each SSD - SMART data may not be available if you use a RAID controller, but you can usually get the total amount of data written,
* the client writes on the whole cluster.

Then check periodically how much lifespan is left for each of your SSDs based on its current state, the average write speed, the estimated write amplification (both from the pool's size parameter and the SSD model's inherent write amplification) and the amount of data you expect rebalance events to move.
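To make that concrete, here is a very rough back-of-the-envelope sketch of the kind of computation I mean, in Python. Every number in it is an illustrative placeholder, not data from a real cluster: plug in your own TBW rating, measured client write rate, pool size, number of journal SSDs and amplification estimate.

    # Rough journal-SSD lifespan estimate. All values below are illustrative
    # placeholders; replace them with your own ratings and measurements.
    tbw_rating_tb        = 70.0   # endurance rating of the SSD model (TB written)
    written_so_far_tb    = 5.0    # data already written to this SSD (SMART / RAID controller stats)
    client_writes_tb_day = 0.5    # average client writes on the whole cluster, per day
    rebalance_tb_year    = 20.0   # journal writes you expect from rebalance events, per year
    pool_size            = 3      # replication factor: each client write is journaled "size" times
    journal_ssds         = 6      # number of journal SSDs sharing the load
    ssd_write_amp        = 1.5    # estimated write amplification inherent to the SSD model

    # average TB written per day on one journal SSD
    per_ssd_tb_day = ((client_writes_tb_day * pool_size + rebalance_tb_year / 365.0)
                      / journal_ssds) * ssd_write_amp

    remaining_tb = tbw_rating_tb - written_so_far_tb
    print("expected remaining lifespan: %.1f years" % (remaining_tb / per_ssd_tb_day / 365.0))

Re-running this with the real write totals from SMART (or from the RAID controller) every few months is what I mean by checking periodically.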
Ideally you should make this computation before choosing the SSD models, but several variables are not always easy to predict and will probably change during the life of your cluster.

Lionel

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com