Re: Sharing SSD journals and SSD drive choice

I can attest to this.  I had a cluster that used 3510s for the first rack and then switched to 3710s after that.  We had 3TB drives and every single 3510 ran out of writes after 1.5 years.  We noticed because we tracked down incredibly slow performance to a subset of OSDs, and each time they shared a common journal.  This happened for about 2 weeks and 4 journals.  That was when we realized that they were all 3510 journals, and SMART showed that not only the journals we had tracked down but all of the 3510s were out of writes.  Replacing all of your journals every 1.5 years is far more expensive than the extra cost of the 3710s.  That was our use case and experience, but I'm pretty sure that any cluster large enough to fill at least most of a rack will run into this sooner rather than later.
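
For anyone who wants to catch this before the slow-OSD hunt starts, here is a minimal sketch of reading the wear attribute out of `smartctl -A` output. It assumes the drive reports wear as attribute 233 (Media_Wearout_Indicator), which is how Intel DC SATA drives commonly expose it, with the normalized value counting down from 100 as rated wear is consumed. Attribute IDs differ between vendors, and the sample text and the threshold of 10 are made up for illustration.

```python
import subprocess

# Hypothetical sample in the column layout printed by `smartctl -A`.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       13140
233 Media_Wearout_Indicator 0x0032   001   001   000    Old_age   Always       -       0
"""


def parse_smart_attributes(text: str) -> dict[int, dict]:
    """Map attribute ID to its name, normalized value and raw value."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header row starts with "ID#" and is skipped.
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[int(fields[0])] = {
                "name": fields[1],
                "value": int(fields[3]),
                "raw": " ".join(fields[9:]),
            }
    return attrs


def is_worn(attrs: dict[int, dict], attr_id: int = 233, threshold: int = 10) -> bool:
    """True if the wear attribute exists and is at or below the (assumed) threshold."""
    return attr_id in attrs and attrs[attr_id]["value"] <= threshold


def read_device(device: str) -> dict[int, dict]:
    """Run smartctl against a real device (needs smartmontools and root)."""
    out = subprocess.run(["smartctl", "-A", device], capture_output=True, text=True)
    return parse_smart_attributes(out.stdout)


if __name__ == "__main__":
    print(is_worn(parse_smart_attributes(SAMPLE)))
```

Running something like this across every journal device on a schedule would have flagged the whole first rack at once instead of four journals over two weeks.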

On Mon, May 1, 2017 at 11:15 AM Maxime Guyot <maxime@xxxxxxxxxxx> wrote:
Hi,

Lots of good info on SSD endurance in this thread.

For a Ceph journal you should also consider the size of the backing OSDs: the same SSD journal won't last as long backing 5x8TB OSDs as it would backing 5x1TB OSDs.

For example, the S3510 480GB (275TB of endurance), if backing 5x8TB (40TB) OSDs, will provide very little endurance: assuming triple replication, you will be able to fill the OSDs about twice and that's it (275/(5x8)/3).
On the other end of the scale, a 1.2TB S3710 backing 5x1TB will be able to fill them 1620 times before running out of endurance (24300/(5x1)/3).
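
The arithmetic above can be written down as a small helper so you can plug in your own drives. This is only a sketch of the same rule of thumb (endurance divided by backing capacity divided by replica count); the function and parameter names are mine, and it ignores write amplification and any workload detail.

```python
def journal_fill_cycles(endurance_tb: float, osd_count: int,
                        osd_size_tb: float, replication: int = 3) -> float:
    """Rough number of times the backing OSDs can be filled before the
    journal SSD's rated endurance is used up."""
    backing_capacity_tb = osd_count * osd_size_tb
    return endurance_tb / backing_capacity_tb / replication


if __name__ == "__main__":
    # S3510 480GB (275TB endurance) in front of 5x8TB OSDs
    print(round(journal_fill_cycles(275, 5, 8), 2))   # 2.29
    # S3710 1.2TB (24300TB endurance) in front of 5x1TB OSDs
    print(round(journal_fill_cycles(24300, 5, 1)))    # 1620
```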

Ultimately it depends on your workload. Some people can get away with S3510 as journals if the workload is read intensive, but in most cases the higher endurance is a safe bet (S3710 or S3610).

Cheers,
Maxime


On Mon, 1 May 2017 at 11:04 Jens Dueholm Christensen <JEDC@xxxxxxxxxxx> wrote:
Sorry for top-posting, but..

The Intel 35xx drives are rated for a much lower DWPD (drive-writes-per-day) than the 36xx or 37xx models.

Keep in mind that a single SSD that acts as journal for 5 OSDs will receive ALL writes for those 5 OSDs before the data is moved off to the OSDs' actual data drives.

This makes for quite a lot of writes, and along with the consumer/enterprise advice others have written about, your SSD journal devices will receive quite a lot of writes over time.

The S3510 is rated for 0.3 DWPD for 5 years (http://www.intel.com/content/www/us/en/solid-state-drives/ssd-dc-s3510-spec.html)
The S3610 is rated for 3 DWPD for 5 years  (http://www.intel.com/content/www/us/en/solid-state-drives/ssd-dc-s3610-spec.html)
The S3710 is rated for 10 DWPD for 5 years (http://www.intel.com/content/www/us/en/solid-state-drives/ssd-dc-s3710-spec.html)

A 480GB S3510 has no endurance left once you have written 0.275PB to it.
A 480GB S3610 has no endurance left once you have written 3.7PB to it.
A 400GB S3710 has no endurance left once you have written 8.3PB to it.
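
To put those totals on a calendar, here is a rough sketch that turns a rated endurance figure into years of life at a given sustained write rate. The 5 MB/s average used in the example is a made-up number, not something measured in this thread, and the helper name is mine; decimal units (1 TB = 1,000,000 MB) and 365-day years are assumed.

```python
SECONDS_PER_YEAR = 365 * 86400


def years_until_worn(endurance_tb: float, avg_write_mb_s: float) -> float:
    """Years until the rated endurance is consumed at a constant write rate."""
    endurance_mb = endurance_tb * 1_000_000  # decimal TB to MB
    return endurance_mb / avg_write_mb_s / SECONDS_PER_YEAR


if __name__ == "__main__":
    rated_tb = {
        "S3510 480GB": 275,    # 0.275PB
        "S3610 480GB": 3700,   # 3.7PB
        "S3710 400GB": 8300,   # 8.3PB
    }
    for model, tb in rated_tb.items():
        # 5 MB/s is a hypothetical average journal write rate
        print(f"{model}: {years_until_worn(tb, 5):.1f} years at 5 MB/s")
```

Since a journal sees every write for all of the OSDs behind it, the rate to plug in is the combined write rate of those OSDs, not that of a single one.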

This makes for quite a lot of difference over time - even if an S3510 will only act as journal for 1 or 2 OSDs, it will wear out much, much faster than the others.

And I know I've used the xx10 models above, but the xx00 models have all been replaced by those newer models now.

And yes, the xx10 models are using MLC NAND, but so were the xx00 models, which have a proven track record and deliver what Intel promised in the datasheet.

You could try and take a look at some of the enterprise SSDs that Samsung has launched.
Price-wise they are very competitive with Intel, but I want to see (or at least hear from others) if they can deliver what their datasheet promises.
Samsung's consumer SSDs did not (840/850 Pro), so I'm only using S3710s in my cluster.


Before I created our own cluster some time ago, I found these threads on the mailing list regarding the exact disks we had been expecting to use (Samsung 840/850 Pro); that plan was quickly changed to Intel S3710s:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-November/044258.html
https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg17369.html

A longish thread about Samsung consumer drives:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-April/000572.html
- highlights from that thread:
  - http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-April/000610.html
  - http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-April/000611.html
  - http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-April/000798.html

Regards,
Jens Dueholm Christensen
Rambøll Survey IT

-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Adam Carheden
Sent: Wednesday, April 26, 2017 5:54 PM
To: ceph-users@xxxxxxxxxxxxxx
Subject: Re: Sharing SSD journals and SSD drive choice

Thanks everyone for the replies.

I will be avoiding TLC drives; it was just something easy to benchmark
with existing equipment. I hadn't thought of unscrupulous data durability
lies or performance suddenly tanking in unpredictable ways. I guess it
all comes down to trusting the vendor since it would be expensive in
time and $$ to test for such things.

Any thoughts on multiple Intel 35XX vs a single 36XX/37XX? All have "DC"
prefixes and are listed in the Data Center section of their marketing
pages, so I assume they'll all have the same quality underlying NAND.

--
Adam Carheden


On 04/26/2017 09:20 AM, Chris Apsey wrote:
> Adam,
>
> Before we deployed our cluster, we did extensive testing on all kinds of
> SSDs, from consumer-grade TLC SATA all the way to Enterprise PCI-E NVME
> Drives.  We ended up going with a ratio of 1x Intel P3608 PCI-E 1.6 TB
> to 12x HGST 10TB SAS3 HDDs.  It provided the best
> price/performance/density balance for us overall.  As a frame of
> reference, we have 384 OSDs spread across 16 nodes.
>
> A few (anecdotal) notes:
>
> 1. Consumer SSDs have unpredictable performance under load; write
> latency can go from normal to unusable with almost no warning.
> Enterprise drives generally show much less load sensitivity.
> 2. Write endurance; while it may appear that having several
> consumer-grade SSDs backing a smaller number of OSDs will yield better
> longevity than an enterprise grade SSD backing a larger number of OSDs,
> the reality is that enterprise drives that use SLC or eMLC are generally
> an order of magnitude more reliable when all is said and done.
> 3. Power Loss protection (PLP).  Consumer drives generally don't do well
> when power is suddenly lost.  Yes, we should all have UPS, etc., but
> things happen.  Enterprise drives are much more tolerant of
> environmental failures.  Recovering from misplaced objects while also
> attempting to serve clients is no fun.
>
>
>
>
>
> ---
> v/r
>
> Chris Apsey
> bitskrieg@xxxxxxxxxxxxx
> https://www.bitskrieg.net
>
> On 2017-04-26 10:53, Adam Carheden wrote:
>> What I'm trying to get from the list is /why/ the "enterprise" drives
>> are important. Performance? Reliability? Something else?
>>
>> The Intel was the only one I was seriously considering. The others were
>> just ones I had for other purposes, so I thought I'd see how they fared
>> in benchmarks.
>>
>> The Intel was the clear winner, but my tests did show that throughput
>> tanked with more threads. Hypothetically, if I was throwing 16 OSDs at
>> it, all with osd op threads = 2, do the benchmarks below not show that
>> the Hynix would be a better choice (at least for performance)?
>>
>> Also, 4 x Intel DC S3520 costs as much as 1 x Intel DC S3610. Obviously
>> the single drive leaves more bays free for OSD disks, but is there any
>> other reason a single S3610 is preferable to 4 S3520s? Wouldn't 4xS3520s
>> mean:
>>
>> a) fewer OSDs go down if the SSD fails
>>
>> b) better throughput (I'm speculating that the S3610 isn't 4 times
>> faster than the S3520)
>>
>> c) load spread across 4 SATA channels (I suppose this doesn't really
>> matter since the drives can't throttle the SATA bus).
>>
>>
>> --
>> Adam Carheden
>>
>> On 04/26/2017 01:55 AM, Eneko Lacunza wrote:
>>> Adam,
>>>
>>> What David said before about SSD drives is very important. I will tell
>>> you another way: use enterprise grade SSD drives, not consumer grade.
>>> Also, pay attention to endurance.
>>>
>>> The only suitable drive for Ceph I see in your tests is SSDSC2BB150G7,
>>> and probably it isn't even the most suitable SATA SSD disk from Intel;
>>> better use S3610 or S3710 series.
>>>
>>> Cheers
>>> Eneko
>>>
>>> On 25/04/17 at 21:02, Adam Carheden wrote:
>>>> On 04/25/2017 11:57 AM, David wrote:
>>>>> On 19 Apr 2017 18:01, "Adam Carheden" <carheden@xxxxxxxx
>>>>> <mailto:carheden@xxxxxxxx>> wrote:
>>>>>
>>>>>      Does anyone know if XFS uses a single thread to write to its
>>>>> journal?
>>>>>
>>>>>
>>>>> You probably know this but just to avoid any confusion, the journal in
>>>>> this context isn't the metadata journaling in XFS, it's a separate
>>>>> journal written to by the OSD daemons
>>>> Ha! I didn't know that.
>>>>
>>>>> I think the number of threads per OSD is controlled by the 'osd op
>>>>> threads' setting which defaults to 2
>>>> So the ideal (for performance) CEPH cluster would be one SSD per HDD
>>>> with 'osd op threads' set to whatever value fio shows as the optimal
>>>> number of threads for that drive then?
>>>>
>>>>> I would avoid the SanDisk and Hynix. The s3500 isn't too bad. Perhaps
>>>>> consider going up to a 37xx and putting more OSDs on it. Of course
>>>>> with
>>>>> the caveat that you'll lose more OSDs if it goes down.
>>>> Why would you avoid the SanDisk and Hynix? Reliability (I think those
>>>> two are both TLC)? Brand trust? If it's my benchmarks in my previous
>>>> email, why not the Hynix? It's slower than the Intel, but sort of
>>>> decent, at least compared to the SanDisk.
>>>>
>>>> My final numbers are below, including an older Samsung Evo (MLC I
>>>> think)
>>>> which did horribly, though not as bad as the SanDisk. The Seagate is a
>>>> 10kRPM SAS "spinny" drive I tested as a control/SSD-to-HDD comparison.
>>>>
>>>>           SanDisk SDSSDA240G, fio  1 jobs:   7.0 MB/s (5 trials)
>>>>           SanDisk SDSSDA240G, fio  2 jobs:   7.6 MB/s (5 trials)
>>>>           SanDisk SDSSDA240G, fio  4 jobs:   7.5 MB/s (5 trials)
>>>>           SanDisk SDSSDA240G, fio  8 jobs:   7.6 MB/s (5 trials)
>>>>           SanDisk SDSSDA240G, fio 16 jobs:   7.6 MB/s (5 trials)
>>>>           SanDisk SDSSDA240G, fio 32 jobs:   7.6 MB/s (5 trials)
>>>>           SanDisk SDSSDA240G, fio 64 jobs:   7.6 MB/s (5 trials)
>>>> HFS250G32TND-N1A2A 30000P10, fio  1 jobs:   4.2 MB/s (5 trials)
>>>> HFS250G32TND-N1A2A 30000P10, fio  2 jobs:   0.6 MB/s (5 trials)
>>>> HFS250G32TND-N1A2A 30000P10, fio  4 jobs:   7.5 MB/s (5 trials)
>>>> HFS250G32TND-N1A2A 30000P10, fio  8 jobs:  17.6 MB/s (5 trials)
>>>> HFS250G32TND-N1A2A 30000P10, fio 16 jobs:  32.4 MB/s (5 trials)
>>>> HFS250G32TND-N1A2A 30000P10, fio 32 jobs:  64.4 MB/s (5 trials)
>>>> HFS250G32TND-N1A2A 30000P10, fio 64 jobs:  71.6 MB/s (5 trials)
>>>>                  SAMSUNG SSD, fio  1 jobs:   2.2 MB/s (5 trials)
>>>>                  SAMSUNG SSD, fio  2 jobs:   3.9 MB/s (5 trials)
>>>>                  SAMSUNG SSD, fio  4 jobs:   7.1 MB/s (5 trials)
>>>>                  SAMSUNG SSD, fio  8 jobs:  12.0 MB/s (5 trials)
>>>>                  SAMSUNG SSD, fio 16 jobs:  18.3 MB/s (5 trials)
>>>>                  SAMSUNG SSD, fio 32 jobs:  25.4 MB/s (5 trials)
>>>>                  SAMSUNG SSD, fio 64 jobs:  26.5 MB/s (5 trials)
>>>>          INTEL SSDSC2BB150G7, fio  1 jobs:  91.2 MB/s (5 trials)
>>>>          INTEL SSDSC2BB150G7, fio  2 jobs: 132.4 MB/s (5 trials)
>>>>          INTEL SSDSC2BB150G7, fio  4 jobs: 138.2 MB/s (5 trials)
>>>>          INTEL SSDSC2BB150G7, fio  8 jobs: 116.9 MB/s (5 trials)
>>>>          INTEL SSDSC2BB150G7, fio 16 jobs:  61.8 MB/s (5 trials)
>>>>          INTEL SSDSC2BB150G7, fio 32 jobs:  22.7 MB/s (5 trials)
>>>>          INTEL SSDSC2BB150G7, fio 64 jobs:  16.9 MB/s (5 trials)
>>>>          SEAGATE ST9300603SS, fio  1 jobs:   0.7 MB/s (5 trials)
>>>>          SEAGATE ST9300603SS, fio  2 jobs:   0.9 MB/s (5 trials)
>>>>          SEAGATE ST9300603SS, fio  4 jobs:   1.6 MB/s (5 trials)
>>>>          SEAGATE ST9300603SS, fio  8 jobs:   2.0 MB/s (5 trials)
>>>>          SEAGATE ST9300603SS, fio 16 jobs:   4.6 MB/s (5 trials)
>>>>          SEAGATE ST9300603SS, fio 32 jobs:   6.9 MB/s (5 trials)
>>>>          SEAGATE ST9300603SS, fio 64 jobs:   0.6 MB/s (5 trials)
>>>>
>>>> For those who come across this and are looking for drives for purposes
>>>> other than CEPH, those are all sequential write numbers with caching
>>>> disabled, a very CEPH-journal-specific test. The SanDisk held its own
>>>> against the Intel using some benchmarks on Windows that didn't disable
>>>> caching. It may very well be a perfectly good drive for other purposes.
>>>>
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users@xxxxxxxxxxxxxx
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>>
>>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
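
For anyone wanting to repeat the journal-style benchmark quoted above (sequential writes with caching disabled, at 1 to 64 fio jobs), here is a sketch that builds the fio command lines. The exact fio options used for the numbers in this thread were not posted in these messages, so the 4k block size, `--direct=1`/`--sync=1`, queue depth of 1 and 60 second runtime are assumptions based on the usual way of approximating journal writes. `/dev/sdX` is a placeholder, and writing to a raw device destroys whatever is on it.

```python
import shlex

# Job counts matching the table in the quoted message
JOB_COUNTS = [1, 2, 4, 8, 16, 32, 64]


def journal_fio_command(device: str, numjobs: int, runtime_s: int = 60) -> list[str]:
    """Build a fio invocation for synchronous, uncached sequential writes."""
    return [
        "fio",
        f"--name=journal-test-{numjobs}",
        f"--filename={device}",
        "--rw=write",          # sequential writes
        "--bs=4k",             # assumed block size
        "--direct=1",          # bypass the page cache
        "--sync=1",            # open with O_SYNC, as a journal would need
        "--iodepth=1",
        f"--numjobs={numjobs}",
        f"--runtime={runtime_s}",
        "--time_based",
        "--group_reporting",
    ]


if __name__ == "__main__":
    # Only prints the commands; run them by hand against a scratch device.
    for n in JOB_COUNTS:
        print(shlex.join(journal_fio_command("/dev/sdX", n)))
```

The script deliberately prints the commands rather than running them, so nothing gets overwritten by accident.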
