Re: SSD selection

On Sun, 1 Mar 2015 21:26:16 -0600 Tony Harris wrote:

> On Sun, Mar 1, 2015 at 6:32 PM, Christian Balzer <chibi@xxxxxxx> wrote:
> 
> >
> > Again, ultimately you will need to sit down, compile and compare the
> > numbers.
> >
> > Start with this:
> > http://ark.intel.com/products/family/83425/Data-Center-SSDs
> >
> > Pay close attention to the 3610 SSDs; while slightly more expensive,
> > they offer 10 times the endurance.
> >
> 
> Unfortunately, $300 vs $100 isn't really slightly more expensive ;)
>  Although I did notice that the 3710's can be gotten for ~210.
> 
> 
I'm not sure where you got those prices or what you're comparing with
what, but if you look at the OEM prices at the URL above (which track
quite closely with actual shopping prices), a comparison of closely
matched capabilities goes like this:

http://ark.intel.com/compare/71913,86640,75680,75679

You really wouldn't want less than 200MB/s, even in your setup, which I
take to be 2Gb/s from what you wrote below.
Note that the 100GB 3700 is going to perform way better and last immensely
longer than the 160GB 3500 while being only moderately more expensive,
and the 200GB 3610 is faster (IOPS), lasts 10 times longer AND is cheaper
than the 240GB 3500.

It is pretty much those numbers that made me use four 100GB 3700s instead
of 240GB 3500s: much more bang for the buck, it still fit my budget, and
it could deal with 80% of the network bandwidth.

> 
> >
> > Guestimate the amount of data written to your cluster per day, break
> > that down to the load a journal SSD will see and then multiply by at
> > least 5 to be on the safe side. Then see which SSD will fit your
> > expected usage pattern.
> >
> 
> Luckily I don't think there will be a ton of data per day written.  The
> majority of servers whose VHDs will be stored in our cluster don't have a
> lot of frequent activity - aside from a few Windows servers that have DB
> servers in them (and even they don't write a ton of data per day really).
> 

Being able to put even a coarse number on this will tell you whether you
can skimp on endurance and still have your cluster last something like 5
years, or whether a higher-endurance SSD is going to be cheaper overall.
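
As a rough illustration of that calculation, here is a back-of-the-envelope
sketch with made-up numbers (replace them with your own estimates and the
TBW/DWPD figures from the vendor's spec sheets):

# Back-of-the-envelope journal SSD endurance check (illustrative numbers only).

cluster_writes_per_day_tb = 0.5   # assumed client data written per day (TB)
replication_factor = 3            # each client write is journaled on every replica
journal_ssds = 6                  # journal SSDs across the whole cluster
safety_factor = 5                 # the "multiply by at least 5" above
target_lifetime_years = 5

# Every replicated write passes through some journal SSD exactly once.
writes_per_ssd_per_day_tb = (cluster_writes_per_day_tb * replication_factor
                             / journal_ssds)

required_endurance_tbw = (writes_per_ssd_per_day_tb * 365
                          * target_lifetime_years * safety_factor)

print(f"~{writes_per_ssd_per_day_tb:.2f} TB/day per journal SSD")
print(f"need a drive rated for roughly {required_endurance_tbw:.0f} TBW")
# Compare the result against the endurance rating on the drive's data sheet
# (e.g. the ark.intel.com pages linked above).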

> 
> 
> >
> > You didn't mention your network, but I assume it's 10Gb/s?
> >
> 
> Would be nice, if I had access to the kind of cash to get a 10Gb
> network, I wouldn't be stressing the cost of a set of SSDs ;)
> 
So it's 2x1Gb/s then?

At that speed a single SSD from the list above would do (a quick
back-of-the-envelope check follows below), provided you are
a) aware of the risk that this one SSD failing will kill all OSDs on that
node, and
b) not expecting your cluster to be upgraded.
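
A quick sanity check of that claim, assuming 2 bonded 1Gb/s links and an
SSD with roughly 200MB/s of sequential write (the ballpark of the drives
in the comparison link above; substitute your actual numbers):

# Can one journal SSD keep up with a 2x1Gb/s network? (illustrative)
network_gbit = 2 * 1.0                  # two bonded 1Gb/s links
network_mb_s = network_gbit * 1000 / 8  # ~250 MB/s of raw line rate
ssd_write_mb_s = 200                    # assumed sequential write of the SSD

print(f"network ceiling ~{network_mb_s:.0f} MB/s, SSD ~{ssd_write_mb_s} MB/s")
# Protocol overhead keeps usable throughput below the raw 250 MB/s, so a
# single ~200 MB/s SSD is roughly matched to this network.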

> 
> >
> > At 135MB/s writes the 100GB DC S3500 will not cut the mustard in any
> > shape or form when journaling for 4 HDDs.
> > With 2 HDDs it might be a so-so choice, but still falling short.
> > Most current 7.2K RPM HDDs these days can do around 150MB/s writes,
> > however that's neither uniform, nor does Ceph do anything resembling a
> > sequential write (which is where these speeds come from), so in my book
> > 80-120MB/s on the SSD journal per HDD are enough.
> >
> 
> The drives I have access to that are in the cluster aren't the fastest,
> current drives out there; but what you're describing, to have even 3
> HDD's per SSD, you'd need an SSD running 240-360MB/s write
> capability...  Why does the Ceph documentation then talk about 1 SSD per
> 4-5 OSD drives?  It would be near impossible to get an SSD to meet that
> level of speed...
> 
They do exist, albeit not cheaply. 
2 400GB 3700s will nearly saturate a 10Gb/s link.

Once your SSD(s) can handle the full network bandwidth, there's no point
in adding more.
Also, for most people the key reason to deploy SSD journals is the ability
to do vastly more IOPS; aside from a few use cases and during backfills,
their bandwidth (write speed) isn't all that important.
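
To make the sizing concrete, here is a minimal sketch of the two limits
discussed in this thread: the 80-120MB/s-per-HDD rule of thumb and the
network ceiling. The drive and network speeds are assumptions for
illustration, not spec-sheet quotes:

import math

hdd_count = 7                  # HDD OSDs in the node
per_hdd_journal_mb_s = 100     # the 80-120 MB/s per-HDD rule of thumb above
ssd_write_mb_s = 200           # assumed sequential write of one journal SSD
network_mb_s = 10 * 1000 / 8   # a 10Gb/s link is ~1250 MB/s raw

# Limit 1: enough journal bandwidth for what the HDDs behind it can absorb.
ssds_for_hdds = math.ceil(hdd_count * per_hdd_journal_mb_s / ssd_write_mb_s)

# Limit 2: no benefit past the network ceiling.
ssds_for_network = math.ceil(network_mb_s / ssd_write_mb_s)

needed = min(ssds_for_hdds, ssds_for_network)
print(f"HDD limit: {ssds_for_hdds} SSDs, network ceiling: {ssds_for_network} SSDs")
print(f"bandwidth-wise, ~{needed} SSDs at {ssd_write_mb_s} MB/s each would do")
# IOPS and failure domain (how many OSDs die with one SSD) usually matter
# more than raw bandwidth, as noted above.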


> 
> >
> > A speed hit is one thing, more than halving your bandwidth is bad,
> > especially when thinking about backfilling.
> >
> 
> Although I'm working with more than 1Gb/s, it's a lot less than 10Gb/s,
> so there might be a threshold there where we wouldn't experience an issue
> where someone using 10G would (God I'd love a 10G network, but no budget
> for it)
>
Correct.
 
> 
> >
> > Journal size doesn't matter that much; 10GB is fine, and 20GB x4 is OK
> > with the 100GB DC drives. With 5xx consumer models I'd leave at least
> > 50% free.
> >
> 
> Well, I'd like to steer away from the consumer models if possible since
> they (AFAIK) don't contain caps to finish writes should a power loss
> occur, unless there is one that does?
> 
Not that I'm aware of. 

Also note that while Andrei is happy with his 520s (especially compared to
the Samsungs), I have various 5x0 Intel SSDs in use as well, and while they
are quite nice, the 3700s are so much (and so consistently) faster in
comparison that one can't believe it ain't butter. ^o^
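
On the journal partition sizes discussed in this thread (4 x 10GB vs
4 x 20GB), a quick layout sketch for a ~100GB DC-class journal SSD; the
numbers are simply the ones from this thread:

ssd_size_gb = 100          # usable capacity of the journal SSD
journal_gb = 10            # 10GB journals, as discussed above
osds_per_ssd = 4           # HDD OSDs journaling to this SSD

used_gb = journal_gb * osds_per_ssd
free_pct = 100 * (ssd_size_gb - used_gb) / ssd_size_gb

print(f"{osds_per_ssd} x {journal_gb}GB journals = {used_gb}GB used, "
      f"{free_pct:.0f}% left unpartitioned")
# On a DC-class drive the leftover space mostly helps wear leveling; with
# 5xx consumer models the advice above is to leave at least 50% free.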

Christian

> -Tony
> 
> 
> >
> > Christian
> >
> > On Sun, 1 Mar 2015 15:08:10 -0600 Tony Harris wrote:
> >
> > > Now, I've never setup a journal on a separate disk, I assume you
> > > have 4 partitions at 10GB / partition, I noticed in the docs they
> > > referred to 10GB as a good starter.  Would it be better to have 4
> > > partitions @ 10G each or 4 @ 20?
> > >
> > > I know I'll take a speed hit, but unless I can get my work to buy the
> > > drives, they will have to sit with what my personal budget can
> > > afford and be willing to donate ;)
> > >
> > > -Tony
> > >
> > > On Sun, Mar 1, 2015 at 2:54 PM, Andrei Mikhailovsky
> > > <andrei@xxxxxxxxxx> wrote:
> > >
> > > > I am not sure about the enterprise grade and underprovisioning,
> > > > but for the Intel 520s I've got the 240GB ones (the 240's speeds are
> > > > a bit better than the 120's), and I've left 50% underprovisioned.
> > > > I've got 10GB for journals and I am using 4 OSDs per SSD.
> > > >
> > > > Andrei
> > > >
> > > >
> > > > ------------------------------
> > > >
> > > > *From: *"Tony Harris" <nethfel@xxxxxxxxx>
> > > > *To: *"Andrei Mikhailovsky" <andrei@xxxxxxxxxx>
> > > > *Cc: *ceph-users@xxxxxxxxxxxxxx, "Christian Balzer" <chibi@xxxxxxx>
> > > > *Sent: *Sunday, 1 March, 2015 8:49:56 PM
> > > >
> > > > *Subject: *Re:  SSD selection
> > > >
> > > > Ok, any size suggestion?  Can I get a 120 and be ok?  I see I can
> > > > get DCS3500 120GB for within $120/drive so it's possible to get 6
> > > > of them...
> > > >
> > > > -Tony
> > > >
> > > > On Sun, Mar 1, 2015 at 12:46 PM, Andrei Mikhailovsky
> > > > <andrei@xxxxxxxxxx> wrote:
> > > >
> > > >>
> > > >> I would not use a single SSD for 5 OSDs. I would recommend 3-4
> > > >> OSDs max per SSD, or you will hit a bottleneck on the SSD side.
> > > >>
> > > >> I've had a reasonable experience with Intel 520 ssds (which are
> > > >> not produced anymore). I've found Samsung 840 Pro to be horrible!
> > > >>
> > > >> Otherwise, it seems that everyone here recommends the DC S3500 or
> > > >> DC S3700, as they have the best wear-per-$ ratio of all the drives.
> > > >>
> > > >> Andrei
> > > >>
> > > >>
> > > >> ------------------------------
> > > >>
> > > >> *From: *"Tony Harris" <nethfel@xxxxxxxxx>
> > > >> *To: *"Christian Balzer" <chibi@xxxxxxx>
> > > >> *Cc: *ceph-users@xxxxxxxxxxxxxx
> > > >> *Sent: *Sunday, 1 March, 2015 4:19:30 PM
> > > >> *Subject: *Re:  SSD selection
> > > >>
> > > >>
> > > >> Well, although I have 7 now per node, you make a good point and
> > > >> I'm in a position where I can either increase to 8 and split 4/4
> > > >> and have 2 SSDs, or reduce to 5 and use a single SSD per node
> > > >> (the system is not in production yet).
> > > >>
> > > >> Do all the DC lines have caps in them, or just the DC S line?
> > > >>
> > > >> -Tony
> > > >>
> > > >> On Sat, Feb 28, 2015 at 11:21 PM, Christian Balzer <chibi@xxxxxxx>
> > > >> wrote:
> > > >>
> > > >>> On Sat, 28 Feb 2015 20:42:35 -0600 Tony Harris wrote:
> > > >>>
> > > >>> > Hi all,
> > > >>> >
> > > >>> > I have a small cluster together and it's running fairly well (3
> > > >>> > nodes,
> > > >>> 21
> > > >>> > osds).  I'm looking to improve the write performance a bit
> > > >>> > though,
> > > >>> which
> > > >>> > I was hoping that using SSDs for journals would do.  But, I was
> > > >>> wondering
> > > >>> > what people had as recommendations for SSDs to act as journal
> > > >>> > drives. If I read the docs on ceph.com correctly, I'll need 2
> > > >>> > ssds per node (with 7 drives in each node, I think the
> > > >>> > recommendation was 1ssd per
> > > >>> 4-5
> > > >>> > drives?) so I'm looking for drives that will work well without
> > > >>> > breaking the bank for where I work (I'll probably have to
> > > >>> > purchase them myself and donate, so my budget is somewhat
> > > >>> > small).  Any suggestions?  I'd prefer one that can finish its
> > > >>> > write in a power outage case, the only one I know of off hand
> > > >>> > is the intel dcs3700 I think, but at $300 it's WAY above my
> > > >>> > affordability range.
> > > >>>
> > > >>> Firstly, an uneven number of OSDs (HDDs) per node will bite you
> > > >>> in the proverbial behind down the road when combined with
> > > >>> journal SSDs, as one of
> > > >>> those SSDs will wear out faster than the other.
> > > >>>
> > > >>> Secondly, how many SSDs you need is basically a trade-off between
> > > >>> price, performance, endurance and limiting failure impact.
> > > >>>
> > > >>> I have a cluster where I used 4 100GB DC S3700s with 8 HDD OSDs,
> > > >>> optimizing the write paths and IOPS and failure domain, but not
> > > >>> the sequential speed or cost.
> > > >>>
> > > >>> Depending on what your write load is and the expected lifetime of
> > > >>> this cluster, you might be able to get away with DC S3500s or
> > > >>> even better the new DC S3610s.
> > > >>> Keep in mind that buying a cheap, low endurance SSD now might
> > > >>> cost you more down the road if you have to replace it after a
> > > >>> year (TBW/$).
> > > >>>
> > > >>> All the cheap alternatives to DC-level SSDs tend to wear out too
> > > >>> fast, have no power caps, and have unpredictable (caused by
> > > >>> garbage collection) and steadily decreasing performance.
> > > >>>
> > > >>> Christian
> > > >>> --
> > > >>> Christian Balzer        Network/Systems Engineer
> > > >>> chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
> > > >>> http://www.gol.com/
> > > >>>
> > > >>
> > > >>
> > > >> _______________________________________________
> > > >> ceph-users mailing list
> > > >> ceph-users@xxxxxxxxxxxxxx
> > > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > >>
> > > >>
> > > >>
> > > >
> > > >
> >
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
> > http://www.gol.com/
> >


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



