Re: Best layout for SSD & SAS OSDs

> On 07 Sep 2015, at 12:19, Christian Balzer <chibi@xxxxxxx> wrote:
> 
> On Mon, 7 Sep 2015 12:11:27 +0200 Jan Schermer wrote:
> 
>> Dense SSD nodes are not really an issue for the network (unless you really
>> use all the throughput), 
> That's exactly what I wrote...
> And dense in the sense of saturating his network would be 4 SSDs, so:
> 
>> the issue is with CPU and memory throughput
>> (and possibly a crappy kernel scheduler, depending on how up-to-date a
>> distro you use). 
> That's what I wrote as well, which makes smaller nodes with more CPU
> resources attractive. 
> 
>> Also, if you want consistent performance even when a failure
>> occurs, you need to either have 100% reliable SSDs, or put them in RAID
>> for the journals. You don't want to rebuild all those HDD OSDs. Losing a
>> journal SSD is more likely than losing an HDD these days.
>> 
> Say what?
> 
> My "Enterprise" HDDs are failing quite nicely, while I have yet to lose a
> single Intel SSD, DC or otherwise.
> 

All I can say is "YMMV".
HDDs are a much more proven technology - they die mechanically and you can burn them in (they usually either die shortly after being put into production or from long-term wear).
SSDs have many issues (and HBAs have issues with SSDs), and some of these issues occur either randomly or because of a bug (like drives failing after exactly 3 months because of some internal timer overflowing).
Bottom line - HDDs are salvageable even when a firmware bug occurs or a DC spike fries them (you can swap electronics). SSDs are dead and you are SOL.

I think it's prudent to always keep this bottom line in mind and not rely on a single component...


> Christian
> 
>> Jan
>> 
>> 
>>> On 07 Sep 2015, at 05:53, Christian Balzer <chibi@xxxxxxx> wrote:
>>> 
>>> On Sat, 5 Sep 2015 07:13:29 -0300 German Anders wrote:
>>> 
>>>> Hi Christian,
>>>> 
>>>>   Ok, so you would say that it's better to rearrange the nodes so I don't
>>>> mix the HDD and SSD disks, right? And create high-perf nodes with SSDs
>>>> and others with HDDs; that's fine since it's a new deployment.
>>>> 
>>> It is what I would do, yes. 
>>> However, if you're limited to 7 nodes initially, specialized/optimized
>>> nodes might result in pretty small "subclusters" and thus relatively
>>> large failure domains. 
>>> 
>>> If, for example, this cluster consisted of 2 SSD and 5 HDD nodes,
>>> losing 1 of the SSD nodes would roughly halve your read speed from
>>> that pool (while, amusingly enough, improving your write speed ^o^).
>>> This is assuming a replication of 2 for SSD pools, which with DC SSDs
>>> is a pretty safe choice.
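
A rough back-of-the-envelope sketch of that failure-domain arithmetic (a minimal example only; the per-SSD read rate is an assumed figure, not one from this thread, while the 4 SSD OSDs per node match the layout quoted further down):

# Illustrative sketch only; SSD_READ_MBPS is an assumption.
SSD_NODES = 2          # SSD-only nodes backing the pool
SSDS_PER_NODE = 4      # SSD OSDs per node (as in the proposed layout)
SSD_READ_MBPS = 500    # assumed large-block read rate per DC-class SSD

# With replication 2 (one copy per node), primary OSDs are spread across
# both nodes, so aggregate read bandwidth scales with surviving nodes.
healthy = SSD_NODES * SSDS_PER_NODE * SSD_READ_MBPS
degraded = (SSD_NODES - 1) * SSDS_PER_NODE * SSD_READ_MBPS
print(f"healthy read ceiling : {healthy} MB/s")
print(f"one SSD node down    : {degraded} MB/s ({degraded / healthy:.0%})")

# Writes normally wait for both replicas; with one node gone only the
# surviving copy is written, hence the temporary write speed-up.
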
>>> 
>>> Also, dense SSD nodes will be able to saturate your network easily; for
>>> example 3-4 of the DC S3xxx SSDs will exceed the bandwidth of your
>>> links. This is of course only an issue if you're actually expecting
>>> huge amounts of reads/writes, as opposed to having lots of small
>>> transactions that depend on low latency.
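
As a quick sketch of that saturation point (the ~1.5GByte/s figure is the IPoIB-over-QDR number mentioned further down in this message; the per-SSD read rate is an assumption for a DC S3xxx drive):

import math

LINK_MBPS = 1500        # ~1.5 GByte/s usable over IPoIB (QDR), see below
SSD_READ_MBPS = 450     # assumed sequential read per DC S3xxx SSD

# How many SSDs it takes to fill one link with large reads.
print("SSDs to saturate one link:", math.ceil(LINK_MBPS / SSD_READ_MBPS))  # -> 4
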
>>> 
>>>>  Also, the nodes have different types of CPU and RAM: 4 have more CPU and
>>>> more memory (384GB), and the other 3 have less CPU and 128GB of RAM, so maybe
>>>> I can put the SSDs on the nodes with much more CPU and leave the HDDs for
>>>> the other nodes. 
>>> 
>>> I take it from this that you already have those machines?
>>> Which number and models of CPUs exactly?
>>> 
>>> What you want is as MUCH CPU power for any SSD node as possible, while
>>> the HDD nodes will benefit mostly from more RAM (page cache).
>>> 
>>>> The network is going to be InfiniBand FDR at 56Gb/s on all the
>>>> nodes, for both the public and the cluster network.
>>>> 
>>> Is this 1 interface for the public and 1 for the cluster network?
>>> Note that with IPoIB (with Accelio not being ready yet) I'm seeing at
>>> most 1.5GByte/s with QDR (40Gb/s).
>>> 
>>> If you were to start with a clean slate, I'd go with something like
>>> this to achieve the storage capacity you outlined:
>>> 
>>> * 1-2 quad-node chassis like this with 4-6 SSD OSDs per node and a 2nd
>>> IB HCA, or a similar product w/o onboard IB and a 2 port IB HCA:
>>> http://www.supermicro.com.tw/products/system/2U/2028/SYS-2028TP-HTFR.cfm
>>> That will give you 4-8 high performance SSD nodes in 2-4U.
>>> 
>>> * 5 HDD storage nodes, with 8-10 HDDs and 2-4 journal SSDs like this:
>>> http://www.supermicro.com.tw/products/system/2U/5028/SSG-5028R-E1CR12L.cfm
>>> (4 100GB DC S3700 will perform better than 2 200GB ones and give you
>>> smaller failure domains at about the same price).
>>> 
>>> Christian
>>> 
>>>>  Any other suggestion/comment?
>>>> 
>>>> Thanks a lot!
>>>> 
>>>> Best regards
>>>> 
>>>> German
>>>> 
>>>> 
>>>> On Saturday, September 5, 2015, Christian Balzer <chibi@xxxxxxx>
>>>> wrote:
>>>> 
>>>>> 
>>>>> Hello,
>>>>> 
>>>>> On Fri, 4 Sep 2015 12:30:12 -0300 German Anders wrote:
>>>>> 
>>>>>> Hi cephers,
>>>>>> 
>>>>>>  I have the following scheme:
>>>>>> 
>>>>>> 7x OSD servers with:
>>>>>> 
>>>>> Is this a new cluster, total initial deployment?
>>>>> 
>>>>> What else are these nodes made of, CPU/RAM/network?
>>>>> While uniform nodes have some appeal (interchangeability, one node
>>>>> down impacts the cluster uniformly), they tend to be compromise
>>>>> solutions. I personally would go with optimized HDD and SSD nodes.
>>>>> 
>>>>>>   4x 800GB SSD Intel DC S3510 (OSD-SSD)
>>>>> Only 0.3 DWPD, 450TB total over 5 years.
>>>>> If you can correctly predict your write volume and it is below that
>>>>> per SSD, fine. I'd use 3610s, with internal journals.
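
For reference, a minimal sketch of the endurance math behind that figure (DWPD = full drive writes per day, over the 5-year warranty period):

CAPACITY_TB = 0.8    # 800GB DC S3510
DWPD = 0.3           # rated drive writes per day
YEARS = 5

endurance_tb = CAPACITY_TB * DWPD * 365 * YEARS
print(f"rated endurance: ~{endurance_tb:.0f} TB written")  # ~438 TB, i.e. roughly the 450TB above
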
>>>>> 
>>>>>>   3x 120GB SSD Intel DC S3500 (Journals)
>>>>> In this case, even more so, the S3500 is a bad choice: 3x 135MB/s is
>>>>> nowhere near your likely network speed of 10Gb/s.
>>>>> 
>>>>> You will get vastly superior performance and endurance with two 200GB
>>>>> S3610 (2x 230MB/s) or S3700 (2x365 MB/s)
>>>>> 
>>>>> Why the uneven number of journal SSDs?
>>>>> You want uniform utilization and wear. 2 journal SSDs for 6 HDDs would
>>>>> be a good ratio.
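
A small sketch of the journal throughput comparison, using the sequential-write figures quoted in this message and a 10Gb/s network link:

NET_MBPS = 10_000 / 8    # a 10Gb/s link is roughly 1250 MB/s

options = {
    "3x 120GB S3500": 3 * 135,   # MB/s journal write ceiling per option
    "2x 200GB S3610": 2 * 230,
    "2x 200GB S3700": 2 * 365,
}
for name, mbps in options.items():
    print(f"{name}: {mbps} MB/s ({mbps / NET_MBPS:.0%} of a 10Gb/s link)")

# With 6 HDDs per node, 2 journal SSDs also give an even 3 HDDs per
# journal device, keeping utilization and wear uniform.
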
>>>>> 
>>>>>>   5x 3TB SAS disks (OSD-SAS)
>>>>>> 
>>>>> See above, even numbers make a lot more sense.
>>>>> 
>>>>>> 
>>>>>> The OSD servers are located on two separate Racks with two power
>>>>>> circuits each.
>>>>>> 
>>>>>>  I would like to know what the best way to implement this is: use
>>>>>> the 4x 800GB SSDs as an SSD pool, or use them as a cache pool? Or
>>>>>> any other suggestion? Also, any advice on the CRUSH design?
>>>>>> 
>>>>> Nick touched on that already; for right now, SSD pools would
>>>>> definitely be better.
>>>>> 
>>>>> Christian
>>>>> --
>>>>> Christian Balzer        Network/Systems Engineer
>>>>> chibi@xxxxxxx        Global OnLine Japan/Fusion
>>>>> Communications
>>>>> http://www.gol.com/
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> -- 
>>> Christian Balzer        Network/Systems Engineer                
>>> chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
>>> http://www.gol.com/
>> 
>> 
> 
> 
> -- 
> Christian Balzer        Network/Systems Engineer                
> chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
> http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


