Re: 6 Node cluster with 24 SSD per node: Hardwareplanning/ agreement

Christian Balzer <chibi@xxxxxxx> · Tue, 11 Oct 2016 10:04:56 +0900

Hello,

On Mon, 10 Oct 2016 14:56:40 +0200 Matteo Dacrema wrote:

> Hi,
> 
> I’m planning a similar cluster.
> Because it's a new project I'll start with only a 2-node cluster, with each node having:
> 
As Wido said, that's a very dense and risky proposition for a first-time
cluster.
Never mind that the lack of a 3rd node for 3 MONs is begging for Murphy to
come and smite you.
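
To make the MON point concrete, here is a minimal sketch of the quorum
arithmetic. It assumes only that monitors need a strict majority to form a
quorum; the function names are made up for illustration and are not part of
any Ceph tool.

```python
def quorum_size(mons: int) -> int:
    """Smallest strict majority of `mons` monitors."""
    return mons // 2 + 1


def tolerated_failures(mons: int) -> int:
    """How many monitors can fail while a majority still survives."""
    return mons - quorum_size(mons)


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(f"{n} MON(s): quorum needs {quorum_size(n)}, "
              f"survives {tolerated_failures(n)} failure(s)")
```

Under that assumption, 2 MONs tolerate no more failures than 1 does, which is
why the 3rd node matters.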

While I understand the need/wish to save money and space by maximizing
density, that only sort of works when you have plenty of such nodes to
begin with.

Your proposed setup isn't cheap to begin with; consider alternatives like
the one I'm pointing out below.

> 2x E5-2640v4 with 40 threads total @ 3.40GHz with turbo
Spendy and still potentially overwhelmed when dealing with small write
IOPS.

> 24x 1.92 TB Samsung SM863 
Should be fine, but keep in mind that with inline journals (every write
hits the drive twice) they will only have about a 1.5 DWPD endurance.
At about 5.7GB/s write bandwidth they are not a total mismatch for your
4GB/s network link (unless those 2 ports are MC-LAG, giving you 8GB/s).
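
The arithmetic behind those two figures can be sketched roughly as below.
The rated endurance (3 DWPD) and the per-drive write speed (475 MB/s) are
assumptions picked so the result lines up with the numbers above, so check
them against the actual datasheet. The factor of 2 reflects each write
landing on the same SSD twice (journal, then data).

```python
SSDS_PER_NODE = 24          # from the proposed node spec
JOURNAL_WRITE_FACTOR = 2    # inline journal: journal write + data write
RATED_DWPD = 3.0            # ASSUMPTION: vendor-rated drive writes per day
RAW_WRITE_MB_S = 475        # ASSUMPTION: sustained write speed per SSD
LINK_GB_S = 4.0             # usable bandwidth of one 40Gbit/s port, as above


def effective_dwpd(rated_dwpd: float, write_factor: int) -> float:
    """Endurance left for client data once journal writes are counted."""
    return rated_dwpd / write_factor


def node_write_gb_s(ssds: int, raw_mb_s: float, write_factor: int) -> float:
    """Client-visible write bandwidth of one node, in GB/s."""
    return ssds * raw_mb_s / write_factor / 1000


if __name__ == "__main__":
    node = node_write_gb_s(SSDS_PER_NODE, RAW_WRITE_MB_S, JOURNAL_WRITE_FACTOR)
    print(f"effective endurance: "
          f"{effective_dwpd(RATED_DWPD, JOURNAL_WRITE_FACTOR):.1f} DWPD")
    print(f"node write bandwidth: {node:.1f} GB/s")
    for links in (1, 2):  # 2 = both ports bonded via MC-LAG
        net = links * LINK_GB_S
        print(f"{links} link(s): network {net:.0f} GB/s, "
              f"SSD/network ratio {node / net:.2f}")
```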

> 128GB RAM
> 3x LSI 3008 in IT mode / HBA for OSDs - 1 per 8 OSDs/SSDs
Also not free, and they need to be on the latest FW and kernel version to
work reliably with SSDs.

> 2x SSD for OS
> 2x 40Gbit/s NIC
> 
> 
Consider basing your cluster on two of these 2U 4node servers:
https://www.supermicro.com.tw/products/system/2U/2028/SYS-2028TP-HTTR.cfm

Built-in dual 10Gb/s, the onboard SATA works nicely with SSDs, and you can
get better matched CPU(s).

10Gb/s MC-LAG (white box) switches are also widely available and
affordable.

So 8 nodes instead of 2, in the same space.

Of course running a cluster (even with well monitored and reliable SSDs)
with a replication of 2 has risks (and that risk increases with the size of
the SSDs), so you may want to reconsider that.
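
To put rough numbers on the node-count and replication points, here is a
small sketch. It assumes equally sized nodes and a failure domain of "host"
(each replica on a different node); the function name is hypothetical.

```python
def node_failure_impact(nodes: int, replicas: int) -> tuple[float, bool]:
    """Return (fraction of cluster lost, whether full replication can be restored)
    after a single node fails.

    ASSUMPTIONS: all nodes hold the same share of data, and every replica
    must live on a different node.
    """
    lost_fraction = 1 / nodes
    survivors = nodes - 1
    can_rereplicate = survivors >= replicas
    return lost_fraction, can_rereplicate


if __name__ == "__main__":
    for nodes in (2, 8):
        lost, ok = node_failure_impact(nodes, replicas=2)
        print(f"{nodes} nodes: one failure removes {lost:.1%} of the cluster, "
              f"can restore 2 replicas: {ok}")
```

With 2 nodes a single failure takes out half the cluster and leaves nowhere
to rebuild the second copy; with 8 nodes it is an eighth, and the survivors
can absorb the recovery (given enough free space).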

Christian

> What about this hardware configuration? Is it wrong, or am I missing something?
> 
> Regards
> Matteo
> 
> > On 06 Oct 2016, at 13:52, Denny Fuchs <linuxmail@xxxxxxxx> wrote:
> > 
> > Good morning,
> > 
> >>> * 2 x SN2100 100Gb/s Switch 16 ports
> >> Which incidentally is a half sized (identical HW really) Arctica 3200C.
> >  
> > really never heard of them :-) (and didn't find any price range in €/$)
> >  
> > 
> >>> * 10 x ConnectX 4LX-EN 25Gb card for hypervisor and OSD nodes
> > [...]
> > 
> >> You haven't commented on my rather lengthy mail about your whole design,
> >> so to reiterate:
> >  
> > maybe accidentally skipped, so much new input  :-) sorry
> > 
> >> The above will give you a beautiful, fast (but I doubt you'll need the
> >> bandwidth for your DB transactions), low latency and redundant network
> >> (these switches do/should support MC-LAG). 
> >  
> > Yep, they do MLAG (with the 25Gbit version of the cx4 NICs)
> >  
> >> In more technical terms, your network as depicted above can handle under
> >> normal circumstances around 5GB/s, while your OSD nodes can't write more
> >> than 1GB/s.
> >> Massive, wasteful overkill.
> >  
> > Before we started planning the Ceph / new hypervisor design, we were sure that our network would be more powerful than we need in the near future. Our applications / DB never used the full 1GBs in any way ... we are losing speed on the plain (painful LANCOM) switches and in the applications (mostly Perl, written at the beginning of 2005).
> > But anyway, the network should have enough capacity for the next years, because it is much more complicated to change network (design) components than to kick a node.
> >  
> >> With a 2nd NVMe in there you'd be at 2GB/s, or simple overkill.
> >  
> > We would buy them ... so that in the end, every 12 disks have a separate NVMe
> > 
> >> With decent SSDs and in-line journals (400GB DC S3610s) you'd be at 4.8
> >> GB/s, a perfect match.
> >  
> > What about the worst case: two nodes are broken, fixed and replaced? I read (a lot) that some Ceph users had massive problems while the rebuild was running.
> >  
> > 
> >> Of course if your I/O bandwidth needs are actually below 1GB/s at all times
> >> and all you care about is reducing latency, a single NVMe journal will be
> >> fine (but also be a very obvious SPoF).
> > 
> > Very happy to have a finger put in the wound; SPoF ... is a very hard thing ... so we try to plan everything redundantly  :-)
> >  
> > The bad side of life: the SSD itself. A consumer SSD costs roughly 70-80€, a DC SSD jumps up to 120-170€. My nightmare is a lot of SSDs dying at the same time .... -> arghh
> >  
> > But, we are working on it :-)
> >  
> > I'm searching for an alternative to the Asus board with more PCIe slots, and maybe some other components: a better CPU with 3.5GHz or more; maybe a mix of SSDs ...
> >  
> > At this time, I've found the X10DRi:
> >  
> > https://www.supermicro.com/products/motherboard/xeon/c600/x10dri.cfm
> >  
> > and I think we use the E5-2637v4 :-)
> >  
> >  cu denny
> >  
> > 
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/