Re: 6 Node cluster with 24 SSD per node: Hardwareplanning/ agreement

Matteo Dacrema <mdacrema@xxxxxxxx> · Fri, 11 Nov 2016 11:56:42 +0100

Hi,

After your tips and considerations, I’m planning to use this hardware configuration:

- 4x OSD (for starting the project):
1x Intel E5-1630v4 @ 4.00 GHz with turbo, 4 cores, 8 threads, 10MB cache
128GB RAM (does frequency matter in terms of performance?)
4x Intel P3700 2TB NVMe
2x Mellanox ConnectX-3 Pro 40Gbit/s

- 3x MON:
1x Intel E5-1630v4
64GB RAM
2x Intel S3510 SSD
2x Mellanox ConnectX-3 Pro 10Gbit/s

What do you think about it?
I don’t know whether this CPU works well with a Ceph workload, or whether it would be better to use 4x Samsung SM863 1.92TB rather than the Intel P3700s.
I’m considering placing the journal inline.

Thanks
Matteo 

On 11 Oct 2016, at 03:04, Christian Balzer <chibi@xxxxxxx> wrote:

Hello,

On Mon, 10 Oct 2016 14:56:40 +0200 Matteo Dacrema wrote:

Hi,

I’m planning a similar cluster.
Because it’s a new project, I’ll start with only a 2-node cluster, each node with:

As Wido said, that's a very dense and risky proposition for a first-time
cluster.
Never mind that the lack of a 3rd node for 3 MONs is begging for Murphy to
come and smite you.
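
To make the point about the third MON concrete, here is a minimal sketch of the majority arithmetic. It assumes only that monitors need a strict majority to form a quorum; the function names are made up for illustration and are not part of any Ceph tool.

```python
def quorum_size(num_mons: int) -> int:
    """Smallest strict majority of num_mons monitors."""
    return num_mons // 2 + 1


def tolerated_failures(num_mons: int) -> int:
    """How many monitors can fail while a majority is still reachable."""
    return num_mons - quorum_size(num_mons)


# Illustrative monitor counts; 2 is what a 2-node cluster would end up with.
for n in (1, 2, 3, 5):
    print(f"{n} MON(s): quorum needs {quorum_size(n)}, "
          f"tolerates {tolerated_failures(n)} failure(s)")
```

With two monitors the majority is two, so losing either one stalls the cluster; three monitors is the smallest count that survives a single failure.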

While I understand the need/wish to save money and space by maximizing
density, that only works sort of when you have plenty of such nodes to
begin with.

Your proposed setup isn't cheap to begin with; consider alternatives like
the one I'm pointing out below.

2x E5-2640v4 with 40 threads total @ 3.40GHz with turbo
Spendy and still potentially overwhelmed when dealing with small write
IOPS.

24x 1.92 TB Samsung SM863 
Should be fine, but keep in mind that with inline journals they will only
have about a 1.5 DWPD endurance.
At about 5.7GB/s write bandwidth, that is not a total mismatch to your 4GB/s
network link (unless those 2 ports are MC-LAG, giving you 8GB/s).
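
The endurance and bandwidth figures above both follow from the same idea: with the journal on the same SSD as the data, each client write lands on the device twice. A rough sketch of that arithmetic follows. The factor of 2, the 3.0 DWPD rating and the 0.475 GB/s per-drive write speed are assumptions picked to reproduce the numbers in this thread, not vendor specifications.

```python
# Assumption: an inline (co-located) journal means every client write is
# written twice to the same SSD, once to the journal and once to the data.
JOURNAL_WRITE_FACTOR = 2


def effective_dwpd(rated_dwpd: float) -> float:
    """Client-visible drive writes per day once journal writes are counted."""
    return rated_dwpd / JOURNAL_WRITE_FACTOR


def node_client_write_gbs(num_ssds: int, per_ssd_write_gbs: float) -> float:
    """Aggregate client write bandwidth of one node with inline journals."""
    return num_ssds * per_ssd_write_gbs / JOURNAL_WRITE_FACTOR


# Hypothetical inputs, chosen only to match the figures quoted in the thread.
print(effective_dwpd(3.0))
print(round(node_client_write_gbs(24, 0.475), 2))
```

Plugging in the real rated endurance and sustained write speed of whichever drive is chosen gives the numbers to hold against the network link.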

128GB RAM
3x LSI 3008 in IT mode / HBA for OSD - 1 for each 8 OSDs/SSDs
Also not free, they need to be on the latest FW and kernel version to work
reliably with SSDs.

2x SSD for OS
2x 40Gbit/s NIC

Consider basing your cluster on two of these 2U 4node servers:
https://www.supermicro.com.tw/products/system/2U/2028/SYS-2028TP-HTTR.cfm

Built-in dual 10Gb/s, the onboard SATA works nicely with SSDs, you can get
better matched CPU(s).

10Gb/s MC-LAG (white box) switches are also widely available and
affordable.

So 8 nodes instead of 2, in the same space.

Of course running a cluster (even with well monitored and reliable SSDs)
with a replication of 2 has risks (and that risk increases with the size of
the SSDs), so you may want to reconsider that.

Christian

What about this hardware configuration? Is it wrong, or am I missing something?

Regards
Matteo

On 06 Oct 2016, at 13:52, Denny Fuchs <linuxmail@xxxxxxxx> wrote:

Good morning,

* 2 x SN2100 100Gb/s Switch 16 ports
Which incidentally is a half sized (identical HW really) Arctica 3200C.

I've really never heard of them :-) (and couldn't find a price range in €/$)

* 10 x ConnectX 4LX-EN 25Gb card for hypervisor and OSD nodes
[...]

You haven't commented on my rather lengthy mail about your whole design,
so to reiterate:

Maybe I accidentally skipped it, so much new input :-) sorry

The above will give you a beautiful, fast (but I doubt you'll need the
bandwidth for your DB transactions), low latency and redundant network
(these switches do/should support MC-LAG). 

Yep, they do MLAG (with the 25Gbit version of the CX4 NICs)

In more technical terms, your network as depicted above can handle under
normal circumstances around 5GB/s, while your OSD nodes can't write more
than 1GB/s.
Massive, wasteful overkill.

Before we started planning the Ceph / new hypervisor design, we were sure that our network would be more powerful than we need in the near future. Our applications / DB have never used the full 1GB/s in any way ... we are losing speed on the plain (painful LANCOM) switches and in the applications (mostly Perl, written at the beginning of 2005).
But anyway, the network should have enough capacity for the next few years, because it is much more complicated to change network (design) components than to kick out a node.

With a 2nd NVMe in there you'd be at 2GB/s, or simple overkill.

We would buy them ... so that in the end, every 12 disks have a separate NVMe

With decent SSDs and in-line journals (400GB DC S3610s) you'd be at 4.8
GB/s, a perfect match.
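
The comparison running through the last few paragraphs can be written down as a small sketch: client write throughput is capped by whichever of network and journal/storage is slower. The figures (about 5 GB/s network, 1 GB/s with one NVMe journal, 2 GB/s with two, 4.8 GB/s with inline journals on SSDs) are the ones quoted in this thread; the function names are made up for illustration.

```python
NETWORK_GBS = 5.0  # network figure quoted in the thread


def client_write_ceiling(storage_gbs: float, network_gbs: float = NETWORK_GBS) -> float:
    """Writes cannot go faster than the slower of storage and network."""
    return min(storage_gbs, network_gbs)


def network_utilisation(storage_gbs: float, network_gbs: float = NETWORK_GBS) -> float:
    """Fraction of the network the storage side can actually fill."""
    return client_write_ceiling(storage_gbs, network_gbs) / network_gbs


# Storage-side write limits as given in the thread, in GB/s.
scenarios = {
    "1 NVMe journal": 1.0,
    "2 NVMe journals": 2.0,
    "inline journals on SSDs": 4.8,
}
for name, gbs in scenarios.items():
    print(f"{name}: ceiling {client_write_ceiling(gbs)} GB/s, "
          f"{network_utilisation(gbs):.0%} of the network")
```

This only covers steady-state client writes; recovery and rebalancing traffic are not modelled here.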

What about the worst case, where two nodes break and have to be fixed or replaced? I have read (a lot) that some Ceph users had massive problems while the rebuild was running.

Of course if your I/O bandwidth needs are actually below 1GB/s at all times
and all you care about is reducing latency, a single NVMe journal will be
fine (but also be a very obvious SPoF).

You're putting your finger right on the sore spot: SPoF ... is a very hard thing ... so we try to plan everything redundantly :-)

The bad side of life: the SSD itself. A consumer SSD costs around 70-80€, a DC SSD jumps up to 120-170€. My nightmare is a lot of SSDs dying at the same time ... -> arghh

But, we are working on it :-)

I'm searching for an alternative to the Asus board with more PCIe slots, and maybe for some other components: a better CPU with 3.5GHz and up; maybe a mix of SSDs ...

At this time, I've found the X10DRi:

https://www.supermicro.com/products/motherboard/xeon/c600/x10dri.cfm

and I think we will use the E5-2637v4 :-)

cu denny

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
