Hi,
after your tips and considerations I plan to use this hardware configuration:
- 4x OSD nodes (to start the project), each with:
  - 1x Intel E5-1630v4 @ 4.00GHz with turbo, 4 cores / 8 threads, 10MB cache
  - 128GB RAM (does memory frequency matter in terms of performance?)
  - 4x Intel P3700 2TB NVMe
  - 2x Mellanox ConnectX-3 Pro 40Gbit/s
- 3x MON nodes, each with:
  - 1x Intel E5-1630v4
  - 64GB RAM
  - 2x Intel S3510 SSD
  - 2x Mellanox ConnectX-3 Pro 10Gbit/s
What do you think about it? I don’t know if this CPU works well with the Ceph workload, and whether it’s better to use 4x Samsung SM863 1.92TB rather than the Intel P3700. I’ve considered placing the journal inline.
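As a rough sanity check on the write path I did this back-of-the-envelope (just a sketch in Python; the per-drive sequential write figures are assumptions from memory of the datasheets, please correct me if they're off):

# Rough per-node write throughput with inline (co-located) journals.
# With the journal on the same device, every client write hits the drive
# twice (journal + data), so usable bandwidth is roughly halved.
# Per-drive sequential write speeds below are assumed, not measured.
options = {
    "4x Intel P3700 2TB (NVMe)": 4 * 1.9,       # ~1.9 GB/s each (assumed)
    "4x Samsung SM863 1.92TB (SATA)": 4 * 0.5,  # ~0.5 GB/s each (assumed)
}
network_gbs = 2 * 40 / 8  # 2x 40Gbit/s ports ~ 10 GB/s raw, less in practice

for name, raw in options.items():
    print(f"{name}: ~{raw:.1f} GB/s raw, ~{raw / 2:.1f} GB/s with inline journals")
print(f"Network: ~{network_gbs:.0f} GB/s raw across both ports")

If those numbers are roughly right, the SM863 option would be limited by the SSDs long before the 40Gbit links, while the P3700s would get closer to them.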
Thanks,
Matteo
On 11 Oct 2016, at 03:04, Christian Balzer <chibi@xxxxxxx> wrote:
Hello,

On Mon, 10 Oct 2016 14:56:40 +0200 Matteo Dacrema wrote:

Hi,
I’m planning a similar cluster. Because it’s a new project, I’ll start with only a 2-node cluster, each node with:
As Wido said, that's a very dense and risky proposition for a first-time cluster. Never mind that the lack of a 3rd node for 3 MONs is begging for Murphy to come and smite you.
While I understand the need/wish to save money and space by maximizing density, that only sort of works when you have plenty of such nodes to begin with.
Your proposed setup isn't cheap to begin with; consider alternatives like the one I'm pointing out below.

2x E5-2640v4 with 40 threads total @ 3.40GHz with turbo
Spendy and still potentially overwhelmed when dealing with small write IOPS.

24x 1.92TB Samsung SM863
Should be fine, but keep in mind that with inline journals they will only have about 1.5 DWPD of endurance.
At about 5.7GB/s of write bandwidth, not a total mismatch to your 4GB/s network link (unless those 2 ports are MC-LAG, giving you 8GB/s).
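Back of the envelope, if you want to check those figures (the per-SSD write speed and endurance class below are assumptions, look them up in the actual datasheet):

# Where the ballpark figures above come from (assumed values, not measurements).
ssds = 24
seq_write_gbs = 0.48   # assumed sequential write per SM863 1.92TB, GB/s
rated_dwpd = 3.0       # assumed endurance class for the SM863, check the datasheet

raw_bw = ssds * seq_write_gbs    # ~11.5 GB/s across all SSDs
effective_bw = raw_bw / 2        # inline journals write every byte twice -> ~5.7 GB/s
effective_dwpd = rated_dwpd / 2  # the same double write halves endurance -> ~1.5 DWPD

wire_gbs = 40 / 8                # a 40Gbit/s port is 5 GB/s on the wire, call it ~4 GB/s usable
print(f"~{effective_bw:.1f} GB/s effective writes, ~{effective_dwpd:.1f} DWPD effective, "
      f"{wire_gbs:.0f} GB/s per 40Gbit port on the wire")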
128GB RAM
3x LSI 3008 in IT mode / HBA for OSDs - 1 for each 8 OSDs/SSDs

Also not free, and they need to be on the latest firmware and kernel versions to work reliably with SSDs.

2x SSD for OS
2x 40Gbit/s NIC
Consider basing your cluster on two of these 2U 4-node servers:
https://www.supermicro.com.tw/products/system/2U/2028/SYS-2028TP-HTTR.cfm
Built-in dual 10Gb/s, the onboard SATA works nicely with SSDs, and you can get better matched CPU(s).
10Gb/s MC-LAG (white box) switches are also widely available and affordable.
So 8 nodes instead of 2, in the same space.
Of course running a cluster (even with well monitored and reliable SSDs) with a replication of 2 has risks (and that risk increases with the size of the SSDs), so you may want to reconsider that.

Christian

What about this hardware configuration? Is it wrong, or am I missing something?
Regards,
Matteo
On 6 Oct 2016, at 13:52, Denny Fuchs <linuxmail@xxxxxxxx> wrote:
Good morning,
* 2 x SN2100 100Gb/s Switch 16 ports
Which incidentally is a half sized (identical HW really) Arctica 3200C.
Really never heard of them :-) (and didn't find any price in the €/$ region)
* 10 x ConnectX 4LX-EN 25Gb card for hypervisor and OSD nodes
[...]
You haven't commented on my rather lengthy mail about your whole design, so to reiterate:
Maybe I accidentally skipped it, so much new input :-) sorry
The above will give you a beautiful, fast (but I doubt you'll need the bandwidth for your DB transactions), low latency and redundant network (these switches do/should support MC-LAG).
Yep, they do MLAG (with the 25Gbit version of the ConnectX-4 NICs)
In more technical terms, your network as depicted above can handle under normal circumstances around 5GB/s, while your OSD nodes can't write more than 1GB/s. Massive, wasteful overkill.
Before we started planning Ceph / the new hypervisor design, we were sure that our network should be more powerful than we'll need in the near future. Our applications / DB never used the full 1Gb/s in any way ... we lose speed on the plain (painful LANCOM) switches and in the applications (mostly Perl, written back around 2005). But anyway, the network should have enough capacity for the next few years, because it is much more complicated to change network (design) components than to kick out a node.
With a 2nd NVMe in there you'd be at 2GB/s, or simply overkill.
We would buy them ... so that in the end every 12 disks get a separate NVMe.
With decent SSDs and in-line journals (400GB DC S3610s) you'd be at 4.8 GB/s, a perfect match.
What about the worst case, two nodes broken, fixed and replaced? I've read (a lot) that some Ceph users had massive problems while the rebuild runs.
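To get a feeling for the worst case I tried a quick estimate of how much data a dead node forces the cluster to move (a pure sketch, every number below is an assumption for illustration):

# Very rough backfill estimate after losing one node.
# All inputs are assumptions for illustration, not measurements.
# With only 2 nodes and size=2 there is nowhere to backfill to at all,
# which is the bigger problem; the numbers below assume an 8-node layout.
node_ssds = 6                 # assumed SSDs per node in an 8-node layout
ssd_tb = 1.92                 # capacity per SSD, TB
utilisation = 0.6             # assumed fill level of the cluster
recovery_gbs = 0.5            # assumed throttled recovery rate, GB/s cluster-wide

data_to_move_tb = node_ssds * ssd_tb * utilisation   # copies that lived on the dead node
hours = data_to_move_tb * 1000 / recovery_gbs / 3600
print(f"~{data_to_move_tb:.1f} TB to backfill, roughly {hours:.1f} h degraded at {recovery_gbs} GB/s")

So the smaller the nodes, the less data has to move when one dies, which is another point for more, smaller nodes.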
Of course if your I/O bandwidth needs are actually below 1GB/s at all times and all you care about is reducing latency, a single NVMe journal will be fine (but will also be a very obvious SPoF).
Very happy that you put the finger in the wound; a SPoF ... is a very hard thing ... so we try to plan everything redundantly :-)
The bad side of life: the SSDs themselves. A consumer SSD costs around 70-80€, a DC SSD jumps up to 120-170€. My nightmare is a lot of SSDs failing at the same time .... -> arghh
But we are working on it :-)
I've been searching for an alternative to the Asus board, with more PCIe slots and maybe some other components: a better CPU with 3.5GHz+; maybe a mix of SSDs ...
At this time, I've found the X10DRi:
https://www.supermicro.com/products/motherboard/xeon/c600/x10dri.cfm
and I think we'll use the E5-2637v4 :-)
cu denny
--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/