Re: 6 Node cluster with 24 SSD per node: Hardware planning / agreement

Hi Matteo,

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Matteo Dacrema
> Sent: 11 November 2016 10:57
> To: Christian Balzer <chibi@xxxxxxx>
> Cc: ceph-users@xxxxxxxxxxxxxx
> Subject: Re:  6 Node cluster with 24 SSD per node: Hardware planning / agreement
> 
> Hi,
> 
> after your tips and consideration I’ve planned to use this hardware configuration:
> 
> - 4x OSD ( for starting the project):
> • 1x Intel E5-1630v4 @ 4.00GHz with turbo, 4 cores, 8 threads, 10MB cache
> • 128GB RAM ( does frequency matter in terms of performance ? )
> • 4x Intel P3700 2TB NVMe
> • 2x Mellanox Connect-X 3 Pro 40gbit/s

I would maybe try and look at the higher core count E5-1600 v4 parts; they might give you a bit more total performance, which you will need for NVMe. I recently did some testing on roughly how many MHz each Ceph IO needs.

http://www.sys-pro.co.uk/how-many-mhz-does-a-ceph-io-need/

The figure will probably vary significantly depending on several factors, but it might be handy as a rough guide.
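
To make that concrete, here is a minimal back-of-envelope sketch (Python); the MHz-per-IO and IOPS figures below are placeholder assumptions, not results from the article, so substitute your own measurements:

# Rough CPU sizing for small-write IOPS; all numbers here are assumptions.
MHZ_PER_IO = 2.0        # assumed CPU cost per Ceph IO, measure on your own hardware
TARGET_IOPS = 100_000   # hypothetical per-node small-write target
CORE_CLOCK_MHZ = 3700   # rough all-core turbo clock of the CPU under consideration

cores_needed = TARGET_IOPS * MHZ_PER_IO / CORE_CLOCK_MHZ
print(f"~{cores_needed:.1f} cores busy at {TARGET_IOPS} IOPS")   # ~54 cores with these numbers

Even if the real MHz-per-IO figure is a fraction of that, you can see how quickly NVMe-class IOPS targets eat a 4 core / 8 thread part, which is why the higher core count SKUs are worth a look.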

You might also want to see if you can get your hands on any of the 25/50/100Gb/s networking gear. It is clocked a lot faster than the 10/40Gb/s products, so it will likely help with latency.

128GB of RAM may also be overkill. Extra RAM is always nice, but for 8TB of storage, 16GB is probably sufficient.
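
As a rough sanity check (again a sketch with assumed per-OSD figures, not hard numbers):

# Rule-of-thumb RAM sizing per OSD node; per-OSD usage varies by release and workload.
OSDS_PER_NODE = 4    # the 4x P3700 2TB in your proposed build
GB_PER_OSD = 2       # assumed steady-state memory per OSD daemon
GB_HEADROOM = 8      # OS, page cache and recovery headroom (also an assumption)

ram_gb = OSDS_PER_NODE * GB_PER_OSD + GB_HEADROOM
print(f"~{ram_gb} GB per node is comfortable")   # ~16 GB, so 128 GB is mostly spare

That said, spare RAM ends up as page cache, so it is not wasted, just probably not where your money is best spent.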

> 
> - 3 x MON:
> • 1x Intel E5-1630v4
> • 64GB RAM
> • 2 x Intel S3510 SSD
> • 2x Mellanox Connect-X 3 Pro 10gbit/s

This looks fine for the Mons.

> 
> What do you think about?
> I don’t know if this CPU works well with the Ceph workload, and whether it’s better to use 4x Samsung SM863 1.92TB rather than the Intel P3700.
> I’ve considered placing the journals inline.
> 
> Thanks
> Matteo
> 
> Il giorno 11 ott 2016, alle ore 03:04, Christian Balzer <mailto:chibi@xxxxxxx> ha scritto:
> 
> 
> Hello,
> 
> On Mon, 10 Oct 2016 14:56:40 +0200 Matteo Dacrema wrote:
> 
> 
> Hi,
> 
> I’m planning a similar cluster.
> Because it’s a new project I’ll start with only a 2 node cluster, each with:
> As Wido said, that's a very dense and risky proposition for a first time
> cluster.
> Never mind the lack of 3rd node for 3 MONs is begging for Murphy to come
> and smite you.
> 
> While I understand the need/wish to save money and space by maximizing
> density, that only works sort of when you have plenty of such nodes to
> begin with.
> 
> Your proposed setup isn't cheap to begin with, consider alternatives like
> the one I'm pointing out below.
> 
> 
> 2x E5-2640v4 with 40 threads total @ 3.40GHz with turbo
> Spendy and still potentially overwhelmed when dealing with small write
> IOPS.
> 
> 
> 24x 1.92 TB Samsung SM863
> Should be fine, but keep in mind that with inline journals they will only
> have about a 1.5 DWPD endurance.
> At about 5.7GB/s write bandwidth not a total mismatch to your 4GB/s
> network link (unless those 2 ports are MC-LAG, giving you 8GB/s).
> 
> 
> 128GB RAM
> 3x LSI 3008 in IT mode / HBA for OSDs - one per 8 OSDs/SSDs
> Also not free, they need to be on the latest FW and kernel version to work
> reliably with SSDs.
> 
> 
> 2x SSD for OS
> 2x 40Gbit/s NIC
> 
> Consider basing your cluster on two of these 2U 4node servers:
> https://www.supermicro.com.tw/products/system/2U/2028/SYS-2028TP-HTTR.cfm
> 
> Built-in dual 10Gb/s, the onboard SATA works nicely with SSDs, you can get
> better matched CPU(s).
> 
> 10Gb/s MC-LAG (white box) switches are also widely available and
> affordable.
> 
> So 8 nodes instead of 2, in the same space.
> 
> Of course running a cluster (even with well monitored and reliable SSDs)
> with a replication of 2 has risks (and that risk increases with the size of
> the SSDs), so you may want to reconsider that.
> 
> Christian
> 
> 
> What about this hardware configuration? Is that wrong or I’m missing something ?
> 
> Regards
> Matteo
> 
> 
> Il giorno 06 ott 2016, alle ore 13:52, Denny Fuchs <mailto:linuxmail@xxxxxxxx> ha scritto:
> 
> Good morning,
> 
> 
> * 2 x SN2100 100Gb/s Switch 16 ports
> Which incidentally is a half sized (identical HW really) Arctica 3200C.
> 
> really never heard of them :-) (and didn't find any price in the €/$ region)
> 
> 
> 
> * 10 x ConnectX 4LX-EN 25Gb card for hypervisor and OSD nodes
> [...]
> 
> 
> You haven't commented on my rather lengthy mail about your whole design,
> so to reiterate:
> 
> maybe I accidentally skipped it, so much new input :-) sorry
> 
> 
> The above will give you a beautiful, fast (but I doubt you'll need the
> bandwidth for your DB transactions), low latency and redundant network
> (these switches do/should support MC-LAG).
> 
> Yep, they do MLAG (with the 25Gbit version of the CX4 NICs)
> 
> 
> In more technical terms, your network as depicted above can handle under
> normal circumstances around 5GB/s, while your OSD nodes can't write more
> than 1GB/s.
> Massive, wasteful overkill.
> 
> before we started planning the Ceph / new hypervisor design, we were sure that our network would be more powerful than we
> need in the near future. Our applications / DB never used the full 1Gb/s in any way ... we're losing speed on the plain (painful LANCOM)
> switches and the applications (mostly Perl, written back in 2005).
> But anyway, the network should have enough capacity for the next few years, because it is much more complicated to change network
> (design) components than to kick a node.
> 
> 
> With a 2nd NVMe in there you'd be at 2GB/s, or simple overkill.
> 
> We would buy them ... so that in the end, every 12 disks have a separate NVMe
> 
> 
> With decent SSDs and in-line journals (400GB DC S3610s) you'd be at 4.8
> GB/s, a perfect match.
> 
> What about the worst case, two nodes are broken, fixed and replaced? I read (a lot) that some Ceph users had massive problems
> while the rebuild runs.
> 
> 
> 
> Of course if your I/O bandwidth needs are actually below 1GB/s at all times
> and all your care about is reducing latency, a single NVMe journal will be
> fine (but also be a very obvious SPoF).
> 
> Very happy to put the finger in the wound; a SPoF ... is a very hard thing ... so we try to plan everything redundant :-)
> 
> The bad side of life: the SSD itself. A consumer SSD is around 70-80€, a DC SSD jumps up to 120-170€. My nightmare is: a lot of
> SSDs jumping over the bridge at the same time .... -> arghh
> 
> But, we are working on it :-)
> 
> I've been searching for an alternative to the Asus board with more PCIe slots and maybe some other components; a better CPU with 3.5GHz+;
> maybe a mix of SSDs ...
> 
> At this time, I've found the X10DRi:
> 
> https://www.supermicro.com/products/motherboard/xeon/c600/x10dri.cfm
> 
> and I think we use the E5-2637v4 :-)
> 
> cu denny
> 
> 
> 
> 
> 
> --
> Christian Balzer        Network/Systems Engineer
> mailto:chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
> 


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



