Re: 6 Node cluster with 24 SSD per node: Hardware planning / agreement

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Denny Fuchs
> Sent: 04 October 2016 15:51
> To: ceph-users@xxxxxxxxxxxxxx
> Subject: Re:  6 Node cluster with 24 SSD per node: Hardware planning / agreement
> 
> Hi,
> 
> thanks for taking a look :-)
> 
> Am 04.10.2016 16:11, schrieb Nick Fisk:
> 
> >> We have two goals:
> >>
> >> * High availability
> >> * Short latency for our transaction services
> >
> > How Low? See below re CPU's
> 
> As low as possible without doing crazy stuff. We're thinking of putting
> the database on Ceph too, instead of on local SSDs in separate servers.

With Xeon E3 1245's (3.6GHz with all 4 cores Turbo'd), a P3700 journal and 10G networking I have managed to get it down to around
600-700us. Make sure you force P-states and C-states, as without that I was only getting about 2ms.
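
A quick way to sanity-check that the tuning stuck (a minimal sketch, assuming a
Linux host with the usual sysfs layout; the output format is just for
illustration):

#!/usr/bin/env python3
# Report the cpufreq governor and the enabled C-states for each CPU via sysfs,
# so you can verify that frequency/idle tuning actually took effect.
import glob, os

def read(path):
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return "n/a"

for cpu in sorted(glob.glob("/sys/devices/system/cpu/cpu[0-9]*")):
    gov = read(os.path.join(cpu, "cpufreq", "scaling_governor"))
    states = []
    for st in sorted(glob.glob(os.path.join(cpu, "cpuidle", "state*"))):
        name = read(os.path.join(st, "name"))
        off = read(os.path.join(st, "disable")) == "1"
        states.append(name + ("(off)" if off else ""))
    print("%s: governor=%s c-states=%s" % (os.path.basename(cpu), gov, ",".join(states)))

You want to see the "performance" governor and the deep C-states out of the
picture; exactly how you force that (kernel command line vs. runtime) depends
on your kernel, so check what works on your setup.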

> 
> 
> >> via the API, so a separate metadata server isn't needed, if I understand
> >> the documentation correctly.
> >
> > The metadata server is only needed for CephFS (the distributed filesystem);
> > for direct librados library calls or RBD (block devices) you only need
> > mon's and osd's.
> 
> perfect :-)
> 
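Just to illustrate the point, here is a minimal sketch of a client talking
straight to the cluster with the Python bindings (assumes python-rados and
python-rbd are installed, /etc/ceph/ceph.conf points at your mons, and a pool
called "rbd" exists; the pool name is just an example):

import rados
import rbd

# Connect using the cluster config; the client only ever talks to mons and OSDs.
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    ioctx = cluster.open_ioctx("rbd")     # example pool name
    try:
        print(rbd.RBD().list(ioctx))      # list RBD images, no MDS involved
    finally:
        ioctx.close()
finally:
    cluster.shutdown()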
> 
> >> All nodes are connected over cross to every switch, so if one switch
> >> goes down, a second path is available.
> >
> > Isn't that a 1G switch with a couple of 10G modules? Any reason you
> > can't get a pure 10G switch?
> 
> Right. The reason is we already have them, with stacking modules and dual
> power supplies, so we only need the 10Gbit backplane modules. That's it.

Ah ok, fair do's. Are the hypervisors connected via 10G as well, or will they be 1G? You want 10G end to end to get the lowest
latency.
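
Rough numbers on why that matters for small writes (back-of-the-envelope only;
the figures below ignore switching, kernel and Ceph overheads):

# Serialisation delay for a single 4KiB write, per network hop.
# Illustrative figures only; real latency adds switch, kernel and Ceph overhead.
SIZE_BITS = 4096 * 8

for name, gbps in [("1GbE", 1.0), ("10GbE", 10.0)]:
    usec = SIZE_BITS / (gbps * 1e9) * 1e6
    print("%s: ~%.1f us per hop just to put 4KiB on the wire" % (name, usec))

A replicated write crosses the network more than once (client to primary OSD,
then primary to the replicas), so a 1G hop anywhere in the path adds tens of
microseconds each way before Ceph has done any work.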

> 
> 
> >> * Disk:
> >> ** Storage: 24 x Crucial MX300 250GB (maybe for production 12xSSD /
> >> 12x
> >> big Sata disks)
> >
> > I would be very careful about using these. They are not enterprise
> > SSD's. I would go for either S3610 or S3510 if you will be doing
> > mainly reads.
> 
> There was a long, long discussion (also here on the list ...). I would
> also prefer enterprise SSDs, but they are too expensive. Maybe for storage
> we could use the Samsung 850 Pro series, or whatever fits in the same
> price range. I would personally prefer SSDs with power loss protection,
> so the Intel S3510 / S37xx would also fit and is on a second buy list.
> 
> 
> >> ** OSD journal: 1 x Intel SSD DC P3700 PCIe
> >
> > That will not be enough to journal 24x SSD's. Or is this just for the
> > SATA disks, with the SSD's having no separate journals? In which case it
> > will be fine.
> 
> Hmm, OK, I was fairly sure this question would come up ... yes, it would
> be for all the journals, unless you would say we don't need it and can put
> each OSD's journal on the OSD itself ...
>
> We would use the 400GB DC P3700 PCIe edition for journals.
> Otherwise, reading between the lines, we would need two of them to carry
> the journals for all the SSD drives.

If you were using S3610's or S3710's, I would say you might get away with co-located journals, with maybe a latency penalty
somewhere around 50us. However, I would think you definitely want to journal consumer-based SSD's on the P3700, otherwise they will
likely have a very short life. The 400GB P3700 will give you ~1000MB/s, which will roughly match your 10G network, so it might be
OK. Is that OK with you? Or would you want to be able to drive the SSD's harder in the future if you expand your networking? Also
think about failure domains: lose the P3700 and you lose all 24x SSD OSD's.
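
To put some rough numbers on it (filestore-style journalling on a shared NVMe
device; the throughput figures are assumptions for illustration, not
measurements of your hardware):

# Every write hits the journal first, so the shared journal device caps the
# node's write throughput alongside the network.
P3700_400G_WRITE_MBS = 1000.0        # assumed sequential write of the 400GB P3700
NET_10G_MBS = 10e9 / 8 / 1e6         # ~1250 MB/s raw, before protocol overhead
NUM_SSD_OSDS = 24

ceiling = min(P3700_400G_WRITE_MBS, NET_10G_MBS)
print("node write ceiling: ~%d MB/s" % ceiling)
print("per-OSD share with all %d busy: ~%d MB/s" % (NUM_SSD_OSDS, ceiling / NUM_SSD_OSDS))

So with one P3700 each SSD only gets ~40MB/s of journal bandwidth when the
whole node is busy, and the card is also a single point of failure for all 24
OSD's on that host.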

> 
> >> really needed in our case. Sure, the cache is one of the benefits, but
> >> maybe it is more complicated than a plain HBA.
> >
> > Yeah, RAID controllers can sometimes increase performance slightly due
> > to the write-back cache, but they can also get overwhelmed and end up
> > being slower. Especially with SSD's you are probably best off with a
> > plain HBA.
> 
> great to hear :-)
> 
> >> The OS would be Proxmox 4.x (based on Debian Jessie) with Hammer or
> >> Jewel, but WITHOUT ANY VMs on it. We want to keep the systems in one
> >> hand :-)
> >
> > Why are you going to run Proxmox with no VM's just for Ceph? What's
> > wrong with just Ubuntu or Debian?
> 
> Proxmox will become our main hypervisor, and Ceph is built-in technology
> with all the pieces we need. So in the end we have 6 OSD nodes and 4
> hypervisors, all under the Proxmox "umbrella".
> So documentation and maintenance are much easier, as everything is based
> on one platform.

Understood

> 
> >> So we want to know whether the hardware is also OK for running the mon
> >> servers on the same HW as the OSDs. We know that every OSD should have
> >> its own core; the 2620v4 has 8 cores, 16 with HT, so in sum we have 32
> >> logical CPUs per OSD node, which should be fine ... I think ...
> >
> > I would pay less attention to the number of cores vs. OSD's; instead
> > look at the total number of GHz and the number of IOPs you require.
> > I have been doing some testing recently and have come up with a figure
> > of around 1MHz per IO. I will be writing up a blog article with more
> > details in the near future.
> >
> > If you need a low number of IO's but with low latency, I would go with
> > a lower number of very fast cores (3.5GHz+). Otherwise, if you think you
> > will be generating hundreds of thousands of IO's, then you probably want
> > more cores and will have to accept the increased latency of the slower
> > cores as a compromise.
> 
> That is extremely good to know!! Most of the documentation talks about
> cores, not plain MHz.

For best low-latency performance, I would personally recommend scaling out with more nodes using high-clocked single-socket Xeon E3
or Xeon E5 16xx CPU's rather than going with big boxes full of high-core-count CPU's.
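
As a very rough sizing sketch using the ~1MHz-per-IO rule of thumb above (the
IOPs target and node count below are just example assumptions):

# Estimate CPU needed from a target IOPs figure using ~1MHz per IO.
MHZ_PER_IO = 1.0        # rule of thumb from my testing, write-up to follow
TARGET_IOPS = 100000    # example target, substitute your own
NODES = 6

total_ghz = TARGET_IOPS * MHZ_PER_IO / 1000.0
print("cluster-wide: ~%.0f GHz of CPU" % total_ghz)
print("per node: ~%.1f GHz" % (total_ghz / NODES))

At 100k IOPs that works out to roughly 100GHz across the cluster, or about
17GHz per node over 6 nodes, i.e. a handful of fast cores rather than lots of
slow ones, which is why the E3/E5-16xx route works well for latency.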

I used this board for my latest cluster; it has a lot on board, which saves buying add-on cards:

https://www.supermicro.com/products/motherboard/Xeon/C236_C232/X11SSH-CTF.cfm


> 
> 
> thank you for the comments :-)
> 
> cu denny
> 
> 

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


