Hello,

On Wed, 4 Feb 2015 09:20:24 +0000 Colombo Marco wrote:

> Hi Christian,
>
> On 04/02/15 02:39, "Christian Balzer" <chibi@xxxxxxx> wrote:
>
> >On Tue, 3 Feb 2015 15:16:57 +0000 Colombo Marco wrote:
> >
> >> Hi all,
> >> I have to build a new Ceph storage cluster. After I've read the
> >> hardware recommendations and some mail from this mailing list, I
> >> would like to buy these servers:
> >>
> >
> >Nick mentioned a number of things already I totally agree with, so
> >don't be surprised if some of this feels like a repeat.
> >
> >> OSD:
> >> SSG-6027R-E1R12L ->
> >> http://www.supermicro.nl/products/system/2U/6027/SSG-6027R-E1R12L.cfm
> >> Intel Xeon e5-2630 v2
> >> 64 GB RAM
> >As Nick said, v3 and more RAM might be helpful, and depending on your
> >use case (small writes versus large ones) even faster CPUs as well.
>
> Ok, we switch from v2 to v3 and from 64 to 96 GB of RAM.
>
> >> LSI 2308 IT
> >> 2 x SSD Intel DC S3700 400GB
> >> 2 x SSD Intel DC S3700 200GB
> >Why the separation of SSDs?
> >They aren't going to be that busy with regards to the OS.
>
> We would like to use the 400GB SSDs for a cache pool, and the 200GB
> SSDs for the journaling.
>
Don't, at least not like that.

First and foremost, SSD based OSDs/pools have different requirements,
especially when it comes to CPU. Mixing your HDD and SSD based OSDs in
the same chassis is generally a bad idea.

If you really want to use SSD based OSDs, go at least with Giant,
probably better even to wait for Hammer. Otherwise your performance
will be nowhere near the investment you're making.
Read up in the ML archives about SSD based clusters and their
performance, as well as cache pools.

Which brings us to the second point: cache pools are pretty pointless
currently when it comes to performance. So unless you're planning to
use EC pools, you will gain very little from them.

Lastly, if you still want to do SSD based OSDs, go for something like
this:
http://www.supermicro.com.tw/products/system/2U/2028/SYS-2028TP-DC0TR.cfm
Add the fastest CPUs you can afford and voila, instant SSD based
cluster (replication of 2 should be fine with DC S3700).
Now with _this_ particular type of nodes, you might want to consider
40GbE links (front and back-end).

>
> >Get a case like Nick mentioned with 2 2.5" bays in the back, put 2 DC
> >S3700 400GBs in there (connected to onboard 6Gb/s SATA3), partition
> >them so that you have a RAID1 for the OS and plain partitions for the
> >journals of the now 12 OSD HDDs in your chassis.
> >Of course this optimization in terms of cost and density comes with a
> >price: if one SSD should fail, you will have 6 OSDs down.
> >Given how reliable the Intels are this is unlikely, but something you
> >need to consider.
> >
> >If you want to limit the impact of an SSD failure and have just 2 OSD
> >journals per SSD, get a chassis like the one above and 4 DC S3700
> >200GB, RAID10 them for the OS and put 2 journal partitions on each.
> >
> >I did the same with 8 3TB HDDs and 4 DC S3700 100GB; the HDDs (and
> >CPU with 4KB IOPS) are the limiting factor, not the SSDs.
> >
> >> 8 x HDD Seagate Enterprise 6TB
> >Are you really sure you need that density? One disk failure will
> >result in a LOT of data movement once these become somewhat full.
> >If you were to go for a 12 OSD node as described above, consider 4TB
> >ones for the same overall density, while having more IOPS and likely
> >the same price or less.
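
To put rough numbers on that 8 x 6TB vs 12 x 4TB comparison, a minimal
back-of-the-envelope sketch (the ~80 IOPS per 7.2k spindle is an assumed
rule of thumb, not a figure from this thread):

  # Back-of-the-envelope comparison of the two OSD node layouts above:
  # 8 x 6TB vs 12 x 4TB per chassis. The per-spindle IOPS figure is an
  # assumed rule of thumb for 7.2k nearline SATA drives, not a measurement.
  IOPS_PER_7200RPM_HDD = 80

  def node_summary(num_hdds, tb_per_hdd):
      raw_tb = num_hdds * tb_per_hdd
      iops = num_hdds * IOPS_PER_7200RPM_HDD
      return raw_tb, iops

  for hdds, size in ((8, 6), (12, 4)):
      raw, iops = node_summary(hdds, size)
      print(f"{hdds:2d} x {size}TB: {raw} TB raw, ~{iops} random IOPS per node")

  # Both layouts give 48 TB raw per node, but the 12-spindle version has
  # ~50% more aggregate IOPS, and a single failed disk means rebalancing
  # 4TB instead of 6TB of data.
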
>
> We chose the 6TB disks because we need a lot of storage in a small
> number of servers, and we prefer servers without too many disks.
> However, we plan to use at most 80% of each 6TB disk.
>
Fewer disks, fewer IOPS, less bandwidth.

Reducing the number of servers (which are a fixed cost after all) is
understandable. But you have an option up there that gives you the same
density as with the 6TB disks, but with significantly improved
performance.

>
> >> 2 x 40GbE for backend network
> >You'd be lucky to write more than 800MB/s sustained to your 8 HDDs
> >(remember they will have to deal with competing reads and writes,
> >this is not a sequential synthetic write benchmark).
> >Incidentally, 1GB/s to 1.2GB/s (depending on configuration) would
> >also be the limit of your journal SSDs.
> >Other than backfilling caused by cluster changes (OSD removed/added),
> >your limitation is nearly always going to be IOPS, not bandwidth.
> >
>
> Ok, after some discussion, we switch to 2 x 10 GbE.
>
> >So 2x10GbE, or if you're comfortable with it (I am ^o^) an Infiniband
> >backend (can be cheaper, less latency, plans for RDMA support in
> >Ceph), should be more than sufficient.
> >
> >> 2 x 10GbE for public network
> >>
> >> META/MON:
> >>
> >> SYS-6017R-72RFTP ->
> >> http://www.supermicro.com/products/system/1U/6017/SYS-6017R-72RFTP.cfm
> >> 2 x Intel Xeon e5-2637 v2
> >> 4 x SSD Intel DC S3500 240GB raid 1+0
> >You're likely to get better performance and of course MUCH better
> >durability by using 2 DC S3700, at about the same price.
>
> Ok, we switch to 2 x SSD DC S3700.
>
> >> 128 GB RAM
> >Total overkill for a MON, but I have no idea about MDS and RAM never
> >hurts.
>
> Ok, we switch from 128 to 96.
>
Don't take my word for this; again, I have no idea about MDS and what
it needs to be happy.

Christian

> >
> >In your follow-up you mentioned 3 mons. I would suggest putting 2
> >more mons (only, not MDS) on OSD nodes and make sure that within the
> >IP numbering the "real" mons have the lowest IP addresses, because
> >the MON with the lowest IP becomes master (and thus the busiest).
> >This way you can survive a loss of 2 nodes and still have a valid
> >quorum.
>
> Ok, got it.
>
> >Christian
> >
> >> 2 x 10 GbE
> >>
> >> What do you think?
> >> Any feedback, advice, or ideas are welcome!
> >>
> >> Thanks so much
> >>
> >> Regards,
> >
> >
> >--
> >Christian Balzer        Network/Systems Engineer
> >chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
> >http://www.gol.com/
>
> Thanks so much!
>

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
http://www.gol.com/
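
As a footnote to the 40GbE-vs-10GbE discussion above, a rough sketch of
the per-node sustained write ceiling for the 12-HDD / 2-journal-SSD
layout. All throughput figures are ballpark assumptions (roughly
100 MB/s sustained per 7.2k HDD, ~460 MB/s sequential write per DC
S3700 400GB), not measurements from this thread:

  # Rough per-node sustained write ceiling for the 12-HDD node with two
  # journal SSDs, to illustrate why 2 x 10GbE is plenty. All throughput
  # figures below are ballpark assumptions, not measurements.
  hdd_write_mbs   = 100    # assumed sustained write per 7.2k HDD, MB/s
  ssd_journal_mbs = 460    # assumed seq. write per DC S3700 400GB, MB/s
  nic_10gbe_mbs   = 1250   # ~10 Gbit/s on the wire, MB/s

  num_hdds, num_journal_ssds, num_nics = 12, 2, 2

  ceiling = min(
      num_hdds * hdd_write_mbs,           # what the spinning disks can absorb
      num_journal_ssds * ssd_journal_mbs, # every write also hits a journal SSD
      num_nics * nic_10gbe_mbs,           # back-end network bandwidth
  )
  print(f"per-node sustained write ceiling: ~{ceiling} MB/s")
  # -> ~920 MB/s, set by the journal SSDs and well below the ~2500 MB/s
  #    of 2 x 10GbE, so 2 x 40GbE would sit idle; in normal operation
  #    IOPS, not bandwidth, is the real limit.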