Re: Cost- and Powerefficient OSD-Nodes

FYI, most Juniper switches hash LAGs on IP+port, so you'd get somewhat better performance than you would with simple MAC or IP hashing.  10G is better if you can afford it, though.
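
To illustrate (a toy sketch in Python, not the switch's actual hash): with
layer-3/4 hashing each TCP flow is pinned to one LAG member, so a single flow
never exceeds 1Gbit, but Ceph's many OSD connections get spread across the
links:

import hashlib

def lag_member(src_ip, dst_ip, src_port, dst_port, n_links=4):
    # Hash the flow 4-tuple and pick one of the bonded links (illustrative
    # only, not Juniper's real algorithm).
    key = "{}:{}-{}:{}".format(src_ip, src_port, dst_ip, dst_port).encode()
    return int(hashlib.md5(key).hexdigest(), 16) % n_links

# Two OSD connections between the same pair of hosts can land on different
# links because the ephemeral source ports differ.
print(lag_member("10.0.0.1", "10.0.0.2", 50111, 6800))
print(lag_member("10.0.0.1", "10.0.0.2", 50112, 6801))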

On Tue, Apr 28, 2015 at 9:55 AM Nick Fisk <nick@xxxxxxxxxx> wrote:




> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> Dominik Hannen
> Sent: 28 April 2015 17:08
> To: Nick Fisk
> Cc: ceph-users@xxxxxxxxxxxxxx
> Subject: Re: Cost- and Powerefficient OSD-Nodes
>
> >> Interconnect as currently planned:
> >> 4 x 1Gbit LACP Bonds over a pair of MLAG-capable switches (planned:
> >> EX3300)
>
> > If you can do 10GB networking its really worth it. I found that with
> > 1G, latency effects your performance before you max out the bandwidth.
> > We got some Supermicro servers with 10GB-T onboard for a tiny price
> > difference and some basic 10GB-T switches.
>
> I do not expect to max out the bandwidth. My estimate is that 200 MB/s
> read/write would be needed at most.
>
> From what I have read, the performance metric that suffers most would be
> IOPS? How many IOPS do you think will be possible with 8 x 4-OSD nodes on
> 4x1Gbit (distributed among all the clients, VMs, etc.)?

It's all about the total latency per operation. Over 10Gb, most IO sizes make
little difference to the round trip time, but comparatively even 128KB IOs
over 1Gb take quite a while. For example, ping a host with a 64KB payload
over 1Gb and 10Gb networks and look at the difference in times. Now double
this for Ceph (Client -> Primary OSD -> Secondary OSD).
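
As a rough back-of-the-envelope sketch (my own numbers in Python, ignoring
everything except serialisation time on the wire):

io_bytes = 128 * 1024
for name, gbit in (("1GbE", 1), ("10GbE", 10)):
    wire_ms = io_bytes * 8 / (gbit * 1e9) * 1000
    # Double for the replication hop: Client -> Primary OSD -> Secondary OSD.
    print("{}: ~{:.2f} ms one hop, ~{:.2f} ms with replication".format(
        name, wire_ms, 2 * wire_ms))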

When you are using SSD journals you normally end up with a write latency of
3-4ms over 10Gb; 1Gb networking will probably add another 2-4ms on top of
that. At queue depth 1, IOPS = 1000 / latency (in ms).
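
To put rough numbers on that (queue depth 1, taking the latencies above as an
assumption):

# Single-threaded IOPS from per-operation latency: IOPS = 1000 / latency_ms.
for label, latency_ms in (("10Gb + SSD journal", 3.5),
                          ("1Gb + SSD journal", 3.5 + 3.0)):
    print("{}: ~{:.0f} IOPS per client thread".format(label, 1000 / latency_ms))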

I guess it all really depends on how important performance is to you.


>
> >> 250GB SSD - Journal (MX200 250GB with extreme over-provisioning,
> >> staggered deployment, monitored for TBW-Value)
>
> > Not sure if that SSD would be suitable for a journal. I would
> > recommend going with one of the Intel 3700's. You could also save a
> > bit and run the OS from it.
>
> I am still on the fence about ditching the SATA-DOM and installing the OS
> on the SSD as well.
>
> If the MX200s turn out to be unsuited, I can still use them for other
> purposes and fetch some better SSDs later.
>
> >> Seagate Surveillance HDD (ST3000VX000) 7200rpm
>
> > Would also possibly consider a more NAS/Enterprise friendly HDD
>
> I thought video-surveillance HDDs would be a nice fit; they are built to
> run 24/7 and to write multiple data streams to disk at the same time.
> They are also cheap, which enables me to get more nodes from the start.

Just had a look: the Seagate Surveillance disks spin at 7200RPM (I missed
that you had put that there), whereas the WD ones I am familiar with spin at
5400RPM, so not as bad as I thought.

So they are probably OK to use, but I don't see many people using them for
Ceph or generic NAS duty, so I can't be sure there are no hidden gotchas.

>
> > CPU might be on the limit, but would probably suffice. If anything you
> > won't max out all the cores, but the overall speed of the CPU might
> > increase latency, which may or may not be a problem for you.
>
> Do you have some values, so that I can imagine the difference?
> I also maintain another cluster with dual-socket hexa-core Xeon
> 12-OSD nodes, and all the CPUs do is idle. The 2x10G LACP link is
> usually never used above 1 Gbit.
> Hence the focus on cost-efficiency with this build.

Sorry, nothing in detail. I did actually build a Ceph cluster on the same
8-core CPU you have listed. I didn't have any performance problems, but I do
remember that with SSD journals and high queue depth writes I could drive the
CPU quite high. It's like what I said before about 1Gb vs 10Gb networking:
how important is performance? If using this CPU gives you an extra 1ms of
latency per OSD, is that acceptable?

Agreed, 12 cores (guessing 2.5GHz each) will be overkill for just 12 OSDs. I
have a very similar spec and see exactly the same as you, but I will change
the nodes to one CPU each when I expand and use the spare CPUs for the new
nodes.

I'm using this:-

http://www.supermicro.nl/products/system/4U/F617/SYS-F617H6-FTPTL_.cfm

Mainly because of rack density, which I know doesn't apply to you. But the
fact they share PSUs/rails/chassis helps reduce power a bit and drives down
cost.

I can get 14 disks in each node, and they have 10GbE on board. The SAS
controller can be flashed to JBOD mode.

Maybe one of the other Twin solutions might be suitable?

>
> >> Are there any cost-effective suggestions to improve this configuration?
>
> > Have you looked at a normal Xeon based server but with more disks per
> > node? Depending on how much capacity you need spending a little more
> > per server but allowing you to have more disks per server might work
> > out cheaper.
>
> > There are some interesting SuperMicro combinations, or if you want to
> > go really cheap, you could buy Case,MB,CPU...etc separately and build
> > yourself.
>




_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
