Re: 6 Node cluster with 24 SSD per node: Hardware planning / agreement

Hello,

replying to the original post for quoting reasons.

Totally agree with what the others (Nick and Burkhard) wrote.

On Tue, 04 Oct 2016 15:43:18 +0200 Denny Fuchs wrote:

> Hello,
> 
> we are brand new to Ceph and planning it as our future storage for 
> KVM/LXC VMs, as a replacement for our Xen / DRBD / Pacemaker / Synology 
> (NFS) stuff.
> 
> 
> We have two goals:
> 
> * High availability
> * Short latency for our transaction services
Search the ML archives, previous posts by Nick in particular.

A lot of things are possible, but since reads, for example, are always local
with DRBD, you may be surprised by some of the performance comparisons.

> * For later: replication to different datacenter connected via 10Gb/s FC
> 
If this is async replication via RBD mirroring you have a chance.
Though this is brand new in Jewel and still has quite a few rough edges and
plenty of room for improvement.

If you're thinking of extending your Ceph cluster to another DC it will
kill your latency, unless it's more or less next door.
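
For reference, the rough shape of an RBD mirroring setup in Jewel looks
something like this (pool name "rbd" and cluster name "primary" are just
placeholders, and the images need the journaling feature enabled):

  # on both clusters: mirror every image in the pool that has journaling on
  rbd mirror pool enable rbd pool
  # on the backup site: add the primary cluster as a peer
  rbd mirror pool peer add rbd client.admin@primary
  # and run the rbd-mirror daemon there to replay the image journals
  systemctl start ceph-rbd-mirror@admin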

> 
> Our services are:
> 
> * Web application as frontend
> * Database (Sybase / MariaDB Galera) as backend
> 
> All needed for doing transactions
> 
> 
> All we are planning is at this time more than we need, but for future 
> development and replacement for our old hardware stuff and software, we 
> want the best, we can get for our (approved) money :-)
> 
> So, here we are:
> 
> Starting with a six OSD node cluster, that are doing not only OSD stuff, 
> but also holding the mon services. 

Make sure to have fast (SSD) OS disks for the leveldb activity of the MONs.
You should be fine for RAM.
CPU is probably fine, but it is the least predictable resource when 24 OSDs
per node have to share it with a MON.
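
If you want to keep an eye on that, the MON store lives under the mon data
directory (default paths assumed below) and can grow considerably during
recovery:

  df -h /var/lib/ceph/mon                      # should sit on the SSD OS disk
  du -sh /var/lib/ceph/mon/ceph-*/store.db     # current leveldb size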

>We want to store data only via API so 
> a separated meta server isn't needed, as I understand all the documents 
> right.
> 
> 
> The first test hardware is:
> 
> *Motherboard: Asus Z10Pr-D16
> ** 
> https://www.asus.com/de/Commercial-Servers-Workstations/Z10PRD16/specifications/
> 
> * CPU: 2 x E5-2620v4
As Nick elaborated, you may fare better with faster CPUs and fewer cores,
depending on your I/O patterns and latency needs.

> * Ram: 4 x 32GB DDR4 2400MHz
>
Sufficient.
 
> * Chassis: RSC-2AT0-80PG-SA3C-0BL-A
> ** http://www.aicipc.com/ProductSKU.aspx?ref=RSC-2AT
> ** Edition without Expander
> 
> * SAS: 1 x 9305-24i
> ** 
> http://www.avagotech.com/products/server-storage/host-bus-adapters/sas-9305-24i#specifications
> 
> * Storage NIC: 1 x Infiniband MCX314A-BCCT
> ** I read that the ConnectX-3 Pro is better supported than the X-4, and a 
> bit cheaper
True.
> ** Switch: 2 x Mellanox SX6012 (56Gb/s)
> ** Active FC cables
Why active cables? Surely this all fits within a rack or two?
> ** Maybe VPI is nice to have, but unsure.
> 
As pointed out, Ceph currently doesn't support IB natively; you have to use
IPoIB, which benefits from fast CPUs and good PCIe slots.
And as a matter of fact, all my clusters use this, including the client
(compute node) connections. 

Things to consider here are:
1. No active-active bonding with IPoIB, only failover (see the sketch below),
so you effectively get one link, about 40Gb/s after IPoIB overhead.
2. Your cluster/storage network is already massively faster than what your
individual nodes can handle (about 1GB/s of writes with your proposed single
400GB NVMe journal).
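
To illustrate point 1, a minimal sketch of an IPoIB bond on Debian/Ubuntu
(/etc/network/interfaces, ifenslave installed; interface names and the
address are examples):

  auto bond0
  iface bond0 inet static
      address 10.0.8.11
      netmask 255.255.255.0
      bond-slaves ib0 ib1
      bond-mode active-backup    # the only bonding mode IPoIB supports
      bond-miimon 100
      pre-up modprobe ib_ipoib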

I'm a big fan of IB, but unless you can standardize on it and go
end-to-end for everything with it, 2 different network stacks and cards
are just going to drive the costs up.

So in conclusion, lose the dedicated storage network and put the money where
it will do more good (decent SSDs). 

> * Production NIC: 1 x Intel 520 dual port SFP+
> ** Connected each to one of a HP 2920 10Gb/s ports via 802.3ad
> 
These things can do MC-LAG if my quick search is correct, so with both
switches up you have (about) 20Gb/s bandwidth to your OSD nodes.

Again, that's twice as fast as your journal NVMe bottleneck. 
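
Rough numbers behind that claim, assuming the 400GB P3700 (its data sheet
lists about 1080MB/s sequential writes):

  2 x 10Gb/s LACP         ~ 2.3GB/s usable
  1 x P3700 400GB journal ~ 1.1GB/s (and every write goes through it)

A single journal device therefore caps each node at roughly half its network
bandwidth.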

So yeah, get 2 journal NVMes for bandwidth and redundancy purposes, and use
the money saved by not having a cluster network for them.

> All nodes are connected over cross to every switch, so if one switch 
> goes down, a second path is available.
> 
> 
> * Disk:
> ** Storage: 24 x Crucial MX300 250GB (maybe for production 12xSSD / 12x 
> big Sata disks)
These things have a 40GB/day (0.15 DWPD) endurance rating. 
The worst Intel DC SSDs (S35xx) last twice as long.

And that's before any write amplification by Ceph (write patterns for
small objects) or the FS (journal) is factored in.
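
Back-of-the-envelope, assuming the ~80TB TBW rating Crucial gives the small
MX300 models:

  80TB / (5 years x 365 days)            ~ 44GB/day of device writes per drive
  filestore journal on the same SSD (2x) -> ~22GB/day of OSD data per drive
                                            (primary and replica copies alike)

Any sustained write load beyond that eats into the 5 year rating, before XFS
metadata and Ceph's small-object write amplification are even counted.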

When (not if) all these SSDs die at the same time, long before 5 years are
up, the reaction here will be "we told you so".

Unless you have a nearly read-only setup with VERY well known and
controlled write patterns/volume, you don't want to use those.
And your use case suggests otherwise.

As an alternative to Intel (again, search the ML archives), Samsung DC-level
models work as well and can be cheaper.

Of course, if you are thinking about putting journals on these, I'm betting
they will have horrid (unusable) SYNC write performance.
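
If you want to verify that before buying, the usual quick test for journal
suitability is O_DSYNC/direct writes (the dd variant overwrites the target,
the device name is a placeholder):

  dd if=/dev/zero of=/dev/sdX bs=4k count=100000 oflag=direct,dsync
  # or, against a file on a mounted filesystem:
  fio --name=journal-test --filename=/mnt/test/fio.tmp --size=1G \
      --rw=write --bs=4k --direct=1 --sync=1 --numjobs=1 --iodepth=1

Consumer drives without power-loss protection tend to collapse to a few
hundred IOPS in this test, while proper DC SSDs barely slow down.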

> ** OSD journal: 1 x Intel SSD DC P3700 PCIe
> 
Which size? 
Because that determines both speed and endurance, though the latter would
never be an issue if you were to use those Crucials above.

While basically a good choice, it is going to be your bottleneck,
especially if it's the 400GB model (most likely, given your budget
worries).

Consider 2 of those, to saturate your network as mentioned above.
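
For the record, splitting the OSDs between two journal devices is trivial at
prepare time with ceph-disk (device names are examples), and the journal
partition size is taken from ceph.conf:

  # in ceph.conf, [osd] section:
  #   osd journal size = 10240      # 10GB journal partitions
  ceph-disk prepare /dev/sda /dev/nvme0n1    # OSDs 1-12  -> first NVMe
  ceph-disk prepare /dev/sdm /dev/nvme1n1    # OSDs 13-24 -> second NVMe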

> 
> One of the hardest part was the chassis with or without active expander, 
> so that we can use a "cheaper" HBA, like the 8i or something else.
I find that nearly all combinations I need or can think of are covered by
Supermicro.

> Also if we want/need a full raid controller like the Megaraid 
> sas-9361-8i, because of battery and cache. But it seems, that it isn't 
> really needed in our case. Sure, the cache is one of the benefits, but 
> maybe it is more complicated, than a plain HBA.
> 
The Areca controllers (and some others AFAIK) can use the cache when used
in HBA mode, with others you have to create single drive RAID0 volumes,
which is a PITA of course.

HW caches definitely can help, if they are worth the money is up to you.

> 
>  From the Ceph point of view, we want, that two OSD nodes can go down in 
> a worst case scenario, but keeping our business up (a bit slower is OK, 
> and expected). 

For that to work with default replication (size=3) you will need to tune
min_size from 2 to 1.
Which is fine with me, but it tends to make other people here break out in
hives and predict the end of all days.
Alternatively you can go for 4x replication, with all the cost in storage
space and replication overhead that entails.
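
Both are plain pool settings, e.g. (pool name is a placeholder):

  ceph osd pool set rbd min_size 1    # keep serving I/O with one copy left
  ceph osd pool set rbd size 4        # or the conservative 4x route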

> Also if the nodes comes back, we are not down, because of 
> the replication stuff ;-)
> 
Not sure how to parse this sentence.

Do you mean "The design should be able to handle the recovery (backfill)
traffic from a node failure without significant impact on the client I/O
performance."?

If so, that's more of a configuration tuning thing, though beefy HW of course
helps.
I don't foresee any real problems with a pure SSD cluster, even un-tuned.
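
For reference, the usual knobs to keep backfill from trampling client I/O
(values are common starting points, not gospel):

  [osd]
  osd max backfills = 1
  osd recovery max active = 1
  osd recovery op priority = 1
  osd client op priority = 63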


Christian
-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


