Hello,

replying to the original post for quoting reasons.
Totally agree with what the others (Nick and Burkhard) wrote.

On Tue, 04 Oct 2016 15:43:18 +0200 Denny Fuchs wrote:

> Hello,
>
> we are brand new to Ceph and planning it as our future storage for
> KVM/LXC VMs, as a replacement for our Xen / DRBD / Pacemaker / Synology
> (NFS) stuff.
>
>
> We have two goals:
>
> * High availability
> * Short latency for our transaction services

Search the ML archives, previous posts by Nick in particular.
A lot of things are possible, but since, for example, reads are always
local with DRBD, you may be surprised by some performance results.

> * For later: replication to a different datacenter connected via 10Gb/s FC
>
If this is async replication via RBD mirroring, you have a chance.
Though that is brand new in Jewel and still has quite a few rough edges
and a lot of room for improvement.
If you're thinking of extending your Ceph cluster to another DC, it will
kill your latency unless that DC is more or less next door.

>
> Our services are:
>
> * Web application as frontend
> * Database (Sybase / MariaDB Galera) as backend
>
> All needed for doing transactions
>
>
> All we are planning is at this time more than we need, but for future
> development and as a replacement for our old hardware and software we
> want the best we can get for our (approved) money :-)
>
> So, here we are:
>
> Starting with a six node OSD cluster whose nodes are not only doing OSD
> duty, but also holding the mon services.

Make sure to have fast (SSD) OS disks for the leveldb activity of the MONs.
You should be fine for RAM. CPU is probably fine as well, but it is the
least predictable part of a node that has to share 24 OSDs with a MON.

> We want to store data only via the API, so a separate metadata server
> isn't needed, if I understand the documentation correctly.
>
>
> The first test hardware is:
>
> * Motherboard: Asus Z10PR-D16
> ** https://www.asus.com/de/Commercial-Servers-Workstations/Z10PRD16/specifications/
>
> * CPU: 2 x E5-2620v4

As Nick elaborated, you may fare better with fewer but faster cores,
depending on your I/O patterns and latency needs.

> * RAM: 4 x 32GB DDR4 2400MHz
>
Sufficient.

> * Chassis: RSC-2AT0-80PG-SA3C-0BL-A
> ** http://www.aicipc.com/ProductSKU.aspx?ref=RSC-2AT
> ** Edition without Expander
>
> * SAS: 1 x 9305-24i
> ** http://www.avagotech.com/products/server-storage/host-bus-adapters/sas-9305-24i#specifications
>
> * Storage NIC: 1 x Infiniband MCX314A-BCCT
> ** I read that the ConnectX-3 Pro is better supported than the X-4 and a
> bit cheaper

True.

> ** Switch: 2 x Mellanox SX6012 (56Gb/s)
> ** Active FC cables

Why? Surely this is all in a rack or two?

> ** Maybe VPI is nice to have, but unsure.
>
As pointed out, Ceph currently doesn't support IB natively, you have to
use IPoIB, which benefits from fast CPUs and good PCIe slots.
And as a matter of fact, all my clusters use this, including the client
(compute node) connections.

Things to consider here are:

1. No active-active bonding, only failover. So only one link (about
   40Gb/s effective after IPoIB overhead) is in use at any time; see the
   bonding sketch below.

2. Your cluster/storage network is already massively faster than what
   your individual nodes can handle (1GB/s writes with your proposed
   single 400GB NVMe).

I'm a big fan of IB, but unless you can standardize on it and go
end-to-end for everything with it, 2 different network stacks and cards
are just going to drive the costs up.

So in conclusion, lose the dedicated storage network and put the money
where it will do more good (decent SSDs).
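If you do wind up keeping the IPoIB cluster network, a minimal, untested
sketch of the failover-only bonding from point 1 above (Debian-style
ifupdown assumed; the address and the interface names ib0/ib1 are
placeholders):

  # /etc/network/interfaces fragment, hypothetical OSD node
  # requires the ifenslave package
  auto bond0
  iface bond0 inet static
      address 10.10.10.11
      netmask 255.255.255.0
      bond-slaves ib0 ib1
      # IPoIB cannot do 802.3ad/active-active, failover is the only option
      bond-mode active-backup
      bond-miimon 100

With a single (public) network instead, you simply leave "cluster network"
unset in ceph.conf and all traffic goes over the 10GbE bond.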
> * Production NIC: 1 x Intel 520 dual port SFP+
> ** Each connected to one of the HP 2920 10Gb/s ports via 802.3ad
>
These things can do MC-LAG if my quick search is correct, so with both
switches up you have (about) 20Gb/s of bandwidth to your OSD nodes.
Again, that's twice as fast as your journal NVMe bottleneck.
So yeah, get 2 journal NVMes for bandwidth and redundancy purposes and use
the money saved by not having a cluster network for this.

> All nodes are cross-connected to both switches, so if one switch
> goes down, a second path is available.
>
>
> * Disk:
> ** Storage: 24 x Crucial MX300 250GB (maybe for production 12x SSD /
> 12x big SATA disks)

These things have a 40GB/day, 0.15 DWPD endurance. The worst Intel DC SSDs
(S35xx) last twice as long.
And that's before any write amplification by Ceph (write patterns for
small objects) or the FS (journal) is factored in.
When (not if) all these SSDs die at the same time and long before the 5
years are up, the reaction here will be "we told you so".
Unless you have a nearly read-only setup with VERY well known and
controlled write patterns/volume, you don't want to use those. And your
use case suggests otherwise.
As an alternative to Intel (again, search the ML), Samsung DC level models
work as well and can be cheaper.
Of course, if you're thinking about journals on these, I'm betting they
will have horrid (unusable) SYNC write performance; see the fio test at
the end of this mail.

> ** OSD journal: 1 x Intel SSD DC P3700 PCIe
>
Which size? Because that determines both speed and endurance, though the
latter would never be an issue if you were to use those Crucials above.
While basically a good choice, it is going to be your bottleneck,
especially if it's the 400GB model (most likely, given your budget
worries).
Consider 2 of those, to saturate your network as mentioned above.

>
> One of the hardest parts was the chassis, with or without an active
> expander, so that we can use a "cheaper" HBA, like the 8i or something
> else.

I find that nearly all combinations I need or can think of are covered by
Supermicro.

> Also whether we want/need a full RAID controller like the MegaRAID
> SAS-9361-8i, because of battery and cache. But it seems that it isn't
> really needed in our case. Sure, the cache is one of the benefits, but
> maybe it is more complicated than a plain HBA.
>
The Areca controllers (and some others AFAIK) can use the cache when used
in HBA mode; with others you have to create single-drive RAID0 volumes,
which is a PITA of course.
HW caches definitely can help; whether they are worth the money is up to
you.

>
> From the Ceph point of view, we want two OSD nodes to be able to go down
> in a worst case scenario while keeping our business up (a bit slower is
> OK, and expected).

For that to work with default replication (size=3) you will need to tune
min_size from 2 to 1 (see the sketch at the end of this mail). Which is
fine with me, but it tends to make other people here break out in hives
and predict the end of days.
Alternatively you can go for 4x replication, with all the cost in storage
space and replication overhead that entails.

> Also when the nodes come back, we are not down, because of
> the replication stuff ;-)
>
Not sure how to parse this sentence.
Do you mean "The design should be able to handle the recovery (backfill)
traffic from a node failure without significant impact on client I/O
performance"?
If so, that's more of a configuration tuning thing, though beefy HW of
course helps.
I don't foresee any real problems with a pure SSD cluster, even un-tuned.
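About the SYNC write performance of journal candidates: the usual test on
this list is a single-threaded synchronous 4k write with fio, roughly like
the sketch below. It writes to the raw device, so only run it on a disk
you can wipe; /dev/sdX is a placeholder.

  fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
      --numjobs=1 --iodepth=1 --runtime=60 --time_based \
      --group_reporting --name=journal-test

Proper DC journal SSDs/NVMes sustain tens of thousands of IOPS here,
consumer models frequently end up in the hundreds.

And the two tuning knobs mentioned above, again just a rough sketch (the
pool name "rbd" and the exact values are placeholders, adjust to your
pools and needs):

  # allow client I/O to continue with only one replica left (size=3 pool)
  ceph osd pool set rbd min_size 1

  # ceph.conf, [osd] section: keep backfill/recovery from a returning or
  # replaced node from trampling client I/O
  osd max backfills = 1
  osd recovery max active = 1
  osd recovery op priority = 1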
Christian

-- 
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com