Re: 800TB - Ceph Physical Architecture Proposal

Hello,

On Thu, 7 Apr 2016 09:24:46 -0500 Brady Deetz wrote:

> I'd appreciate any critique on the following plan.
> 
> Before I detail the plan, here are my current questions.
Traditionally they come below the plan/details. ^o^

> -----------------------------------------------------------
> 1) Am I under-powering the CPU on the proposed OSD node configuration?
> 
Massively. 

> 2) Will latency of roughly 300 microseconds introduced by 10g-base-t
> deliver appreciably worse write performance than the approximate 850
> nanoseconds of latency introduced by 10 or 40g fiber?
>
Where's that number coming from? 
I was under the impression that the physical cable plays a minor role when
compared to the capabilities of the NICs/switches.
That said, latency is bad and to be avoided at "all" cost.
 
> 3) I currently don't have a cache tier built into this design. What would
> be the best approach to adding an SSD cache tier at this scale?
> 
Winning the lottery first. 
Failing that, cache tiering with Jewel and beyond can significantly
improve things, as long as your truly hot objects (those that get written
to "constantly") can fit in your SSD cache pool.
This will be helped by working read-recency or the readforward cache mode
(only writes will promote objects into the cache pool, reads come from the
base pool if the object isn't already in the cache pool).
From what you write below I would aim for at least 8TB net data capacity
in such a cache pool.
If you were to use dedicated storage nodes for this, THEN a significantly
faster uplink and interconnect compared to your HDD OSD nodes would be
advisable. 
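
To put a rough sketch behind that, attaching such a readforward cache
pool could look like the following (pool names, PG count, the SSD crush
ruleset and the 8TB target_max_bytes are placeholders, untested against
your setup):

  # cache pool on the SSD OSDs, ruleset 1 assumed to select the SSDs
  ceph osd pool create cache 512 512
  ceph osd pool set cache crush_ruleset 1
  # attach it to the existing base pool (here "rbd") and pick the mode
  ceph osd tier add rbd cache
  ceph osd tier cache-mode cache readforward
  ceph osd tier set-overlay rbd cache
  # hit set tracking and sizing (~8TB net)
  ceph osd pool set cache hit_set_type bloom
  ceph osd pool set cache hit_set_count 4
  ceph osd pool set cache hit_set_period 3600
  ceph osd pool set cache target_max_bytes 8796093022208
  ceph osd pool set cache cache_target_dirty_ratio 0.4
  ceph osd pool set cache cache_target_full_ratio 0.8

Flushing and eviction behaviour then hinges on those dirty/full ratios,
so test with a realistic workload before trusting production data to it.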

> 4) What have I not considered?
The Spanish inquisition.

> 
> 5) Am I insane for thinking this is a good idea?
Not more than the rest of the people who're using OSS for mission critical
data storage...

> -----------------------------------------------------------
> 
> The Plan:
> I'm in the processes of designing a large Ceph deployment for my brain
> research organization. Currently, we are using Isilon and Oracle ZFS for
> file storage and a small amount of VMware. On Isilon we are using roughly
> 80TB and on Oracle we are using roughly 275TB. The goal would be to
> consolidate all storage into a single environment with about 800TB usable
> with 3 replicas (probably not using erasure coding due to documented
> performance impact).
> 
> About 80 of the 275TB on ZFS are larger MRI DICOM and BRIK files that
> could be moved to RADOS.
> 
> The remaining 200ish TB of that data would need to be online file
> storage, accessible over cifs and nfs. Roughly 30TB of that 200TB will
> have been hot in the past month. At any given moment, about 4TB of files
> are hot. Right now I'm considering CephFS vs RBD with a filesystem on
> top.
> 
There are efforts for seamless integration of NFS and CIFS into Ceph, but
I'm no expert on those; read up on them (google).
You may of course be even better off if you can standardize on CephFS if
possible, but you seem to have a diverse environment. 
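
For what it's worth, on the CIFS side one of those efforts is the Samba
vfs_ceph module, which talks to CephFS directly via libcephfs. A minimal,
untested smb.conf sketch (share name, path within CephFS and the cephx
user are assumptions on my part):

  [research]
      path = /
      vfs objects = ceph
      ceph:config_file = /etc/ceph/ceph.conf
      ceph:user_id = samba
      kernel share modes = no
      read only = no

On the NFS side, NFS-Ganesha has a Ceph FSAL that can serve CephFS
without needing a local kernel mount on the gateway.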

> For interconnect (east-west), I am weighing options between 10g sfp+,
> 10g-base-t, and 40g qsfp+ (all from the Brocade VDX or Mellanox lines).
> For north-south access to the storage, 2x 40g is what I've landed on.
> 
When I first read that paragraph I thought you were talking about
geographical distributed DCs. ^_-
All my clusters are using a flat QDR Infiniband (IPoIB, RDMA support is
forthcoming in Ceph) network for storage; I didn't bother with separate
public (clients) and private (replication) networks.
I'm quite happy with the performance and price, IB switches are (or at
least were, this is changing with things like Cumulus) significantly
cheaper than 10Gb/s Ethernet ones.

I'd try to standardize things; KISS is a very good principle.
And you do NOT want single links in either network.
Ideally use something like vLAG, mLAG or TRILL that will give you
redundancy AND link aggregation.

That being said, something like 4x10Gb/s (2 each for replication and
clients) would fit the bill if your cluster has a mixed load or is write
heavy (you will want to be able to replicate data as fast as your clients
deliver it).
Or something like 2x10Gb/s for replication and 4x10Gb/s for clients if
your cluster is read-heavy.
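
For illustration, on the Linux side that boils down to an LACP bond per
network plus telling Ceph which subnet is which. Interface names and
subnets below are made up, and your switches need the matching LAG/vLAG
configuration on their end:

  # /etc/network/interfaces (Debian-style ifupdown), replication bond
  auto bond1
  iface bond1 inet static
      address 10.0.2.21
      netmask 255.255.255.0
      bond-slaves eth2 eth3
      bond-mode 802.3ad
      bond-miimon 100
      bond-xmit-hash-policy layer3+4

  # ceph.conf on all nodes
  [global]
      public network  = 10.0.1.0/24
      cluster network = 10.0.2.0/24

Keep in mind that a single client connection won't exceed the speed of
one member link; the aggregation only helps when many clients and OSDs
talk at once.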

Why not 40Gb/s? See below.

> 11 OSD nodes:
> -SuperMicro 6047R-E1R36L
> --2x E5-2603v2
Vastly underpowered for 36 OSDs; the usual rule of thumb is roughly one
core or 1GHz per HDD OSD, and a pair of E5-2603v2 only gives you 8 cores
at 1.8GHz.
> --128GB RAM
> --36x 6TB OSD
> --2x Intel P3700 (journals)
Which exact model?
If it's the 400GB one, that's 2GB/s maximum write speed combined.
That's slightly below what I'd expect your 36 HDDs to be able to write,
about 2.5GB/s (36*70MB/s), but not unreasonably so.
However, your initial network thoughts are massively overspec'ed for this
kind of performance.
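
If you want to verify the journal side rather than trust spec sheets,
something along these lines approximates the sequential direct/sync
writes a filestore journal does (the device name is an assumption, and
the run is destructive, so only on an unused drive):

  fio --name=journal-test --filename=/dev/nvme0n1 --direct=1 --sync=1 \
      --rw=write --bs=1M --iodepth=1 --numjobs=1 --runtime=60 --time_based

If the two P3700s together can't keep up with what your HDDs and network
can sustain, they become the bottleneck for writes.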

> 
> 3 MDS nodes:
> -SuperMicro 1028TP-DTR (one node from scale-out chassis)
> --2x E5-2630v4
> --128GB RAM
> --2x 120GB SSD (RAID 1 for OS)
Not using CephFS myself, but if the MDS are like all the other Ceph bits
(MONs in particular) they are likely to do happy writes to leveldbs or the
like; do verify that.
If that's the case, fast and durable SSDs will be needed.

> 
> 5 MON nodes:
> -SuperMicro 1028TP-DTR (one node from scale-out chassis)
> --2x E5-2630v4
> --128GB RAM
> --2x 120GB SSD (RAID 1 for OS)
> 
Total overkill; are you sure you didn't mix up the CPUs for the OSDs with
the ones for the MONs?
Also, while dedicated MONs are nice, they really can live rather frugally,
except for their lust for fast, durable storage.
If I were you, I'd get 2 dedicated MON nodes (with few, fastish cores) and
32-64GB RAM, then put the other 3 MONs on your MDS nodes, which seem to
have plenty of resources to go around.
You will want the dedicated MONs to have the lowest IPs in your network,
as the monitor leader is chosen by that.
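
Easy enough to verify once the cluster is up; ranks in the monmap are
assigned by ascending IP address and the lowest rank in the quorum
becomes the leader:

  ceph mon dump
  ceph quorum_status --format json-pretty | grep quorum_leader_name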

Christian
> We'd use our existing Zabbix deployment for monitoring and ELK for log
> aggregation.
> 
> Provisioning would be through puppet-razor (PXE) and puppet.
> 
> Again, thank you for any information you can provide
> 
> --Brady


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


