800TB - Ceph Physical Architecture Proposal

I'd appreciate any critique on the following plan.

Before I detail the plan, here are my current questions.
-----------------------------------------------------------
1) Am I under-powering the CPU on the proposed OSD node configuration?

2) Will the roughly 300 microseconds of latency introduced by 10GBase-T deliver appreciably worse write performance than the roughly 850 nanoseconds of latency introduced by 10G or 40G fiber?

3) I currently don't have a cache tier built into this design. What would be the best approach to adding an SSD cache tier at this scale?

4) What have I not considered?

5) Am I insane for thinking this is a good idea?
-----------------------------------------------------------

The Plan:
I'm in the process of designing a large Ceph deployment for my brain research organization. Currently we are using Isilon and Oracle ZFS for file storage, plus a small amount of VMware. We use roughly 80TB on Isilon and roughly 275TB on Oracle. The goal is to consolidate all storage into a single environment with about 800TB usable at 3 replicas (probably not erasure coding, due to its documented performance impact).

About 80TB of the 275TB on ZFS is larger MRI DICOM and BRIK files that could be moved to RADOS.
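For the RADOS piece, this is roughly what I have in mind for pushing individual scan files in as objects (a minimal python-rados sketch; the 'dicom' pool and the file/object names are placeholders, not anything that exists yet):

import rados

# Minimal sketch: write one scan file into a hypothetical 'dicom' pool.
# Assumes a ceph.conf and client keyring on the machine doing the import.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('dicom')          # placeholder pool name
    try:
        with open('scan0001.dcm', 'rb') as f:    # placeholder file name
            ioctx.write_full('subject42/scan0001.dcm', f.read())
    finally:
        ioctx.close()
finally:
    cluster.shutdown()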

The remaining 200-ish TB of that data would need to be online file storage, accessible over CIFS and NFS. Roughly 30TB of that 200TB has been hot within the past month, and at any given moment about 4TB of files are hot. Right now I'm weighing CephFS against RBD with a filesystem on top.
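If we go the RBD route, the provisioning side would look something like this (a rough sketch with the python-rbd bindings; the pool name, image name, and size are placeholders, and the image would then get a local filesystem and be exported over CIFS/NFS from a gateway box -- CephFS would skip this step entirely):

import rados
import rbd

# Rough sketch: create one large RBD image to carry a filesystem for
# the CIFS/NFS gateway.  Pool/image names and size are placeholders.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('file-pool')                  # placeholder pool
    try:
        rbd.RBD().create(ioctx, 'fileserver01', 200 * 1024**4)  # ~200 TiB, placeholder size
    finally:
        ioctx.close()
finally:
    cluster.shutdown()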

For the interconnect (east-west), I am weighing options between 10G SFP+, 10GBase-T, and 40G QSFP+ (all from the Brocade VDX or Mellanox lines). For north-south access to the storage, 2x 40G is what I've landed on.
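For question 2, here is the back-of-the-envelope arithmetic I've been doing (a sketch only; the per-hop figures are the ones quoted above, and the hop count and HDD write time are assumptions):

# Rough comparison of how much the quoted per-hop latencies matter
# relative to an HDD-backed write.  All inputs are assumptions except
# the two per-hop figures quoted in question 2.

hops_per_write   = 4        # client->primary plus replication to two peers (rough assumption)
per_hop_10gbaset = 300e-6   # seconds, figure quoted above for 10GBase-T
per_hop_fiber    = 850e-9   # seconds, figure quoted above for 10G/40G fiber
disk_write       = 2e-3     # ~2 ms for an HDD OSD with SSD journal (assumption)

for name, per_hop in [('10GBase-T', per_hop_10gbaset), ('fiber', per_hop_fiber)]:
    network = hops_per_write * per_hop
    share = network / (network + disk_write)
    print(f'{name}: network adds {network * 1e6:.1f} us, ~{share:.1%} of the write')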

11 OSD nodes:
-SuperMicro 6047R-E1R36L
--2x E5-2603v2
--128GB RAM
--36x 6TB OSD
--2x Intel P3700 (journals)
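For reference, the rough arithmetic on raw capacity and CPU per OSD for that build (a sanity-check sketch; the 1 GHz-per-OSD figure is just the commonly cited rule of thumb for HDD-backed OSDs, not a measurement):

# Sanity-check sketch for the OSD node build above.

nodes         = 11
osds_per_node = 36
osd_size_tb   = 6
replicas      = 3

raw_tb    = nodes * osds_per_node * osd_size_tb   # 2376 TB raw
usable_tb = raw_tb / replicas                     # 792 TB before fill headroom
print(f'raw {raw_tb} TB -> ~{usable_tb:.0f} TB usable at {replicas}x replication')

# E5-2603v2: 4 cores at 1.8 GHz, two sockets per node.
cpu_ghz_per_node = 2 * 4 * 1.8                    # 14.4 GHz
print(f'{cpu_ghz_per_node / osds_per_node:.2f} GHz per OSD '
      f'vs the ~1 GHz/OSD rule of thumb')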

3 MDS nodes:
-SuperMicro 1028TP-DTR (one node from scale-out chassis)
--2x E5-2630v4
--128GB RAM
--2x 120GB SSD (RAID 1 for OS)

5 MON nodes:
-SuperMicro 1028TP-DTR (one node from scale-out chassis)
--2x E5-2630v4
--128GB RAM
--2x 120GB SSD (RAID 1 for OS)

We'd use our existing Zabbix deployment for monitoring and ELK for log aggregation.
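On the Zabbix side, the check script would be something along these lines (a sketch using the librados mon_command interface; assumes a client keyring allowed to run 'status', and the exact layout of the health section varies by release):

import json
import rados

# Sketch of a Zabbix check: pull cluster status over librados and
# report the health summary.  Assumes ceph.conf and a client keyring
# with permission to run the 'status' mon command.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ret, out, errs = cluster.mon_command(
        json.dumps({'prefix': 'status', 'format': 'json'}), b'')
    status = json.loads(out)
    # The health key's structure differs between Ceph releases.
    print(status.get('health'))
finally:
    cluster.shutdown()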

Provisioning would be through puppet-razor (PXE) and puppet.

Again, thank you for any information you can provide.

--Brady
