Re: Building a Pb EC cluster for a cheaper cold storage

Hello,

On Tue, 10 Nov 2015 13:29:31 +0300 Mike Almateia wrote:

> Hello.
> 
> For our project of storing CCTV streams we decided to use a Ceph cluster
> with an EC pool.
> The input requirements are not scary: max. 15Gbit/s of incoming traffic
> from the CCTV, 30 days of retention,
> 99% write operations, and the cluster must be able to grow without downtime.
> 
I have a production cluster that is also nearly write only.

I'd say that roughly 1.9GB/s (15Gbit/s) is a pretty significant amount of
traffic, but not scary in and of itself.
The question is how many streams we are talking about and how you are
writing that data (to CephFS, RBD volumes?).

All of this will determine how IOPS-intensive (as opposed to
throughput-bound) storing your streams will be.
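
A rough back-of-envelope with the numbers from this thread (15Gbit/s input,
6 nodes, 540 OSDs, EC k=7/m=3), written as Python purely for convenience:

# Throughput only; the real unknowns are stream count and write sizes.
input_gbit_s = 15
nodes = 6
osds = 540
ec_overhead = 10 / 7   # k=7, m=3: every client byte becomes 10/7 bytes on the OSDs

client_gb_s = input_gbit_s / 8             # ~1.9 GB/s of client writes
raw_gb_s = client_gb_s * ec_overhead       # ~2.7 GB/s actually hitting the OSDs
print(f"client writes: {client_gb_s:.2f} GB/s")
print(f"after EC 7+3 : {raw_gb_s:.2f} GB/s ({raw_gb_s / nodes:.2f} GB/s per node)")
print(f"per OSD      : {raw_gb_s * 1000 / osds:.1f} MB/s")

About 5MB/s per OSD is nothing for sequential writes, but with the journal
partition on the same disk (as in your plan) that doubles, and if the streams
arrive as many small writes it is the seek load that will hurt.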

> Our current vision of the architecture is:
> * 6 JBODs with 90x 8TB HDDs each (540 HDDs total)
> * 6 Ceph servers, each connected to its own JBOD (we will have 6 pairs:
> 1 server + 1 JBOD).
> 
As you guessed yourself, and as Paul suspects as well, I think that many
OSDs per node is too dense; it is more of a CPU than a RAM problem, plus
all the other tuning it will require.

Also, the cache tier drives (unless they're SSDs) are likely going to be
another bottleneck.

Consider this alternative:

* Same JBOD chassis
* Quite different Ceph nodes:
- 1 or 2 RAID controllers with the most cache you can get (I like Arecas
  with 4GB, YMMV). That cache (and the journal SSDs suggested below)
  should take care of things if your 15Gbit/s is sufficiently fragmented
  to cause large amounts of IOPS.
- 8x 11-disk RAID6 sets (one OSD each), plus 1 or 2 global hot spares
  depending on how many controllers you have.
- 256GB RAM or more, tuned so that hot objects and FS metadata (inodes
  etc.) stay cached; see below.
- If you can afford it, use FAST SSDs (or NVMe) as journals. You want to
  be able to saturate your network, so around 2GB/s.
  Four Intel DC S3700 400GB will get you close to that (quick math after
  this list).
- Since you now only have 8 OSDs per node, your CPU requirements are more
  to the tune of 12 (fast, 2.5GHz++) cores.
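
The journal math, assuming the S3700 400GB's spec-sheet sequential write
rate of roughly 460MB/s (substitute whatever SSD you actually buy):

journal_ssds = 4
ssd_write_mb_s = 460       # spec-sheet sequential write for the DC S3700 400GB
osds_per_node = 8

journal_gb_s = journal_ssds * ssd_write_mb_s / 1000
print(f"journal bandwidth   : {journal_gb_s:.2f} GB/s")        # ~1.84 GB/s, close to the 2GB/s target
print(f"OSD journals per SSD: {osds_per_node // journal_ssds}")  # 2 journals per SSD

Two OSD journals per SSD also keeps the blast radius of a dead journal SSD
down to two OSDs.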

With "failproof" OSDs, you can choose 2x (not the default 3x) replication.
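
To put numbers on the space trade-off (back-of-envelope only, ignoring
filesystem and journal overhead), with 540x 8TB disks:

disks, disk_tb, nodes, near_full = 540, 8, 6, 0.75

# Your plan: EC 7+3 directly on 540 single-disk OSDs
ec_usable = disks * disk_tb * (7 / 10) * near_full

# The layout above: 88 of 90 disks per node in 11-disk (9+2) RAID6 sets,
# then 2x replication on top
raid6_tb = nodes * 88 * disk_tb * (9 / 11)
rep2_usable = raid6_tb / 2 * near_full

print(f"EC 7+3 on raw OSDs     : ~{ec_usable:.0f} TB usable")    # ~2268 TB
print(f"RAID6 + 2x replication : ~{rep2_usable:.0f} TB usable")  # ~1296 TB

So you pay for the resilience and simpler failure handling with usable
space; whether that is acceptable depends on how hard the 30 day retention
target really is.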

Another bonus is that you'll likely never have a failed OSD and the
resulting traffic storm.

The trick to keeping things happy here is to have enough RAM for all the
hot objects that need to be read, especially inodes and other FS metadata.

Of course if you can afford it (price/space), having less dense nodes will
significantly reduce the impact of a node failure.

> Ceph server hardware details:
> * 2x E5-2690 v3: 24 cores total (w/o HT), 2.6GHz each
> * 256GB RAM DDR4
> * 4x 10Gbit/s NIC ports (2 for the client network and 2 for the cluster
> network)
> * servers also have 4 (8)x 2.5" SATA HDDs on board for the cache tiering
> feature (because Ceph clients can't talk directly to an EC pool)
> * two SAS HBA controllers with multipathing, for an HA scenario.
A bit of overkill, given that your failure domain will still be at least a
whole storage node, or worse depending on your network/switch topology.

Regards,

Christian

> * For Ceph monitor functionality, 3 servers have 2 SSDs in software RAID1
> 
> Some Ceph configuration rules:
> * EC pools with K=7 and M=3
> * EC plugin - ISA
> * technique = reed_sol_van
> * ruleset-failure-domain = host
> * near full ratio = 0.75
> * OSD journal partition on the same disk
> 
> We think that the first and second problems will be CPU and RAM on the
> Ceph servers.
> 
> Any ideas? Can it fly?
> 
> 
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
http://www.gol.com/


