Re: SAN or DAS for Production ceph


 



Hi James,

 

I can see where some of the confusion has arisen; hopefully I can put at least some of it to rest. In the Tumblr post from Yahoo, the keyword to look out for is “nodes”, which is distinct from individual hard drives: in Ceph, each drive is typically a single OSD, so you would have multiple OSDs per node.

 

My quick napkin math would suggest that they are using 54 storage nodes, each holding 16 drives/OSDs (this doesn’t count the OS drives, which aren’t specified in the post), per the math below:

 

54 storage nodes providing 3.2PB of raw storage requires ~59.26TB of storage per node

59.26TB / 12 = 4.94TB per OSD

59.26TB / 14 = 4.23TB per OSD

59.26TB / 16 = 3.70TB per OSD

 

Total OSDs per cluster = 864

EC Calculation: 8 / (8+3) = 72.73%

 

As they are using an 8+3 erasure coding configuration, that provides an efficiency of 72.73% (see the EC calculation above), so the usable capacity per storage cluster is around 2.33PB.
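
In case it helps to see that arithmetic in one place, here is a quick Python sketch of the same napkin math (the node count, drives per node and EC profile are my assumptions from above, not figures confirmed in the post):

# Napkin math for the assumed layout: 54 nodes, 3.2PB raw, EC 8+3
raw_pb = 3.2                        # raw capacity per cluster, in PB
nodes = 54                          # assumed storage nodes per cluster
raw_tb_per_node = raw_pb * 1000 / nodes
print(f"Raw per node: {raw_tb_per_node:.2f} TB")            # ~59.26 TB

for drives_per_node in (12, 14, 16):
    per_osd = raw_tb_per_node / drives_per_node
    print(f"{drives_per_node} OSDs/node -> {per_osd:.2f} TB per OSD")

total_osds = nodes * 16             # assuming the 16-drive layout
ec_k, ec_m = 8, 3                   # 8 data chunks + 3 coding chunks
efficiency = ec_k / (ec_k + ec_m)   # 72.73%
usable_pb = raw_pb * efficiency
print(f"OSDs: {total_osds}, usable: {usable_pb:.2f} PB")    # 864 OSDs, ~2.33 PB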

 

I haven’t included the calculation for anything below 12 drives as, while it is possible, I find the 16-drive configuration most probable. Ceph CRUSH weights are shown in TiB, but most hard drives are marketed in TB because of the higher number, which would mean that 4TB drives are in use, providing roughly 3.63TiB of usable space per drive. The math isn’t perfect here, as you can see, but I’d think it is a safe assumption that they have at least a few higher-capacity drives in there, or a wider mix of standard commodity drive sizes with 4TB simply being a decent average.
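
As a quick illustration of the TB-vs-TiB point (just the unit conversion, nothing Ceph-specific):

# A "4TB" drive is marketed as 4 x 10^12 bytes; Ceph shows CRUSH weights in TiB (2^40 bytes)
marketed_tb = 4
tib = marketed_tb * 10**12 / 2**40
print(f"{marketed_tb}TB drive = {tib:.2f} TiB")   # ~3.64 TiB raw
# The CRUSH weight you actually see is usually a touch lower again
# (around 3.63 TiB) once partitioning/filesystem overhead is taken off.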

 

For object storage clusters, particularly in use cases with high volumes of small objects, a moderate OSD-per-node density is preferable, which hovers between 10 and 16 OSDs per server depending on who you ask (some reading on the subject courtesy of Red Hat: https://www.redhat.com/cms/managed-files/st-ceph-storage-qct-object-storage-reference-architecture-f7901-201706-v2-en.pdf). As Yahoo’s post notes that consistency and latency are important metrics for their workload, they are also likely to use this density profile rather than something higher – this has the added benefit of quicker recovery times in the event of an individual OSD/host failure, which is a parameter they tuned quite extensively.
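
To put a very rough number on the recovery point (purely illustrative, using the assumed 4TB drives from above):

# Rough illustration of why OSD-per-node density matters for recovery:
# losing a whole node means re-creating everything it held elsewhere.
drive_tb = 4                 # assumed marketed drive size
utilisation = 0.70           # assume OSDs are ~70% full
for osds_per_node in (10, 16, 24):
    lost_tb = osds_per_node * drive_tb * utilisation
    print(f"{osds_per_node} OSDs/node -> ~{lost_tb:.0f} TB to backfill after a node failure")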

 

For hashing algorithms and load balancing, I am not quite sure I understand your question, but RGW, which implements object storage in Ceph, can be configured with multiple zones, zonegroups and realms; it might be best to have a read through the docs first:

http://docs.ceph.com/docs/luminous/radosgw/multisite/

 

Ceph is quite different from a SAN or DAS, and gives a great deal more flexibility too. If you are unsure about getting started and you need to hit the ground running (i.e. on a multi-PB production system), I’d really recommend engaging a reliable consultant or taking out professional support services for it. Ceph is a piece of cake to manage when everything is working well, and very often that will be the case for a long time, but you will really value good planning and experience when you hit those rough patches.

 

Hope that helps,

 

Tom

 

 

From: ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> On Behalf Of James Watson
Sent: 28 August 2018 21:05
To: ceph-users@xxxxxxxxxxxxxx
Subject: SAN or DAS for Production ceph

 

Dear cephers, 

 

I am new to the storage domain. 

Trying to get my head around an enterprise, production-ready setup. 

 

The following article helps a lot here: (Yahoo ceph implementation)

 

But a couple of questions:

 

What HDDs would they have used here? NVMe / SATA / SAS etc. (with just 52 storage nodes they got 3.2 PB of capacity!!)

I tried to calculate a similar setup with the HGST Ultrastar He12 (12TB, and it's more recent) and would need 86 HDDs, which adds up to only 1 PB!!

 

How are the HDDs attached: is it DAS or a SAN (using Fibre Channel switches, host bus adapters, etc.)?

 

Do we need a proprietary hashing algorithm to implement a multi-cluster setup of Ceph, to contain CPU/memory usage within a cluster when rebuilding happens after a device failure?

 

If a proprietary hashing algorithm is required to set up multi-cluster Ceph using a load balancer, then what alternative setup could we deploy to address the same issue?

 

The aim is to design a similar architecture but with upgraded products and higher performance. Any suggestions or thoughts are welcome.

 

 

 

Thanks in advance

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
