I mistakenly replied to Wido instead of the whole mailing list (weird ML settings, I suppose). Here is my message:

Thanks for replying so quickly. I commented inline.

On 03/27/2017 01:34 PM, Wido den Hollander wrote:
>
>> Op 27 maart 2017 om 13:22 schreef Christian Balzer <chibi@xxxxxxx>:
>>
>> Hello,
>>
>> On Mon, 27 Mar 2017 12:27:40 +0200 Mattia Belluco wrote:
>>
>>> Hello all,
>>> we are currently in the process of buying new hardware to expand an
>>> existing Ceph cluster that already has 1200 OSDs.
>>
>> That's quite sizable. Is the expansion driven by the need for more space
>> (big data?) or to increase IOPS (or both)?
>>
>>> We are currently using 24 * 4 TB SAS drives per host with an SSD journal
>>> shared among 4 OSDs. For the upcoming expansion we were thinking of
>>> switching to either 6 or 8 TB hard drives (9 or 12 per host) in order to
>>> drive down space and cost requirements.
>>>
>>> Has anyone any experience in mid-sized/large-sized deployments using such
>>> hard drives? Our main concern is the rebalance time, but we might be
>>> overlooking some other aspects.
>>
>> If you researched the ML archives, you should already know to stay well
>> away from SMR HDDs.
>
> Amen! Just don't. Stay away from SMR with Ceph.

We were planning on using regular enterprise disks. No SMR :) We are a bit puzzled about the possible performance gain of the 4K-native ones, but that's about it.

>> Both HGST and Seagate have large enterprise HDDs that have
>> journals/caches (MediaCache in HGST speak, IIRC) that drastically improve
>> write IOPS compared to plain HDDs.
>> Even with SSD journals you will want to consider those, as these new HDDs
>> will see at least twice the action of your current ones.
>
> I also have good experiences with bcache on an NVMe device in Ceph clusters. A single Intel P3600/P3700 serves as the caching device for bcache.

No experience with those, but I am a bit skeptical about introducing new solutions into the current cluster, as the current setup seems to work quite well (no IOPS problem). They could be a nice solution for a new cluster, though.

>> Rebalance time is a concern of course, especially if your cluster, like
>> most HDD-based ones, has these things throttled down so as not to impede
>> actual client I/O.
>>
>> To get a rough idea, take a look at:
>> https://www.memset.com/tools/raid-calculator/
>>
>> For Ceph with replication 3 and the typical PG distribution, assume 100
>> disks; the RAID6-with-hotspares numbers are the relevant ones.
>> For rebuild speed, consult your experience, you must have had a few
>> failures. ^o^
>>
>> For example, with a recovery speed of 100MB/s, a 1TB disk (used data with
>> Ceph actually) looks decent at 1:16000 DLO/y.
>> At 5TB though it enters scary land.
>
> Yes, those recoveries will take a long time. Let's say your 6TB drive is filled to 80%: you need to rebalance 4.8TB.
>
> 4.8TB / 100MB/sec = 13 hours rebuild time
>
> 13 hours is a long time. And you will probably not have 100MB/sec sustained; I think that 50MB/sec is much more realistic.
>
> That means that a single disk failure will take >24 hours to recover from.
>
> I don't like very big disks that much. Not in RAID, not in Ceph.

I don't think I am following the calculations. Maybe I need to provide a few more details on our current network configuration: each host (24 disks/OSDs) has 4 * 10 Gbit interfaces, 2 for client I/O and 2 for the recovery network. Rebalancing an OSD that was 50% full (2000 GB) with the current setup took a little less than 30 minutes. It would still take about 1.5 hours to rebalance 6 TB of data, but that should still be reasonable, no? What am I overlooking here?
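To make my arithmetic explicit, here is the back-of-the-envelope estimate I am working from. This is only a sketch: the ~1100 MB/s aggregate recovery rate is inferred from the 30-minute rebalance above rather than measured, and the 100/50 MB/s figures are the per-disk numbers quoted earlier in this thread, not benchmarks.

    # Rough recovery-time estimates; all rates are assumptions from this thread.
    def recovery_hours(data_gb, rate_mb_s):
        """Hours needed to move data_gb gigabytes at rate_mb_s MB/s."""
        return data_gb * 1000 / rate_mb_s / 3600

    # ~2000 GB rebalanced in ~0.5 h implies ~1100 MB/s aggregate recovery rate
    observed_rate = 2000 * 1000 / (0.5 * 3600)

    print(recovery_hours(6000, observed_rate))  # ~1.5 h for a full 6 TB OSD at our observed rate
    print(recovery_hours(4800, 100))            # ~13.3 h at 100 MB/s (single-disk figure)
    print(recovery_hours(4800, 50))             # ~26.7 h at the more pessimistic 50 MB/s

If the aggregate cluster-wide recovery rate (many OSDs reading and writing in parallel) is the right number to use, rather than a single disk's 100 MB/s, that seems to explain the difference between our observation and the 13+ hour estimate.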
From our perspective, nodes with 9 * 8 TB drives should provide a better recovery time than the current 24 * 4 TB ones if a whole node goes down, provided the rebalance is shared among several hundred OSDs.

Thanks for any additional input.

Mattia

>
> Wido
>
>> Christian
>> [snip]

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com