Re: New hardware for OSDs

> On 27 March 2017 at 13:22, Christian Balzer <chibi@xxxxxxx> wrote:
> 
> 
> 
> Hello,
> 
> On Mon, 27 Mar 2017 12:27:40 +0200 Mattia Belluco wrote:
> 
> > Hello all,
> > we are currently in the process of buying new hardware to expand an
> > existing Ceph cluster that already has 1200 osds.
> 
> That's quite sizable, is the expansion driven by the need for more space
> (big data?) or to increase IOPS (or both)?
> 
> > We are currently using 24 * 4 TB SAS drives per node with an SSD journal
> > shared among 4 OSDs. For the upcoming expansion we were thinking of
> > switching to either 6 or 8 TB hard drives (9 or 12 per host) in order to
> > drive down space and cost requirements.
> > 
> > Has anyone any experience in mid-sized/large-sized deployment using such
> > hard drives? Our main concern is the rebalance time but we might be
> > overlooking some other aspects.
> > 
> 
> If you researched the ML archives, you should already know to stay well
> away from SMR HDDs. 
> 

Amen! Just don't. Stay away from SMR with Ceph.

> Both HGST and Seagate have large Enterprise HDDs that have
> journals/caches (MediaCache in HGST speak IIRC) that drastically improve
> write IOPS compared to plain HDDs.
> Even with SSD journals you will want to consider those, as these new HDDs
> will see at least twice the action of your current ones. 
> 

I also have good experience with bcache on an NVMe device in Ceph clusters: a single Intel P3600/P3700 acting as the caching device for bcache.
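
For anyone evaluating a similar setup, below is a minimal Python sketch that reports the cache mode and overall hit ratio of each bcache device from sysfs. It assumes the standard bcache sysfs layout (/sys/block/bcache*/bcache/...); attribute names can differ between kernel versions, so treat it as a starting point rather than a finished tool.

#!/usr/bin/env python3
# Sketch: report the bcache cache mode and overall hit ratio for each
# bcache device via sysfs. Assumes the standard bcache sysfs layout;
# attribute names may vary by kernel version.
import glob
import os


def read_attr(path):
    # Return a sysfs attribute as a stripped string, or None if unreadable.
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


for dev in sorted(glob.glob("/sys/block/bcache*")):
    name = os.path.basename(dev)
    mode = read_attr(os.path.join(dev, "bcache", "cache_mode"))
    hits = read_attr(os.path.join(dev, "bcache", "stats_total",
                                  "cache_hit_ratio"))
    print("{0}: cache_mode={1} cache_hit_ratio={2}%".format(name, mode, hits))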

> Rebalance time is a concern of course, especially if your cluster like
> most HDD based ones has these things throttled down to not impede actual
> client I/O.
> 
> To get a rough idea, take a look at:
> https://www.memset.com/tools/raid-calculator/
> 
> For Ceph with replication 3 and the typical PG distribution, assume 100
> disks and the RAID6 with hotspares numbers are relevant.
> For rebuild speed, consult your experience, you must have had a few
> failures. ^o^
> 
> For example with a recovery speed of 100MB/s, a 1TB disk (used data with
> Ceph actually) looks decent at 1:16000 DLO/y. 
> At 5TB, though, it enters scary land.
> 

Yes, those recoveries will take a long time. Let's say your 6TB drive is 80% full: you need to rebalance 4.8TB.

4.8TB / 100MB/sec ≈ 13 hours rebuild time

13 hours is a long time, and you will probably not have 100MB/sec sustained; I think 50MB/sec is much more realistic.

That means recovering from a single disk failure will take more than 24 hours.
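
To see how this scales with drive size and recovery speed, here is the same arithmetic as a quick Python sketch (decimal TB, recovery speed treated as the only bottleneck; the sizes and speeds are just example figures, not measurements):

#!/usr/bin/env python3
# Back-of-the-envelope rebuild time estimate. Assumes recovery speed is the
# only bottleneck; real clusters also throttle recovery to protect client I/O.


def rebuild_hours(drive_tb, fill_ratio, recovery_mb_s):
    # Hours to re-replicate the used data on one failed drive.
    used_mb = drive_tb * 1000000.0 * fill_ratio  # TB -> MB, decimal units
    return used_mb / recovery_mb_s / 3600.0


for size_tb in (4, 6, 8):
    for speed in (100, 50):
        print("{0}TB drive, 80% full, {1}MB/s: {2:.1f} hours".format(
            size_tb, speed, rebuild_hours(size_tb, 0.8, speed)))

At 50MB/sec the 8TB case comes out around 36 hours, which is exactly the kind of window I want to avoid.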

I don't like very big disks that much. Not in RAID, not in Ceph.

Wido

> Christian
> 
> > We currently use the cluster as storage for openstack services: Glance,
> > Cinder and VMs' ephemeral disks.
> > 
> > Thanks in advance for any advice.
> > 
> > Mattia
> 
> 
> -- 
> Christian Balzer        Network/Systems Engineer                
> chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


