Re: New hardware for OSDs

Nick Fisk <nick@xxxxxxxxxx> · Mon, 27 Mar 2017 16:09:09 +0100

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> Wido den Hollander
> Sent: 27 March 2017 12:35
> To: ceph-users@xxxxxxxxxxxxxx; Christian Balzer <chibi@xxxxxxx>
> Subject: Re:  New hardware for OSDs
> 
> 
> > On 27 March 2017 at 13:22, Christian Balzer <chibi@xxxxxxx> wrote:
> >
> >
> >
> > Hello,
> >
> > On Mon, 27 Mar 2017 12:27:40 +0200 Mattia Belluco wrote:
> >
> > > Hello all,
> > > we are currently in the process of buying new hardware to expand an
> > > existing Ceph cluster that already has 1200 osds.
> >
> > That's quite sizable, is the expansion driven by the need for more
> > space (big data?) or to increase IOPS (or both)?
> >
> > > We are currently using 24 * 4 TB SAS drives per OSD host with an SSD
> > > journal shared among 4 OSDs. For the upcoming expansion we were
> > > thinking of switching to either 6 or 8 TB hard drives (9 or 12 per
> > > host) in order to drive down space and cost requirements.
> > >
> > > Has anyone any experience in mid-sized/large-sized deployment using
> > > such hard drives? Our main concern is the rebalance time but we
> > > might be overlooking some other aspects.
> > >
> >
> > If you researched the ML archives, you should already know to stay
> > well away from SMR HDDs.
> >
> 
> Amen! Just don't. Stay away from SMR with Ceph.
> 
> > Both HGST and Seagate have large Enterprise HDDs that have
> > journals/caches (MediaCache in HGST speak IIRC) that drastically
> > improve write IOPS compared to plain HDDs.
> > Even with SSD journals you will want to consider those, as these new
> > HDDs will see at least twice as much action as your current ones.
> >

I've got a mixture of WD Red Pro 6TB and HGST He8 8TB drives. Recovery for
~70% full disks takes around 3-4 hours; this is for a cluster containing 60
OSDs. I'm usually seeing recovery speeds of around 1GB/s or more.

It depends on your workload. Mine is archiving/backups, so big disks are a
must. I wouldn't recommend using them for more active workloads unless you
are planning a beefy cache tier or some other sort of caching solution.

The He8 (and He10) drives also use a fair bit less power due to less
friction, but I think this only applies to the SATA model. My 12x3.5" 8TB
node, including CPU etc., uses ~140W at idle. I'm hoping to get this down
further with a new Xeon-D design in the next expansion phase.

The only thing I will say about big disks is beware of cold FS
inodes/dentries and PG splitting. The former isn't a problem if you will
only be actively accessing a small portion of your data, but I see increases
in latency if I access cold data even with VFS cache pressure set to 1.
I'm currently investigating using bcache under the OSD to try to cache these.
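
For anyone wanting to check what a node is currently running with, here is a
minimal sketch that reads the setting. It assumes a Linux host where the
sysctl is exposed at /proc/sys/vm/vfs_cache_pressure; the function name is
just illustrative.

```python
from pathlib import Path


def read_vfs_cache_pressure(path: str = "/proc/sys/vm/vfs_cache_pressure") -> int:
    """Return the current vm.vfs_cache_pressure value as an integer.

    The default path is an assumption (standard Linux procfs location).
    Lower values make the kernel prefer keeping dentry/inode caches.
    """
    return int(Path(path).read_text().strip())


if __name__ == "__main__":
    try:
        print("vm.vfs_cache_pressure =", read_vfs_cache_pressure())
    except (OSError, ValueError) as exc:
        # Not Linux, or procfs not readable in this environment.
        print("could not read vfs_cache_pressure:", exc)
```

Changing the value is a sysctl/root operation and is deliberately left out
of the sketch.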

PG splitting becomes a problem when the disks start to fill up. Playing with
the split/merge thresholds may help, but you have to be careful you don't
end up with massive splits when they do finally happen, as otherwise OSDs
start timing out.
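
As a rough illustration of where the split point sits: as far as I recall
from the filestore config reference, a subdirectory is split once it holds
more than filestore_split_multiple * abs(filestore_merge_threshold) * 16
objects. The sketch below just evaluates that formula. The defaults of 2 and
10 are from memory and the function name is made up for the example, so
check both against your release before relying on the numbers.

```python
def filestore_split_point(split_multiple: int = 2, merge_threshold: int = 10) -> int:
    """Objects per subdirectory at which filestore splits the directory.

    Assumed formula: split_multiple * abs(merge_threshold) * 16.
    Default arguments are assumed filestore defaults, not verified here.
    """
    return split_multiple * abs(merge_threshold) * 16


# Assumed defaults versus a raised, purely illustrative setting.
print(filestore_split_point())       # 320
print(filestore_split_point(8, 40))  # 5120
```

Raising the thresholds postpones the split, but each split then has to move
far more objects at once, which is the "massive splits" risk mentioned above.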

> 
> I also have good experiences with bcache on NVMe devices in Ceph clusters,
> with a single Intel P3600/P3700 as the caching device for bcache.
> 
> > Rebalance time is a concern of course, especially if your cluster like
> > most HDD based ones has these things throttled down to not impede
> > actual client I/O.
> >
> > To get a rough idea, take a look at:
> > https://www.memset.com/tools/raid-calculator/
> >
> > For Ceph with replication 3 and the typical PG distribution, assume
> > 100 disks and the RAID6 with hotspares numbers are relevant.
> > For rebuild speed, consult your experience, you must have had a few
> > failures. ^o^
> >
> > For example with a recovery speed of 100MB/s, a 1TB disk (used data
> > with Ceph actually) looks decent at 1:16000 DLO/y.
> > At 5TB though it enters scary land
> >
> 
> Yes, those recoveries will take a long time. Let's say your 6TB drive is
> 80% full; you then need to rebalance 4.8TB.
> 
> 4.8TB / 100MB/sec = 13 hours rebuild time
> 
> 13 hours is a long time. And you will probably not have 100MB/sec
> sustained, I think that 50MB/sec is much more realistic.

Are we talking backfill or recovery here? Recovery will go at the combined
speed of all the disks in the cluster. If the OP's cluster is already at
1200 OSDs, a single failed disk leaves only a tiny share for each OSD to
recover. Yes, backfill will probably crawl along at 50MB/s, but is this a
problem?
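
To put numbers on both views, here is a back-of-the-envelope sketch. It uses
decimal units (1TB = 1,000,000MB), takes the 6TB/80% example from the quoted
mail, and assumes the lost data is spread perfectly evenly over the surviving
OSDs, which is an idealisation. The function names are only for illustration.

```python
def hours_to_copy(data_tb: float, rate_mb_s: float) -> float:
    """Hours needed to move data_tb terabytes at rate_mb_s MB/s (decimal units)."""
    seconds = data_tb * 1_000_000 / rate_mb_s
    return seconds / 3600


def per_osd_share_gb(data_tb: float, surviving_osds: int) -> float:
    """GB each surviving OSD handles if the data is spread evenly (idealised)."""
    return data_tb * 1000 / surviving_osds


used_tb = 6 * 0.8  # 6TB drive, 80% full, as in the quoted example

# Single-stream view: everything funnelled through one disk.
for rate in (100, 50):
    print(f"{rate} MB/s: {hours_to_copy(used_tb, rate):.1f} h")  # 13.3 h, then 26.7 h

# Cluster-wide view: 1200 OSDs minus the failed one.
print(f"{per_osd_share_gb(used_tb, 1199):.1f} GB per OSD")  # 4.0 GB per OSD
```

So the 13 and >24 hour figures hold when one disk is the bottleneck, while
the evenly spread case leaves each OSD with only a few GB to deal with.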

> 
> That means that rebuilding after a single disk failure will take >24
> hours.
> 
> I don't like very big disks that much. Not in RAID, not in Ceph.
> 
> Wido
> 
> > Christian
> >
> > > We currently use the cluster as storage for openstack services:
> > > Glance, Cinder and VMs' ephemeral disks.
> > >
> > > Thanks in advance for any advice.
> > >
> > > Mattia
> > > _______________________________________________
> > > ceph-users mailing list
> > > ceph-users@xxxxxxxxxxxxxx
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
> >
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
