Re: New hardware for OSDs

Nick Fisk <nick@xxxxxxxxxx> · Tue, 28 Mar 2017 20:43:20 +0100

Hi Christian,

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> Christian Balzer
> Sent: 28 March 2017 00:59
> To: ceph-users@xxxxxxxxxxxxxx
> Cc: Nick Fisk <nick@xxxxxxxxxx>
> Subject: Re:  New hardware for OSDs
> 
> 
> Hello,
> 
> On Mon, 27 Mar 2017 16:09:09 +0100 Nick Fisk wrote:
> 
> > > -----Original Message-----
> > > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On
> > > Behalf Of Wido den Hollander
> > > Sent: 27 March 2017 12:35
> > > To: ceph-users@xxxxxxxxxxxxxx; Christian Balzer <chibi@xxxxxxx>
> > > Subject: Re:  New hardware for OSDs
> > >
> > >
> > > > Op 27 maart 2017 om 13:22 schreef Christian Balzer <chibi@xxxxxxx>:
> > > >
> > > >
> > > >
> > > > Hello,
> > > >
> > > > On Mon, 27 Mar 2017 12:27:40 +0200 Mattia Belluco wrote:
> > > >
> > > > > Hello all,
> > > > > we are currently in the process of buying new hardware to expand
> > > > > an existing Ceph cluster that already has 1200 osds.
> > > >
> > > > That's quite sizable, is the expansion driven by the need for more
> > > > space (big data?) or to increase IOPS (or both)?
> > > >
> > > > > We are currently using 24 * 4 TB SAS drives per osd with an SSD
> > > > > journal shared among 4 osds. For the upcoming expansion we were
> > > > > > thinking of switching to either 6 or 8 TB hard drives (9 or 12 per
> > > > > > host) in order to drive down space and cost requirements.
> > > > >
> > > > > Has anyone any experience in mid-sized/large-sized deployment
> > > > > using such hard drives? Our main concern is the rebalance time
> > > > > but we might be overlooking some other aspects.
> > > > >
> > > >
> > > > If you researched the ML archives, you should already know to stay
> > > > well away from SMR HDDs.
> > > >
> > >
> > > Amen! Just don't. Stay away from SMR with Ceph.
> > >
> > > > Both HGST and Seagate have large Enterprise HDDs that have
> > > > journals/caches (MediaCache in HGST speak IIRC) that drastically
> > > > improve write IOPS compared to plain HDDs.
> > > > Even with SSD journals you will want to consider those, as these
> > > > new HDDs will see at least twice the action of your current ones.
> > > >
> >
> > I've got a mixture of WD Red Pro 6TB and HGST He8 8TB drives. Recovery
> > for ~70% full disks takes around 3-4 hours; this is for a cluster
> > containing 60 OSDs. I'm usually seeing recovery speeds of around 1GB/s
> > or more.
> >
> Good data point.
> 
> How busy is your cluster at those times, client I/O impact?

It's normally around 20-30% busy through most of the day. No real
impact on client IO. It's backup data, so buffered IO coming in via a WAN
circuit.

> 
> > Depends on your workload, mine is for archiving/backups so big disks
> > are a must. I wouldn't recommend using them for more active workloads
> > unless you are planning a beefy cache tier or some other sort of caching
> > solution.
> >
> > The He8 (and He10) drives also use a fair bit less power due to less
> > friction, but I think this only applies to the sata model. My 12x3.5
> > 8TB node with CPU...etc uses ~140W at idle. Hoping to get this down
> > further with a new Xeon-D design on next expansion phase.
> >
> > The only thing I will say about big disks is beware of cold FS
> > inodes/dentries and PG splitting. The former isn't a problem if you
> > will only be actively accessing a small portion of your data, but I
> > see increases in latency if I access cold data even with VFS cache
> > pressure set to 1.
> > Currently investigating using bcache under the OSD to try and cache
> > this.
> >
> 
> I've seen this kind of behavior on my (non-Ceph) mailbox servers.
> As in, the maximum SLAB space may not be large enough to hold all inodes
> or the pagecache will eat into it over time when not constantly
> referenced, despite cache pressure settings.
> 
> > PG splitting becomes a problem when the disks start to fill up.
> > Playing with the split/merge thresholds may help, but you have to be
> > careful you don't end up with massive splits when they do finally
> > happen, as otherwise OSDs start timing out.
> >
> Getting this right (and predictable) is one of the darker arts with Ceph.
> OTOH it will go away with Bluestore (just to be replaced by other oddities
> no doubt).
> 
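For anyone tuning the thresholds, here is a rough sketch of the arithmetic.
It assumes filestore's documented rule that a subdirectory is split once it
holds more than filestore_split_multiple * abs(filestore_merge_threshold) * 16
files. The defaults of 2 and 10 used below are my assumption; check what your
own OSDs are actually running with. The function name is made up for
illustration.

```python
def split_threshold(split_multiple: int = 2, merge_threshold: int = 10) -> int:
    """Files a filestore subdirectory may hold before it is split.

    Follows the documented rule:
        filestore_split_multiple * abs(filestore_merge_threshold) * 16
    The defaults (2 and 10) are assumed, not read from a running cluster.
    abs() matters because a negative merge threshold disables merging
    while leaving the split point unchanged.
    """
    return split_multiple * abs(merge_threshold) * 16


if __name__ == "__main__":
    # Compare the assumed defaults with a more aggressive setting.
    for multiple, merge in ((2, 10), (8, 40)):
        print(f"split_multiple={multiple} merge_threshold={merge} "
              f"-> split at {split_threshold(multiple, merge)} files/dir")
```

Raising the values only pushes the split further out, so the splits are rarer
but each one moves more files, which is exactly the "massive splits" risk
mentioned above.
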
> > >
> > > I also have good experiences with bcache on an NVMe device in Ceph
> > > clusters: a single Intel P3600/P3700 as the caching device for bcache.
> > >
> > > > Rebalance time is a concern of course, especially if your cluster,
> > > > like most HDD-based ones, has these things throttled down to not
> > > > impede actual client I/O.
> > > >
> > > > To get a rough idea, take a look at:
> > > > https://www.memset.com/tools/raid-calculator/
> > > >
> > > > For Ceph with replication 3 and the typical PG distribution, assume
> > > > 100 disks and the RAID6 with hotspares numbers are relevant.
> > > > For rebuild speed, consult your experience, you must have had a
> > > > few failures. ^o^
> > > >
> > > > For example with a recovery speed of 100MB/s, a 1TB disk (used
> > > > data with Ceph actually) looks decent at 1:16000 DLO/y.
> > > > At 5TB though it enters scary land
> > > >
> > >
> > > Yes, those recoveries will take a long time. Let's say your 6TB
> > > drive is 80% full: you need to rebalance 4.8TB.
> > >
> > > 4.8TB / 100MB/sec = 13 hours rebuild time
> > >
> > > 13 hours is a long time. And you will probably not have 100MB/sec
> > > sustained, I think that 50MB/sec is much more realistic.
> >
> > Are we talking backfill or recovery here? Recovery will go at the
> > combined speed of all the disks in the cluster. If the OP's cluster is
> > already at 1200 OSDs, a single disk will be a tiny percentage per OSD
> > to recover. But yes, backfill will probably crawl along at 50MB/s, but
> > is this a problem?
> >
> 
> All disks?
> I picked the 100 disks up there based on typical/recommended PG loads of
> OSDs.
> So the data of a failed disk will have at most about 100 sources to
> work with during recovery.
> I'd expect it to be fast in such a large cluster, too.
> But how fast? Lots of variables, and the load of the cluster is a major one.

Yes sorry, you're right, this will depend on the number of PGs per OSD.
Although I guess this lends further weight to what I have discussed before
about the PG target being better expressed as a PG to TB ratio, which also
helps with PG splitting.
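
To put rough numbers on the rates mentioned in this thread, here is a small
sketch of Wido's 4.8TB / 100MB/sec arithmetic. It assumes decimal units
(1 TB = 1,000,000 MB) and a constant sustained rate, which a real cluster
will not give you. The function name is made up for illustration.

```python
def rebuild_hours(used_tb: float, rate_mb_s: float) -> float:
    """Hours needed to move used_tb terabytes at rate_mb_s megabytes/second.

    Assumes decimal units (1 TB = 1,000,000 MB) and a constant rate.
    """
    seconds = used_tb * 1_000_000 / rate_mb_s
    return seconds / 3600


if __name__ == "__main__":
    used_tb = 6 * 0.80  # a 6TB drive that is 80% full
    # 50 and 100 MB/s are the per-disk figures discussed in the thread;
    # 1000 MB/s is the aggregate recovery speed I see on my own cluster.
    for rate in (50, 100, 1000):
        print(f"{rate:>5} MB/s -> {rebuild_hours(used_tb, rate):.1f} h")
```

At 50MB/s this comes out at just under 27 hours, which is where the ">24
hours" figure comes from, while at 1GB/s aggregate it is well under two hours.
The real number depends on how many source OSDs take part and how busy they
are.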

> 
> Regarding backfill, I've seen frequently that at some point during
> backfills all of a sudden a tiny amount of objects become degraded
> (usually a fraction of a %).
> And these often linger around for a looong time before being recovered.

Yes, I've often wondered if these are actual degraded objects that have
somehow become degraded during the backfill process, or if this is some simple
accounting error that is cleared when the PG goes completely active on the
new OSD. Probably a task for a slow day to look into the state of the
objects and see if they are actually degraded or not.
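
As a starting point for that slow day, here is a minimal sketch that pulls the
degraded and misplaced counts out of status text, so they can be logged over
the course of a backfill. It assumes the "N/M objects degraded" line format
seen in the status output Christian quotes below. It only tracks the counters
and says nothing about whether the objects are really degraded. The names are
made up for illustration.

```python
import re

# Matches e.g. "172/3691691 objects degraded" (format assumed from the
# status output quoted in this thread).
LINE_RE = re.compile(r"(\d+)/(\d+) objects (degraded|misplaced)")


def object_counts(status_text: str) -> dict:
    """Map 'degraded'/'misplaced' to (count, total) found in status_text."""
    counts = {}
    for match in LINE_RE.finditer(status_text):
        count, total, kind = match.groups()
        counts[kind] = (int(count), int(total))
    return counts


if __name__ == "__main__":
    sample = (
        "recovery 172/3691691 objects degraded (0.005%)\n"
        "recovery 1546807/3691691 objects misplaced (41.900%)\n"
    )
    for kind, (count, total) in object_counts(sample).items():
        print(f"{kind}: {count} of {total} ({100 * count / total:.3f}%)")
```

Sampling this every few minutes would at least show whether the degraded
count shrinks steadily or only drops to zero when a PG finishes backfilling.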

> 
> Not the best example, but one I still had the terminal window open for:
> ---
>     cluster bfefde1c-8abf-47d6-816a-3c97f12b5aeb
>      health HEALTH_WARN
>             422 pgs backfill
>             7 pgs backfilling
>             429 pgs stuck unclean
>             recovery 172/3691691 objects degraded (0.005%)
>             recovery 1546807/3691691 objects misplaced (41.900%)
>      monmap e2: 4 mons at {ceph-01=10.0.8.21:6789/0,ceph-02=10.0.8.22:6789/0,ceph-03=10.0.8.23:6789/0,ceph-04=10.0.8.24:6789/0}
>             election epoch 210, quorum 0,1,2,3 ceph-01,ceph-02,ceph-03,ceph-04
>      osdmap e16080: 32 osds: 32 up, 32 in; 429 remapped pgs
>       pgmap v50982378: 1024 pgs, 1 pools, 3779 GB data, 949 kobjects
>             11290 GB used, 66707 GB / 82110 GB avail
>             172/3691691 objects degraded (0.005%)
>             1546807/3691691 objects misplaced (41.900%)
>                  595 active+clean
>                  422 active+remapped+wait_backfill
>                    7 active+remapped+backfilling
>   recovery io 260 MB/s, 65 objects/s
> ---
> 
> Ultimately I'm not happy with a cluster that's not 100% healthy for
> a prolonged time.
> 
> Christian
> > >
> > > That means that a single disk failure will take >24 hours to recover
> > > from a rebuild.
> > >
> > > I don't like very big disks that much. Not in RAID, not in Ceph.
> > >
> > > Wido
> > >
> > > > Christian
> > > >
> > > > > We currently use the cluster as storage for openstack services:
> > > > > Glance, Cinder and VMs' ephemeral disks.
> > > > >
> > > > > Thanks in advance for any advice.
> > > > >
> > > > > Mattia
> > > > > _______________________________________________
> > > > > ceph-users mailing list
> > > > > ceph-users@xxxxxxxxxxxxxx
> > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > > >
> > > >
> > > >
> > > > --
> > > > Christian Balzer        Network/Systems Engineer
> > > > chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
> > > > http://www.gol.com/
> >
> >
> 
> 
