Re: New hardware for OSDs

Christian Balzer <chibi@xxxxxxx> · Tue, 28 Mar 2017 08:58:50 +0900

Hello,

On Mon, 27 Mar 2017 16:09:09 +0100 Nick Fisk wrote:

> > -----Original Message-----
> > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> > Wido den Hollander
> > Sent: 27 March 2017 12:35
> > To: ceph-users@xxxxxxxxxxxxxx; Christian Balzer <chibi@xxxxxxx>
> > Subject: Re:  New hardware for OSDs
> > 
> >   
> > > On 27 March 2017 at 13:22, Christian Balzer <chibi@xxxxxxx> wrote:
> > >
> > >
> > >
> > > Hello,
> > >
> > > On Mon, 27 Mar 2017 12:27:40 +0200 Mattia Belluco wrote:
> > >  
> > > > Hello all,
> > > > we are currently in the process of buying new hardware to expand an
> > > > existing Ceph cluster that already has 1200 osds.  
> > >
> > > That's quite sizable, is the expansion driven by the need for more
> > > space (big data?) or to increase IOPS (or both)?
> > >  
> > > > > We are currently using 24 * 4 TB SAS drives per OSD host with an SSD
> > > > journal shared among 4 osds. For the upcoming expansion we were
> > > > thinking of switching to either 6 or 8 TB hard drives (9 or 12 per
> > > > host) in order to drive down space and cost requirements.
> > > >
> > > > Has anyone any experience in mid-sized/large-sized deployment using
> > > > such hard drives? Our main concern is the rebalance time but we
> > > > might be overlooking some other aspects.
> > > >  
> > >
> > > If you researched the ML archives, you should already know to stay
> > > well away from SMR HDDs.
> > >  
> > 
> > Amen! Just don't. Stay away from SMR with Ceph.
> >   
> > > Both HGST and Seagate have large Enterprise HDDs that have
> > > journals/caches (MediaCache in HGST speak IIRC) that drastically
> > > improve write IOPS compared to plain HDDs.
> > > Even with SSD journals you will want to consider those, as these new
> > > HDDs will see at least twice the action of your current ones.
> > >  
> 
> I've got a mixture of WD Red Pro 6TB and HGST He8 8TB drives. Recovery for
> ~70% full disks takes around 3-4 hours; this is for a cluster containing 60
> OSDs. I'm usually seeing recovery speeds up around 1GB/s or more.
> 
Good data point.

How busy is your cluster at those times, and what is the impact on client I/O?

> Depends on your workload, mine is for archiving/backups so big disks are a
> must. I wouldn't recommend using them for more active workloads unless you
> are planning a beefy cache tier or some other sort of caching solution.
> 
> The He8 (and He10) drives also use a fair bit less power due to less
> friction, but I think this only applies to the sata model. My 12x3.5 8TB
> node with CPU...etc uses ~140W at idle. Hoping to get this down further with
> a new Xeon-D design on next expansion phase.
> 
> The only thing I will say about big disks is beware of cold FS
> inodes/dentries and PG splitting. The former isn't a problem if you will
> only be actively accessing a small portion of your data, but I see increases
> in latency if I access cold data even with VFS cache pressure set to 1.
> Currently investigating using bcache under the OSD to try and cache this.
> 

I've seen this kind of behavior on my (non-Ceph) mailbox servers. 
As in, the maximum SLAB space may not be large enough to hold all inodes
or the pagecache will eat into it over time when not constantly
referenced, despite cache pressure settings.
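
To check whether that is what is happening on a given box, a minimal sketch
(Python, names are mine) that compares reclaimable slab, which is where
dentries and inodes live, against the pagecache as reported in /proc/meminfo.
The ratio is only a rough indicator I made up for illustration, not an
established metric:

```python
def parse_meminfo(text):
    """Return {field: value} for a /proc/meminfo style dump (values mostly in kB)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only well-formed "Name:   12345 kB" style lines.
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def slab_vs_pagecache(fields):
    """Ratio of reclaimable slab (dentries, inodes) to pagecache."""
    cached = fields.get("Cached", 0)
    if not cached:
        return float("inf")
    return fields.get("SReclaimable", 0) / cached


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the live file (Linux only)."""
    with open(path) as fh:
        return parse_meminfo(fh.read())
```

Calling slab_vs_pagecache(read_meminfo()) periodically on an OSD node and
watching the ratio shrink over days would be one hint that the pagecache is
pushing the inode/dentry cache out.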

> PG splitting becomes a problem when the disks start to fill up. Playing with
> the split/merge thresholds may help, but you have to be careful you don't
> end up with massive splits when they do finally happen, as otherwise OSDs
> start timing out.
> 
Getting this right (and predictable) is one of the darker arts with Ceph.
OTOH it will go away with Bluestore (just to be replaced by other oddities
no doubt).
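
For reference, a rough sketch of the filestore split arithmetic in Python.
The documented rule is that a subdirectory splits once it holds more than
filestore_split_multiple * abs(filestore_merge_threshold) * 16 files. The
defaults of 2 and 10 used below are what I believe applies, so check your own
config. The per-PG estimate additionally assumes objects hash evenly across
directories and that each split fans out into 16 children:

```python
def split_threshold(split_multiple=2, merge_threshold=10):
    """Files a leaf directory may hold before filestore splits it."""
    return split_multiple * abs(merge_threshold) * 16


def objects_per_pg_at_split(level, split_multiple=2, merge_threshold=10):
    """Approximate objects in one PG when split round `level` (1-based) starts.

    Assumption: after (level - 1) rounds a PG has 16 ** (level - 1) leaf
    directories, all filling at the same rate.
    """
    return split_threshold(split_multiple, merge_threshold) * 16 ** (level - 1)


for level in (1, 2, 3):
    print("split round", level, "at about",
          objects_per_pg_at_split(level), "objects per PG")
```

Since all PGs in a pool fill at roughly the same rate, they tend to cross
these points together, which is why the splits arrive in bursts.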

> > 
> > I also have good experiences with bcache on NVM-E device in Ceph clusters.
> > A single Intel P3600/P3700 which is the caching device for bcache.
> >   
> > > Rebalance time is a concern of course, especially if your cluster like
> > > most HDD based ones has these things throttled down to not impede
> > > actual client I/O.
> > >
> > > To get a rough idea, take a look at:
> > > https://www.memset.com/tools/raid-calculator/
> > >
> > > For Ceph with replication 3 and the typical PG distribution, assume
> > > 100 disks and the RAID6 with hotspares numbers are relevant.
> > > For rebuild speed, consult your experience, you must have had a few
> > > failures. ^o^
> > >
> > > For example with a recovery speed of 100MB/s, a 1TB disk (used data
> > > with Ceph actually) looks decent at 1:16000 DLO/y.
> > > At 5TB though it enters scary land
> > >  
> > 
> > Yes, those recoveries will take a long time. Let's say your 6TB drive is
> > 80% full: you need to rebalance 4.8TB
> > 
> > 4.8TB / 100MB/sec = 13 hours rebuild time
> > 
> > 13 hours is a long time. And you will probably not have 100MB/sec
> > sustained, I think that 50MB/sec is much more realistic.  
> 
> Are we talking backfill or recovery here? Recovery will go at the combined
> speed of all the disks in the cluster. If the OP's cluster is already at
> 1200 OSDs, a single disk will be a tiny percentage per OSD to recover. But
> yes, backfill will probably crawl along at 50MB/s, but is this a problem?
> 

All disks?
I picked the 100 disks up there based on typical/recommended PG loads of
OSDs. 
So the data of a failed disk will have at most about 100 sources to
work with during recovery.
I'd expect it to be fast in such a large cluster, too.
But how fast depends on lots of variables, the load of the cluster being a
major one.
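
To put numbers on this, Wido's back-of-the-envelope calculation quoted further
down (6TB drive, 80% full, 100 or 50MB/s) as a small Python sketch. The
function name is mine, it uses decimal units, and the rate is whatever
sustained aggregate speed you actually observe, so treat the result as a rough
estimate only:

```python
def rebuild_hours(drive_tb, fill_fraction, rate_mb_s):
    """Hours needed to move the used data of one drive at a sustained rate."""
    data_mb = drive_tb * fill_fraction * 1000000  # decimal TB -> MB
    return data_mb / rate_mb_s / 3600


for rate in (100, 50):
    print("6TB at 80%% full, %d MB/s: %.1f hours"
          % (rate, rebuild_hours(6, 0.8, rate)))
```

Plugging in a 1GB/s aggregate rate instead shows why recovery in a large
cluster can look so different from a throttled single-target backfill.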

Regarding backfill, I've frequently seen that at some point during
backfills all of a sudden a tiny number of objects become degraded (usually
a fraction of a %).
And these often linger around for a looong time before being recovered. 

Not the best example, but one I still had the terminal window open for:
---
    cluster bfefde1c-8abf-47d6-816a-3c97f12b5aeb
     health HEALTH_WARN
            422 pgs backfill
            7 pgs backfilling
            429 pgs stuck unclean
            recovery 172/3691691 objects degraded (0.005%)
            recovery 1546807/3691691 objects misplaced (41.900%)
     monmap e2: 4 mons at {ceph-01=10.0.8.21:6789/0,ceph-02=10.0.8.22:6789/0,ceph-03=10.0.8.23:6789/0,ceph-04=10.0.8.24:6789/0}
            election epoch 210, quorum 0,1,2,3 ceph-01,ceph-02,ceph-03,ceph-04
     osdmap e16080: 32 osds: 32 up, 32 in; 429 remapped pgs
      pgmap v50982378: 1024 pgs, 1 pools, 3779 GB data, 949 kobjects
            11290 GB used, 66707 GB / 82110 GB avail
            172/3691691 objects degraded (0.005%)
            1546807/3691691 objects misplaced (41.900%)
                 595 active+clean
                 422 active+remapped+wait_backfill
                   7 active+remapped+backfilling
recovery io 260 MB/s, 65 objects/s
---
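
A quick Python sketch to sanity-check the figures in that output. The parsing
helper is my own and only handles the "N/M objects <state>" lines. The time
estimate naively assumes the 65 objects/s shown would hold steady and that the
misplaced count maps one to one onto objects to move, neither of which is
guaranteed:

```python
import re


def parse_ratio(line):
    """Return (state, count, total, percent) for an 'N/M objects state' line."""
    m = re.search(r"(\d+)/(\d+) objects (\w+)", line)
    if m is None:
        return None
    count, total = int(m.group(1)), int(m.group(2))
    return m.group(3), count, total, 100.0 * count / total


def hours_remaining(objects, objects_per_s):
    """Naive time to drain a backlog at a constant object rate."""
    return objects / objects_per_s / 3600


status = """\
            172/3691691 objects degraded (0.005%)
            1546807/3691691 objects misplaced (41.900%)
"""

for line in status.splitlines():
    state, count, total, pct = parse_ratio(line)
    print("%s: %d of %d (%.3f%%)" % (state, count, total, pct))

print("misplaced backlog at 65 objects/s: %.1f hours"
      % hours_remaining(1546807, 65))
```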

Ultimately I'm not happy with a cluster that's not 100% healthy for a
prolonged time.

Christian
> > 
> > That means that a single disk failure will take >24 hours to recover from
> > a rebuild.
> > 
> > I don't like very big disks that much. Not in RAID, not in Ceph.
> > 
> > Wido
> >   
> > > Christian
> > >  
> > > > We currently use the cluster as storage for openstack services:
> > > > Glance, Cinder and VMs' ephemeral disks.
> > > >
> > > > Thanks in advance for any advice.
> > > >
> > > > Mattia
> > > > _______________________________________________
> > > > ceph-users mailing list
> > > > ceph-users@xxxxxxxxxxxxxx
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > >  
> > >
> > >
> > > --
> > > Christian Balzer        Network/Systems Engineer
> > > chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
> > > http://www.gol.com/
> 
> 

-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/