Hello,

On Mon, 27 Mar 2017 17:48:38 +0200 Mattia Belluco wrote:

> I mistakenly answered to Wido instead of the whole mailing list (weird
> ML settings, I suppose).
>
> Here is my message:
>
>
> Thanks for replying so quickly. I commented inline.
>
> On 03/27/2017 01:34 PM, Wido den Hollander wrote:
> >
> >> On 27 March 2017 at 13:22, Christian Balzer <chibi@xxxxxxx> wrote:
> >>
> >>
> >> Hello,
> >>
> >> On Mon, 27 Mar 2017 12:27:40 +0200 Mattia Belluco wrote:
> >>
> >>> Hello all,
> >>> we are currently in the process of buying new hardware to expand an
> >>> existing Ceph cluster that already has 1200 OSDs.
> >>
> >> That's quite sizable; is the expansion driven by the need for more
> >> space (big data?) or to increase IOPS (or both)?
> >>
> >>> We are currently using 24 * 4 TB SAS drives per host with an SSD
> >>> journal shared among 4 OSDs. For the upcoming expansion we were
> >>> thinking of switching to either 6 or 8 TB hard drives (9 or 12 per
> >>> host) in order to drive down space and cost requirements.
> >>>
> >>> Does anyone have any experience with mid-/large-sized deployments
> >>> using such hard drives? Our main concern is the rebalance time, but
> >>> we might be overlooking some other aspects.
> >>
> >> If you have researched the ML archives, you should already know to
> >> stay well away from SMR HDDs.
> >
> > Amen! Just don't. Stay away from SMR with Ceph.
>
> We were planning on using regular enterprise disks. No SMR :)
> We are a bit puzzled about the possible performance gain of the
> 4K-native ones, but that's about it.

AFAIK Linux will do the right thing [TM] even with 512e drives (4K
physical sectors, 512B emulation).

> >> Both HGST and Seagate have large enterprise HDDs with journals/caches
> >> (MediaCache in HGST speak, IIRC) that drastically improve write IOPS
> >> compared to plain HDDs.
> >> Even with SSD journals you will want to consider those, as these new
> >> HDDs will see at least twice the action of your current ones.
> >
> > I also have good experiences with bcache on an NVMe device in Ceph
> > clusters: a single Intel P3600/P3700 as the caching device for bcache.
>
> No experience with those, but I am a bit skeptical about including new
> solutions in the current cluster, as the current setup seems to work
> quite well (no IOPS problems).
> Those could be a nice solution for a new cluster, though.

I have no experience (or no current experience, at least) with those
either, and a new cluster (as in late this year or early next year) would
likely be Bluestore-based and thus have different needs, tuning knobs,
etc.

> >> Rebalance time is a concern of course, especially if your cluster,
> >> like most HDD-based ones, has these things throttled down so as not
> >> to impede actual client I/O.
> >>
> >> To get a rough idea, take a look at:
> >> https://www.memset.com/tools/raid-calculator/
> >>
> >> For Ceph with replication 3 and the typical PG distribution, assume
> >> 100 disks; the RAID6-with-hotspares numbers are the relevant ones.
> >> For rebuild speed, consult your own experience; you must have had a
> >> few failures. ^o^
> >>
> >> For example, with a recovery speed of 100MB/s a 1TB disk (used data
> >> with Ceph, actually) looks decent at 1:16000 DLO/y.
> >> At 5TB, though, it enters scary land.
> >
> > Yes, those recoveries will take a long time. Let's say your 6TB drive
> > is 80% full: you need to rebalance 4.8TB.
> >
> > 4.8TB / 100MB/sec = ~13 hours rebuild time
> >
> > 13 hours is a long time. And you will probably not have 100MB/sec
> > sustained; I think that 50MB/sec is much more realistic.
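Indeed. To make that math easy to play with, here is a back-of-the-envelope
sketch in Python (the disk size, fill level and recovery speeds are of
course assumptions, plug in your own numbers):

def rebuild_hours(disk_tb, fill_ratio, recovery_mb_s):
    """Hours needed to re-replicate the used data of one failed disk."""
    used_mb = disk_tb * 1e6 * fill_ratio      # TB -> MB, decimal units
    return used_mb / recovery_mb_s / 3600.0   # MB / (MB/s) = seconds -> hours

for speed in (100, 50):  # MB/s: optimistic vs. more realistic sustained rate
    print("6TB disk, 80%% full, %3d MB/s: %4.1f hours"
          % (speed, rebuild_hours(6.0, 0.8, speed)))

# 6TB disk, 80% full, 100 MB/s: 13.3 hours
# 6TB disk, 80% full,  50 MB/s: 26.7 hours

Which matches the >24 hours below.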
> > That means that a single disk failure will take more than 24 hours to
> > recover from.
> >
> > I don't like very big disks that much. Not in RAID, not in Ceph.

> I don't think I am following the calculations. Maybe I need to provide
> a few more details on our current network configuration:
> each host (24 disks/osds) has 4 * 10 Gbit interfaces, 2 for client I/O
> and 2 for the recovery network.
> Rebalancing an OSD that was 50% full (2000GB) with the current setup
> took a little less than 30 mins. It would still take 1.5 hours to
> rebalance 6 TB of data, but that should still be reasonable, no?
> What am I overlooking here?

We're playing devil's advocate here, not knowing your configuration.
And most of all, if your cluster is busy or busier than usual, those
times will go up.

Your numbers suggest a recovery speed of around 1GB/s, which is very nice
and something I'd expect (hope) to see from such a large cluster.
Plugging that into the calculator above with 5TB gives us a 1:6500 DLO/y;
not utterly frightening, but also quite a bit worse than your current
example with 2TB at 1:40000.

> From our perspective, having 9 * 8TB nodes should provide a better
> recovery time than the current 24 * 4TB ones if a whole node goes down,
> provided the rebalance is shared among several hundred OSDs.

You'll have 25% less data per node (9 * 8TB = 72TB vs. 24 * 4TB = 96TB),
but also 62.5% fewer OSDs per node (9 vs. 24). If your whole cluster
consisted of such 9-OSD nodes, it would recover more slowly than it does
now, I'd presume.

Also, for most people it makes sense to set
mon_osd_down_out_subtree_limit to "host" on a well-monitored cluster (see
the P.S. below for a sketch), given that recovering the node can often be
faster (and of course less disruptive) than an automatic recovery and
rebalance.

Christian
-- 
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
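P.S.: For reference, a minimal ceph.conf sketch of the subtree limit
mentioned above; the option name is from memory, so double-check it
against the documentation for your release:

[mon]
# Don't automatically mark out the OSDs of a failed subtree at "host"
# level or larger; leave that decision to an operator instead.
mon_osd_down_out_subtree_limit = host

It can also be injected at runtime via "ceph tell mon.* injectargs",
IIRC.

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com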