Re: add multiple OSDs to cluster

Deploying or removing OSDs in parallel can certainly save elapsed time and avoid moving data more than once.  There are pitfalls, though, and the strategy needs careful planning.

- Deploying a new OSD at full weight means a lot of write operations.  Running multiple whole-OSD backfills to a single host can, depending on your situation, saturate the HBA, resulting in slow requests.
- Judicious setting of norebalance/norecover can help somewhat, giving the affected OSDs/PGs time to peer and become ready before shoving data at them (a command sketch follows this list)
- Deploying at 0 CRUSH weight and incrementally ratcheting up the weight as PGs peer can spread that out
- I’ve recently seen the idea of temporarily setting primary-affinity to 0 on the affected OSDs to deflect some competing traffic as well
- If you have OSDs to deploy on more than one server, you can also deploy them in batches of, say, 1-2 per server, striping them if you will.  That diffuses the impact and results in a faster overall recovery
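To make the flag / weight / primary-affinity points above concrete, here is a rough sketch of the commands involved; the OSD ID (osd.12) and the weight steps are placeholders for illustration, so adjust them to your drive sizes and topology:

    # Pause data movement while the new OSDs are created and peer
    ceph osd set norebalance
    ceph osd set norecover

    # ... create the new OSDs with an initial CRUSH weight of 0 ...

    # Keep the new OSD out of the primary role while it backfills
    ceph osd primary-affinity osd.12 0

    # Ratchet the CRUSH weight up in steps toward the drive's full weight
    # (roughly its size in TiB), letting peering/backfill settle in between
    ceph osd crush reweight osd.12 0.5
    ceph osd crush reweight osd.12 1.0
    # ... and so on up to full weight ...

    # Let recovery and backfill proceed
    ceph osd unset norecover
    ceph osd unset norebalance

    # Once backfill has finished, restore primary affinity
    ceph osd primary-affinity osd.12 1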

As for how many are safe to do in parallel, there are multiple variables: HDD vs. SSD, client workload, and especially how many other OSDs are in the same logical rack/host.  On a cluster of 450 OSDs with 150 in each logical rack, each OSD is less than 1% of a rack, so deploying 4 of them at once would not be a massive change.  In a smaller cluster with, say, 45 OSDs and 15 in each rack, the same operation would touch a much larger fraction of the cluster and be more disruptive.

If the numbers below are totals, i.e. you would be expanding your cluster from a total of 4 OSDs to a total of 8, that is something I wouldn’t do, having experienced under Dumpling what it was like to triple the size of a certain cluster in one swoop.

So one approach is trial and error: see how many you can get away with before you get slow requests, then back off.  In production, of course, this is playing with fire.  Depending on which release you’re running, cranking down the usual backfill/recovery tunables can also help mitigate the thundering herd effect (see the examples below).
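As a minimal sketch of what that looks like, assuming the stock osd_max_backfills and osd_recovery_max_active tunables and a conservative value of 1 (adjust to taste for your hardware and release):

    # Inject at runtime on the OSDs (works on most releases)
    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'

    # On Mimic and later you can also persist this in the central config
    ceph config set osd osd_max_backfills 1
    ceph config set osd osd_recovery_max_active 1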

— aad

> This morning I tried the careful approach, and added one OSD to server1. 
> It all went fine, everything rebuilt and I have a HEALTH_OK again now. 
> It took around 7 hours.
> 
> But now I started thinking... (and that's when things go wrong, 
> therefore hoping for feedback here....)
> 
> The question: was I being stupid to add only ONE osd to the server1? Is 
> it not smarter to add all four OSDs at the same time?
> 
> I mean: things will rebuild anyway...and I have the feeling that 
> rebuilding from 4 -> 8 OSDs is not going to be much heavier than 
> rebuilding from 4 -> 5 OSDs. Right?
> 
> So better add all new OSDs together on a specific server?
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



