On Wed, 16 Jul 2014, Gregory Farnum wrote:
> On Wed, Jul 16, 2014 at 4:45 PM, Craig Lewis <clewis at centraldesktop.com> wrote:
> > One of the things I've learned is that many small changes to the cluster
> > are better than one large change.  Adding 20% more OSDs?  Don't add them
> > all at once; trickle them in over time.  Increasing pg_num & pgp_num from
> > 128 to 1024?  Go in steps, not one leap.
> >
> > I try to avoid operations that will touch more than 20% of the disks
> > simultaneously.  When I had journals on HDD, I tried to avoid going over
> > 10% of the disks.
> >
> > Is there a way to execute `ceph osd crush tunables optimal` in a way that
> > takes smaller steps?
>
> Unfortunately not; the crush tunables are changes to the core placement
> algorithms at work.

Well, there is one way, but it is only somewhat effective.  If you
decompile the crush maps for bobtail vs firefly, the actual difference is

  tunable chooseleaf_vary_r 1

This is written such that a value of 1 is the optimal 'new' way, 0 is the
legacy old way, and values > 1 are less-painful steps between the two
(though mostly closer to the firefly value of 1).  So, you could set

  tunable chooseleaf_vary_r 4

wait for the cluster to settle, then do

  tunable chooseleaf_vary_r 3

...and so forth down to 1.

I did some limited testing of the data movement involved and noted it here:

  https://github.com/ceph/ceph/commit/37f840b499da1d39f74bfb057cf2b92ef4e84dc6

In my test case, going from 0 to 4 was about 1/10th as bad as going
straight from 0 to 1, but the final step from 2 to 1 was still about 1/2
as bad.  I'm not sure whether that means stepping isn't worth the trouble
compared to jumping straight to the firefly tunables, or whether it means
legacy users should just set (and leave) this at 2, 3, or 4 and get almost
all of the benefit without the rebalance pain.

sage
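
[For reference, a rough sketch of the decompile/edit/recompile cycle the
above implies, using the standard getcrushmap/crushtool/setcrushmap
commands.  The file paths are arbitrary examples, and on a legacy map the
tunable line may not exist yet and has to be added by hand in the
tunables section at the top of the decompiled map:]

  # grab the current crush map and decompile it to text
  ceph osd getcrushmap -o /tmp/crush.bin
  crushtool -d /tmp/crush.bin -o /tmp/crush.txt

  # edit /tmp/crush.txt: in the tunables section at the top,
  # add or change the line:
  #   tunable chooseleaf_vary_r 4

  # recompile the edited map and inject it into the cluster
  crushtool -c /tmp/crush.txt -o /tmp/crush.new
  ceph osd setcrushmap -i /tmp/crush.new

  # wait for recovery to finish (HEALTH_OK), then repeat the edit
  # with 3, then 2, then 1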