On Wed, 16 Jul 2014, Gregory Farnum wrote:
> On Wed, Jul 16, 2014 at 4:45 PM, Craig Lewis <clewis at centraldesktop.com> wrote:
> > One of the things I've learned is that many small changes to the cluster
> > are better than one large change.  Adding 20% more OSDs?  Don't add them
> > all at once; trickle them in over time.  Increasing pg_num & pgp_num from
> > 128 to 1024?  Go in steps, not one leap.
> >
> > I try to avoid operations that will touch more than 20% of the disks
> > simultaneously.  When I had journals on HDD, I tried to avoid going over
> > 10% of the disks.
> >
> > Is there a way to execute `ceph osd crush tunables optimal` in a way that
> > takes smaller steps?
>
> Unfortunately not; the crush tunables are changes to the core placement
> algorithms at work.

Well, there is one way, but it is only somewhat effective.  If you
decompile the crush maps for bobtail vs firefly, the actual difference is

  tunable chooseleaf_vary_r 1

This is written such that a value of 1 is the optimal 'new' way, 0 is the
legacy old way, and values > 1 are less-painful steps between the two
(though mostly closer to the firefly value of 1).  So, you could set

  tunable chooseleaf_vary_r 4

wait for the cluster to settle, then do

  tunable chooseleaf_vary_r 3

...and so forth down to 1.

I did some limited testing of the data movement involved and noted it here:

  https://github.com/ceph/ceph/commit/37f840b499da1d39f74bfb057cf2b92ef4e84dc6

In my test case, going from 0 to 4 was about 1/10th as bad as going
straight from 0 to 1, but the final step from 2 to 1 was still about 1/2
as bad.  I'm not sure whether that means stepping isn't worth the trouble
compared to jumping straight to the firefly tunables, or whether it means
legacy users should just set (and leave) this at 2, 3, or 4 and get almost
all of the benefit without the rebalance pain.

sage
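
[For reference, a rough sketch of the decompile/edit/recompile cycle the
above implies, using the standard getcrushmap/crushtool/setcrushmap
commands.  The file paths are arbitrary examples, and on a legacy map the
tunable line may not exist yet and has to be added by hand in the
tunables section at the top of the decompiled map:]

  # grab the current crush map and decompile it to text
  ceph osd getcrushmap -o /tmp/crush.bin
  crushtool -d /tmp/crush.bin -o /tmp/crush.txt

  # edit /tmp/crush.txt: in the tunables section at the top,
  # add or change the line:
  #   tunable chooseleaf_vary_r 4

  # recompile the edited map and inject it into the cluster
  crushtool -c /tmp/crush.txt -o /tmp/crush.new
  ceph osd setcrushmap -i /tmp/crush.new

  # wait for recovery to finish (HEALTH_OK), then repeat the edit
  # with 3, then 2, then 1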