On Fri, 12 Jan 2018, Stefan Priebe - Profihost AG wrote:
> On 11.01.2018 at 21:37, Sage Weil wrote:
> > On Thu, 11 Jan 2018, Stefan Priebe - Profihost AG wrote:
> >> OK it wasn't the balancer.
> >>
> >> It happens after executing all the reweight and crush compat commands:
> >>
> >> And even on a much bigger cluster it's 6% again. Some rounding issue?
> >> I migrated with a script, so it's not a typo.
> >
> > Maybe.. can you narrow down which command it is?  I'm guessing that one
> > of the 'ceph osd crush weight-set reweight-compat ...' commands does it,
> > but it would be nice to confirm whether it is a rounding issue or if
> > something is broken!
>
> Hi Sage,
>
> it happens while executing:
>
>  ceph osd crush weight-set reweight-compat <osd> <optimized-weight>
>  ceph osd crush reweight <osd> <target-weight>
>
> right after the first command (reweight-compat optimized-weight) for the

It does this when the optimized-weight is *exactly* the same as the
current (normal) weight?  If it matches, it should be a no-op.  Can you
do a 'ceph osd crush tree' before and after the command so we can
compare?

(In fact, I think that first step is pointless, because when you create
the compat weight-set it is populated with the regular CRUSH weights,
which in your situation *are* the optimized weights.)

Actually, looking at this more closely, it looks like the normal
'reweight' command sets the value in the compat weight-set too, so in
reality we want to reorder those commands (and run them quickly in
succession), e.g.

 ceph osd crush reweight osd.1 <target-weight>
 ceph osd crush weight-set reweight-compat osd.1 <optimized-weight>

but, again, the before and after PG layout should match.  A 'ceph osd
crush tree' dump before, after, and in between will help sort out what
is going on.
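For example, something like this (an untested sketch; osd.1 is just a
placeholder, and <target-weight>/<optimized-weight> are the per-OSD
values from your script) would capture all three snapshots for one OSD:

 ceph osd crush tree > tree.before
 ceph osd crush reweight osd.1 <target-weight>
 ceph osd crush tree > tree.between
 ceph osd crush weight-set reweight-compat osd.1 <optimized-weight>
 ceph osd crush tree > tree.after
 diff tree.before tree.between
 diff tree.between tree.after

Any weight change in those diffs that we don't expect is where the data
movement is coming from.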
Thanks!
sage

> 1st OSD it gets:
> 0.4%
> after the second (reweight target-weight):
> 0.7%
>
> for each OSD it gets worse...
>
> Stefan
>
> > sage
> >
> >> Stefan
> >>
> >> On 11.01.2018 at 21:21, Stefan Priebe - Profihost AG wrote:
> >>> Hi,
> >>>
> >>> On 11.01.2018 at 21:10, Sage Weil wrote:
> >>>> On Thu, 11 Jan 2018, Stefan Priebe - Profihost AG wrote:
> >>>>> On 11.01.2018 at 20:58, Sage Weil wrote:
> >>>>>> On Thu, 11 Jan 2018, Stefan Priebe - Profihost AG wrote:
> >>>>>>> Hi Sage,
> >>>>>>>
> >>>>>>> this did not work as expected. I tested it in another, smaller
> >>>>>>> cluster and it resulted in about 6% misplaced objects.
> >>>>>>
> >>>>>> Can you narrow down at what stage the misplaced objects happened?
> >>>>>
> >>>>> ouch, I saw this:
> >>>>>
> >>>>> # ceph balancer status
> >>>>> {
> >>>>>     "active": true,
> >>>>>     "plans": [
> >>>>>         "auto_2018-01-11_19:52:28"
> >>>>>     ],
> >>>>>     "mode": "crush-compat"
> >>>>> }
> >>>>>
> >>>>> so might it be the balancer being executed while I was modifying
> >>>>> the tree? Can I stop it and re-execute it manually?
> >>>>
> >>>> You can always 'ceph balancer off'.  And I probably wouldn't turn
> >>>> it on until after you've cleaned this up, because it will balance
> >>>> with the current weights being the 'target' weights (when in your
> >>>> case they're not (yet)).
> >>>>
> >>>> To manually see what the balancer would do, you can run
> >>>>
> >>>>  ceph balancer optimize foo
> >>>>  ceph balancer show foo
> >>>>  ceph balancer eval foo    # (see the numerical analysis)
> >>>>
> >>>> and, if it looks good,
> >>>>
> >>>>  ceph balancer execute foo
> >>>>
> >>>> to actually apply the changes.
> >>>
> >>> ok thanks, but it seems there are still leftovers somewhere:
> >>>
> >>> [expo-office-node1 ~]# ceph balancer optimize stefan
> >>> Error EINVAL: Traceback (most recent call last):
> >>>   File "/usr/lib/ceph/mgr/balancer/module.py", line 303, in handle_command
> >>>     self.optimize(plan)
> >>>   File "/usr/lib/ceph/mgr/balancer/module.py", line 596, in optimize
> >>>     return self.do_crush_compat(plan)
> >>>   File "/usr/lib/ceph/mgr/balancer/module.py", line 658, in do_crush_compat
> >>>     orig_ws = self.get_compat_weight_set_weights()
> >>>   File "/usr/lib/ceph/mgr/balancer/module.py", line 837, in get_compat_weight_set_weights
> >>>     raise RuntimeError('could not find bucket %s' % b['bucket_id'])
> >>> RuntimeError: could not find bucket -6
> >>>
> >>> [expo-office-node1 ~]# ceph osd tree
> >>> ID CLASS WEIGHT   TYPE NAME                  STATUS REWEIGHT PRI-AFF
> >>> -1       14.55957 root default
> >>> -4        3.63989     host expo-office-node1
> >>>  8   ssd  0.90997         osd.8                  up  1.00000 1.00000
> >>>  9   ssd  0.90997         osd.9                  up  1.00000 1.00000
> >>> 10   ssd  0.90997         osd.10                 up  1.00000 1.00000
> >>> 11   ssd  0.90997         osd.11                 up  1.00000 1.00000
> >>> -2        3.63989     host expo-office-node2
> >>>  0   ssd  0.90997         osd.0                  up  1.00000 1.00000
> >>>  1   ssd  0.90997         osd.1                  up  1.00000 1.00000
> >>>  2   ssd  0.90997         osd.2                  up  1.00000 1.00000
> >>>  3   ssd  0.90997         osd.3                  up  1.00000 1.00000
> >>> -3        3.63989     host expo-office-node3
> >>>  4   ssd  0.90997         osd.4                  up  1.00000 1.00000
> >>>  5   ssd  0.90997         osd.5                  up  1.00000 1.00000
> >>>  6   ssd  0.90997         osd.6                  up  1.00000 1.00000
> >>>  7   ssd  0.90997         osd.7                  up  1.00000 1.00000
> >>> -5        3.63989     host expo-office-node4
> >>> 12   ssd  0.90997         osd.12                 up  1.00000 1.00000
> >>> 13   ssd  0.90997         osd.13                 up  1.00000 1.00000
> >>> 14   ssd  0.90997         osd.14                 up  1.00000 1.00000
> >>> 15   ssd  0.90997         osd.15                 up  1.00000 1.00000
> >>>
> >>> Stefan
> >>>
> >>>> sage
> >>>>
> >>>>>>
> >>>>>> sage
> >>>>>>
> >>>>>>> Any ideas?
> >>>>>>>
> >>>>>>> Stefan
> >>>>>>>
> >>>>>>> On 11.01.2018 at 08:09, Stefan Priebe - Profihost AG wrote:
> >>>>>>>> Hi,
> >>>>>>>> Thanks! Can this be done while still having jewel clients?
> >>>>>>>>
> >>>>>>>> Stefan
> >>>>>>>>
> >>>>>>>> Excuse my typos, sent from my mobile phone.
> >>>>>>>>
> >>>>>>>> On 10.01.2018 at 22:56, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> >>>>>>>>
> >>>>>>>>> On Wed, 10 Jan 2018, Stefan Priebe - Profihost AG wrote:
> >>>>>>>>>> On 10.01.2018 at 22:23, Sage Weil wrote:
> >>>>>>>>>>> On Wed, 10 Jan 2018, Stefan Priebe - Profihost AG wrote:
> >>>>>>>>>>>> OK,
> >>>>>>>>>>>>
> >>>>>>>>>>>> in the past we used the python crush optimize tool to
> >>>>>>>>>>>> reweight the osd usage - it inserted a 2nd tree with
> >>>>>>>>>>>> $hostname-target-weight as hostnames.
> >>>>>>>>>>>
> >>>>>>>>>>> Can you attach a 'ceph osd crush tree' (or partial output) so
> >>>>>>>>>>> I can see what you mean?
> >>>>>>>>>>
> >>>>>>>>>> Sure - attached.
> >>>>>>>>>
> >>>>>>>>> Got it.
> >>>>>>>>>
> >>>>>>>>>>>> Now the questions are:
> >>>>>>>>>>>> 1.) Can we remove the tree? How?
> >>>>>>>>>>>> 2.) Can we do this now, or only after all clients are
> >>>>>>>>>>>> running Luminous?
> >>>>>>>>>>>> 3.) Is it enough to enable the mgr balancer module?
> >>>>>>>>>
> >>>>>>>>> First,
> >>>>>>>>>
> >>>>>>>>>  ceph osd crush weight-set create-compat
> >>>>>>>>>
> >>>>>>>>> then, for each osd,
> >>>>>>>>>
> >>>>>>>>>  ceph osd crush weight-set reweight-compat <osd> <optimized-weight>
> >>>>>>>>>  ceph osd crush reweight <osd> <target-weight>
> >>>>>>>>>
> >>>>>>>>> That won't move any data, but will keep your current optimized
> >>>>>>>>> weights in the compat weight-set, where they belong.
> >>>>>>>>>
> >>>>>>>>> Then you can remove the *-target-weight buckets.  For each osd,
> >>>>>>>>>
> >>>>>>>>>  ceph osd crush rm <osd> <ancestor>-target-weight
> >>>>>>>>>
> >>>>>>>>> and then, for each remaining bucket,
> >>>>>>>>>
> >>>>>>>>>  ceph osd crush rm <foo>-target-weight
> >>>>>>>>>
> >>>>>>>>> Finally, turn on the balancer (or test it to see what it wants
> >>>>>>>>> to do with the optimize command).
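> >>>>>>>>>
> >>>>>>>>> Roughly, as an untested sketch (assuming a file weights.txt with
> >>>>>>>>> one "<osd> <optimized-weight> <target-weight>" triple per line;
> >>>>>>>>> the file name and format are made up, adjust to wherever your
> >>>>>>>>> script keeps the weights):
> >>>>>>>>>
> >>>>>>>>>  # populate the compat weight-set, then restore the real weights
> >>>>>>>>>  ceph osd crush weight-set create-compat
> >>>>>>>>>  while read osd opt target; do
> >>>>>>>>>      ceph osd crush weight-set reweight-compat "$osd" "$opt"
> >>>>>>>>>      ceph osd crush reweight "$osd" "$target"
> >>>>>>>>>  done < weights.txt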
> >>>>>>>>>
> >>>>>>>>> HTH!
> >>>>>>>>> sage
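P.S. About the 'could not find bucket -6' traceback above: that usually
means the compat weight-set still carries an entry for a bucket
(presumably one of the removed -target-weight buckets) that no longer
exists in the CRUSH map.  If it persists after the cleanup, something
like

 ceph osd crush weight-set rm-compat
 ceph osd crush weight-set create-compat

should drop the stale entries and rebuild the weight-set from the
current tree (untested here; you would then have to re-apply the
reweight-compat weights, since create-compat starts from the regular
CRUSH weights).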