Re: [ceph-users] CRUSH puzzle: step weighted-take

Dan van der Ster <dan@xxxxxxxxxxxxxx> · Fri, 28 Sep 2018 08:59:27 +0200

On Thu, Sep 27, 2018 at 6:34 PM Luis Periquito <periquito@xxxxxxxxx> wrote:
>
> I think your objective is to move the data without anyone else
> noticing. What I usually do is reduce the priority of the recovery
> process as much as possible. Do note this will make the recovery take
> a looong time, and will also make recovery from failures slow...
> ceph tell osd.* injectargs '--osd_recovery_sleep 0.9'
> ceph tell osd.* injectargs '--osd-max-backfills 1'
> ceph tell osd.* injectargs '--osd-recovery-op-priority 1'
> ceph tell osd.* injectargs '--osd-client-op-priority 63'
> ceph tell osd.* injectargs '--osd-recovery-max-active 1'
> ceph tell osd.* injectargs '--osd_recovery_max_chunk 524288'
>
> I would also assume you have set osd_scrub_during_recovery to false.
>

Thanks Luis -- that will definitely be how we backfill if we go that
route. However, I would prefer to avoid one massive change that
takes a long time to complete.

- dan

>
>
> On Thu, Sep 27, 2018 at 4:19 PM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> >
> > Dear Ceph friends,
> >
> > I have a CRUSH data migration puzzle and wondered if someone could
> > think of a clever solution.
> >
> > Consider an osd tree like this:
> >
> >   -2       4428.02979     room 0513-R-0050
> >  -72        911.81897         rack RA01
> >   -4        917.27899         rack RA05
> >   -6        917.25500         rack RA09
> >   -9        786.23901         rack RA13
> >  -14        895.43903         rack RA17
> >  -65       1161.16003     room 0513-R-0060
> >  -71        578.76001         ipservice S513-A-IP38
> >  -70        287.56000             rack BA09
> >  -80        291.20001             rack BA10
> >  -76        582.40002         ipservice S513-A-IP63
> >  -75        291.20001             rack BA11
> >  -78        291.20001             rack BA12
> >
> > In the beginning, for reasons that are not important, we created two pools:
> >   * poolA chooses room=0513-R-0050 then replicates 3x across the racks.
> >   * poolB chooses room=0513-R-0060, replicates 2x across the
> > ipservices, then puts a 3rd replica in room 0513-R-0050.
> >
> > For clarity, here is the crush rule for poolB:
> >         type replicated
> >         min_size 1
> >         max_size 10
> >         step take 0513-R-0060
> >         step chooseleaf firstn 2 type ipservice
> >         step emit
> >         step take 0513-R-0050
> >         step chooseleaf firstn -2 type rack
> >         step emit
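
A side note on the rule above, as a minimal sketch rather than CRUSH's actual
code: a positive firstn asks for that many items, while zero or a negative
value is counted relative to the pool size. The helper name
replicas_for_step is made up for illustration, and the cap at pool_size is a
simplification of how the total result is limited.

```python
def replicas_for_step(firstn: int, pool_size: int) -> int:
    """Number of items a 'chooseleaf firstn N' step asks for (simplified model)."""
    if firstn > 0:
        # A positive N asks for N items; the total can never exceed the pool size.
        return min(firstn, pool_size)
    # N == 0 means "pool size"; a negative N means "pool size minus |N|".
    return max(pool_size + firstn, 0)


pool_size = 3  # poolB is a size=3 pool
in_0060 = replicas_for_step(2, pool_size)   # step chooseleaf firstn 2 type ipservice
in_0050 = replicas_for_step(-2, pool_size)  # step chooseleaf firstn -2 type rack
print(in_0060, in_0050)
```

Under that model, a size=3 pool gets two replicas from the first take/emit
pair and one from the second, which matches the description of poolB.
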
> >
> > Now to the puzzle.
> > For reasons that are not important, we now want to change the rule for
> > poolB to put all three replicas in room 0513-R-0060.
> > And we need to do this in a way which is totally non-disruptive
> > (latency-wise) to the users of either pool. (These are both *very*
> > active RBD pools).
> >
> > I see two obvious ways to proceed:
> >   (1) simply change the rule for poolB to put a third replica on any
> > osd in room 0513-R-0060. I'm afraid though that this would involve way
> > too many concurrent backfills, cluster-wide, even with
> > osd_max_backfills=1.
> >   (2) change poolB size to 2, then change the crush rule to that from
> > (1), then reset poolB size to 3. This would risk data availability
> > during the time that the pool is size=2, and would also risk that every
> > osd in room 0513-R-0050 would be too busy deleting for some indeterminate
> > time period (tens of minutes, I expect).
> >
> > So I would probably exclude those two approaches.
> >
> > Conceptually what I'd like to be able to do is a gradual migration,
> > which if I may invent some syntax on the fly...
> >
> > Instead of
> >        step take 0513-R-0050
> > do
> >        step weighted-take 99 0513-R-0050 1 0513-R-0060
> >
> > That is, 99% of the time take room 0513-R-0050 for the 3rd copies, 1%
> > of the time take room 0513-R-0060.
> > With a mechanism like that, we could gradually adjust those "step
> > weighted-take" lines until 100% of the 3rd copies were in 0513-R-0060.
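
To make the intent concrete, here is a toy model of the invented
weighted-take step. Nothing like this exists in CRUSH today; the function
names, the PG ids, and the use of SHA-256 are all assumptions for
illustration (CRUSH uses its own hash). The point it tries to show is that if
the choice is a deterministic function of the PG, raising the percentage only
ever moves additional PGs and never moves one back.

```python
import hashlib

OLD_ROOM = "0513-R-0050"
NEW_ROOM = "0513-R-0060"


def third_replica_room(pg_id: str, pct_new: int) -> str:
    """Hypothetical weighted-take: pick the room for the 3rd replica of one PG.

    The PG id is hashed to a bucket in [0, 100); the PG goes to NEW_ROOM
    when its bucket is below pct_new, otherwise it stays in OLD_ROOM.
    """
    digest = hashlib.sha256(pg_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return NEW_ROOM if bucket < pct_new else OLD_ROOM


def moved_pgs(pg_ids, pct_new: int) -> set:
    """PGs whose 3rd replica would live in NEW_ROOM at this percentage."""
    return {pg for pg in pg_ids if third_replica_room(pg, pct_new) == NEW_ROOM}


pgs = [f"5.{i:x}" for i in range(4096)]  # made-up PG ids for a made-up pool 5

previous = set()
for pct in (1, 2, 5, 100):
    current = moved_pgs(pgs, pct)
    assert previous <= current  # each step only adds PGs, never moves one back
    print(f"{pct:3d}%: {len(current)} PGs in {NEW_ROOM}, "
          f"{len(current - previous)} newly backfilling")
    previous = current
```

Each small bump in the percentage would then trigger only a small, bounded
batch of backfills, which is the gradual behaviour I am after.
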
> >
> > I have a feeling that something equivalent to that is already possible
> > with weight-sets or some other clever crush trickery.
> > Any ideas?
> >
> > Best Regards,
> >
> > Dan