Re: CRUSH puzzle: step weighted-take

Luis Periquito <periquito@xxxxxxxxx> · Thu, 27 Sep 2018 17:33:49 +0100

I think your objective is to move the data without anyone else
noticing. What I usually do is reduce the priority of the recovery
process as much as possible. Do note this will make the recovery take
a looong time, and will also make recovery from failures slow...
ceph tell osd.* injectargs '--osd_recovery_sleep 0.9'
ceph tell osd.* injectargs '--osd-max-backfills 1'
ceph tell osd.* injectargs '--osd-recovery-op-priority 1'
ceph tell osd.* injectargs '--osd-client-op-priority 63'
ceph tell osd.* injectargs '--osd-recovery-max-active 1'
ceph tell osd.* injectargs '--osd_recovery_max_chunk 524288'

I would also assume you have set osd_scrub_during_recovery to false.

On Thu, Sep 27, 2018 at 4:19 PM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>
> Dear Ceph friends,
>
> I have a CRUSH data migration puzzle and wondered if someone could
> think of a clever solution.
>
> Consider an osd tree like this:
>
>   -2       4428.02979     room 0513-R-0050
>  -72        911.81897         rack RA01
>   -4        917.27899         rack RA05
>   -6        917.25500         rack RA09
>   -9        786.23901         rack RA13
>  -14        895.43903         rack RA17
>  -65       1161.16003     room 0513-R-0060
>  -71        578.76001         ipservice S513-A-IP38
>  -70        287.56000             rack BA09
>  -80        291.20001             rack BA10
>  -76        582.40002         ipservice S513-A-IP63
>  -75        291.20001             rack BA11
>  -78        291.20001             rack BA12
>
> In the beginning, for reasons that are not important, we created two pools:
>   * poolA chooses room=0513-R-0050 then replicates 3x across the racks.
>   * poolB chooses room=0513-R-0060, replicates 2x across the
> ipservices, then puts a 3rd replica in room 0513-R-0050.
>
> For clarity, here is the crush rule for poolB:
>         type replicated
>         min_size 1
>         max_size 10
>         step take 0513-R-0060
>         step chooseleaf firstn 2 type ipservice
>         step emit
>         step take 0513-R-0050
>         step chooseleaf firstn -2 type rack
>         step emit
>
> Now to the puzzle.
> For reasons that are not important, we now want to change the rule for
> poolB to put all three 3 replicas in room 0513-R-0060.
> And we need to do this in a way which is totally non-disruptive
> (latency-wise) to the users of either pools. (These are both *very*
> active RBD pools).
>
> I see two obvious ways to proceed:
>   (1) simply change the rule for poolB to put a third replica on any
> osd in room 0513-R-0060. I'm afraid though that this would involve way
> too many concurrent backfills, cluster-wide, even with
> osd_max_backfills=1.
>   (2) change poolB size to 2, then change the crush rule to that from
> (1), then reset poolB size to 3. This would risk data availability
> during the time that the pool is size=2, and also risks that every osd
> in room 0513-R-0050 would be too busy deleting for some indeterminate
> time period (10s of minutes, I expect).
>
> So I would probably exclude those two approaches.
>
> Conceptually what I'd like to be able to do is a gradual migration,
> which if I may invent some syntax on the fly...
>
> Instead of
>        step take 0513-R-0050
> do
>        step weighted-take 99 0513-R-0050 1 0513-R-0060
>
> That is, 99% of the time take room 0513-R-0050 for the 3rd copies, 1%
> of the time take room 0513-R-0060.
> With a mechanism like that, we could gradually adjust those "step
> weighted-take" lines until 100% of the 3rd copies were in 0513-R-0060.
>
> I have a feeling that something equivalent to that is already possible
> with weight-sets or some other clever crush trickery.
> Any ideas?
>
> Best Regards,
>
> Dan
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com