On Thu, Sep 27, 2018 at 9:57 PM Maged Mokhtar <mmokhtar@xxxxxxxxxxx> wrote:
>
> On 27/09/18 17:18, Dan van der Ster wrote:
> > Dear Ceph friends,
> >
> > I have a CRUSH data migration puzzle and wondered if someone could
> > think of a clever solution.
> >
> > Consider an osd tree like this:
> >
> >  -2 4428.02979     room 0513-R-0050
> > -72  911.81897         rack RA01
> >  -4  917.27899         rack RA05
> >  -6  917.25500         rack RA09
> >  -9  786.23901         rack RA13
> > -14  895.43903         rack RA17
> > -65 1161.16003     room 0513-R-0060
> > -71  578.76001         ipservice S513-A-IP38
> > -70  287.56000             rack BA09
> > -80  291.20001             rack BA10
> > -76  582.40002         ipservice S513-A-IP63
> > -75  291.20001             rack BA11
> > -78  291.20001             rack BA12
> >
> > In the beginning, for reasons that are not important, we created two pools:
> >   * poolA chooses room=0513-R-0050 then replicates 3x across the racks.
> >   * poolB chooses room=0513-R-0060, replicates 2x across the
> >     ipservices, then puts a 3rd replica in room 0513-R-0050.
> >
> > For clarity, here is the crush rule for poolB:
> >     type replicated
> >     min_size 1
> >     max_size 10
> >     step take 0513-R-0060
> >     step chooseleaf firstn 2 type ipservice
> >     step emit
> >     step take 0513-R-0050
> >     step chooseleaf firstn -2 type rack
> >     step emit
> >
> > Now to the puzzle.
> > For reasons that are not important, we now want to change the rule for
> > poolB to put all three replicas in room 0513-R-0060.
> > And we need to do this in a way which is totally non-disruptive
> > (latency-wise) to the users of either pool. (These are both *very*
> > active RBD pools.)
> >
> > I see two obvious ways to proceed:
> >   (1) Simply change the rule for poolB to put a third replica on any
> >       osd in room 0513-R-0060. I'm afraid though that this would involve
> >       way too many concurrent backfills, cluster-wide, even with
> >       osd_max_backfills=1.
> >   (2) Change poolB size to 2, then change the crush rule to that from
> >       (1), then reset poolB size to 3. This would risk data availability
> >       during the time that the pool is size=2, and also risks that every
> >       osd in room 0513-R-0050 would be too busy deleting for some
> >       indeterminate time period (10s of minutes, I expect).
> >
> > So I would probably exclude those two approaches.
> >
> > Conceptually, what I'd like to be able to do is a gradual migration,
> > which, if I may invent some syntax on the fly...
> >
> > Instead of
> >     step take 0513-R-0050
> > do
> >     step weighted-take 99 0513-R-0050 1 0513-R-0060
> >
> > That is, 99% of the time take room 0513-R-0050 for the 3rd copies, 1%
> > of the time take room 0513-R-0060.
> > With a mechanism like that, we could gradually adjust those "step
> > weighted-take" lines until 100% of the 3rd copies were in 0513-R-0060.
> >
> > I have a feeling that something equivalent to that is already possible
> > with weight-sets or some other clever crush trickery.
> > Any ideas?
> >
> > Best Regards,
> >
> > Dan
>
> would it be possible in your case to create a parent datacenter bucket
> to hold both rooms and assign their relative weights there, then for the
> third replica do a step take to this parent bucket? It's not elegant but
> may do the trick.

Hey, that might work!
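If I understand the idea, in a hand-edited crushmap it would look roughly
like the below (completely untested sketch; the bucket name
"third-copy-parent", its id, and the 99/1 starting weights are placeholders
I just made up):

    # grab and decompile the current map
    ceph osd getcrushmap -o crush.bin
    crushtool -d crush.bin -o crush.txt

    # add a parent bucket holding both rooms, with hand-set relative weights
    # (later, gradually raise 0513-R-0060 and lower 0513-R-0050)
    datacenter third-copy-parent {
            id -100                 # placeholder id
            alg straw2
            hash 0                  # rjenkins1
            item 0513-R-0050 weight 99.000
            item 0513-R-0060 weight 1.000
    }

    # and point the tail of poolB's rule at it:
            step take third-copy-parent
            step chooseleaf firstn -2 type rack
            step emit

    # compile and sanity-check the mappings offline before injecting
    crushtool -c crush.txt -o crush.new
    crushtool --test -i crush.new --rule <poolB rule id> --num-rep 3 --show-mappings
    ceph osd setcrushmap -i crush.new

The thing I'd want to verify with crushtool --test first is whether hanging
the two rooms under a second parent like that behaves sanely alongside the
existing hierarchy.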
Both rooms are already in the default root:

 -1 5589.18994 root default
 -2 4428.02979     room 0513-R-0050
-65 1161.16003     room 0513-R-0060
-71  578.76001         ipservice S513-A-IP38
-76  582.40002         ipservice S513-A-IP63

so I'll play with a test pool and weighting down room 0513-R-0060 to see
if this can work.

Thanks!

-- dan

> The suggested step weighted-take would be more flexible, as it can be
> changed on a replica level, but I do not know if you can do this with
> existing code.
>
> Maged
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com