On Fri, 13 Apr 2018, Dan Van Der Ster wrote:
On 12 Apr 2018, at 17:44, Sage Weil <sage@xxxxxxxxxxxx> wrote:
On Thu, 12 Apr 2018, Dan Van Der Ster wrote:
Hi all,
We noticed that when changing existing pools from being class-agnostic to
choosing a class (e.g. hdd [1]), a lot, perhaps all, of the data will move.
I think this is because the bucket IDs in the shadow tree are quite
different from, or at least ordered differently than, the original bucket
IDs.
But I wonder if we could use `ceph osd crush swap-bucket` (or a more
elaborate version of swap bucket) to move the shadow buckets into the
correct place in the tree, thereby preventing data movement.
Did anyone already ponder on this topic?
We could do a one-time transition that swaps the "real" ids with the
shadow ids for one of the classes, and at the same time change the crush
rule for the pool.
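In crushtool -d terms each bucket carries both ids, so the transition would
just exchange the two numbers in every bucket and the hdd shadow bucket
would inherit the original id (host name, ids, and weights below are made
up for illustration):

host cephdata01 {
    id -2                 # "real" bucket id
    id -10 class hdd      # shadow bucket id used by class-hdd rules
    alg straw2
    hash 0                # rjenkins1
    item osd.0 weight 5.458
    item osd.1 weight 5.458
}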
This will only really work/help if the majority of the devices are one
class. That's probably usually the case?
That's indeed our use-case. We have an all-hdd cluster for rgw -- we want to *add* some ssd osds to each host in the cluster, then use the device class rules to map the bucket_data/index pools accordingly.
Do you expect no data movement if we use crushtool to decompile, swap
the IDs, then recompile and setcrushmap?
Correct. You can test it out with osdmaptool --test-map-pgs on the
before and after maps. I think you'll want to edit the CRUSH rule in
place to use the device class as well, and you'll probably need to
create a rule using the class ahead of time so that all the shadow ids
are there to swap with.
Let me know if there are problems!
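One way to do that before/after comparison entirely offline is crushtool's
test mode rather than osdmaptool; the file names, the rule ids 0 and 1, and
--num-rep 3 below are assumptions, and the awk keeps only the x value and
the resulting osd list because the rule ids differ between the two runs:

ceph osd getcrushmap -o crushmap.before       # current crushmap
# crushmap.after = the recompiled crushmap with the ids swapped
crushtool -i crushmap.before --test --rule 0 --num-rep 3 --show-mappings \
    | awk '{print $5, $6}' > mappings.before
crushtool -i crushmap.after --test --rule 1 --num-rep 3 --show-mappings \
    | awk '{print $5, $6}' > mappings.after
diff mappings.before mappings.after           # empty output = identical placement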
(Replying to this old thread.)
It works!
-- Dan
P.S. We wrote this quick script to help out [1]. The basic procedure is:
Prepare the new ruleset and create a crushmap with swapped IDs:
1. ceph osd crush rule create-replicated replicated_hdd default host hdd
2. ceph osd getcrushmap | crushtool -d - -o crushmap.txt
3. ceph-scripts/tools/device-class-id-swapper.py crushmap.txt # creates crushmap.txt-new with bucket IDs swapped (the idea is sketched just after this list).
4. diff crushmap.txt crushmap.txt-new #compare, sanity check
5. crushtool -c crushmap.txt-new -o crushmapnew
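The swapper in step 3 boils down to roughly the following idea (an
illustrative sketch only, not the actual script from [1]; it assumes the
usual crushtool -d output where each bucket lists its real id followed by
one "id N class <class>" line per device class):

#!/usr/bin/env python3
# Sketch: swap each bucket's real id with its class-hdd shadow id in a
# decompiled crushmap, so the hdd shadow tree takes over the original ids
# and a class-hdd rule places PGs exactly where the old rule did.
import re
import sys

DEVICE_CLASS = 'hdd'

def swap_ids(text, device_class=DEVICE_CLASS):
    lines = text.splitlines(True)
    for i, line in enumerate(lines):
        # a bucket's "real" id line, e.g. "    id -2   # do not change ..."
        m = re.match(r'\s*id (-\d+)\s*(#.*)?$', line.rstrip('\n'))
        if not m:
            continue
        # look a few lines ahead for the matching "id X class hdd" line
        for j in range(i + 1, min(i + 8, len(lines))):
            m2 = re.match(r'\s*id (-\d+) class %s\b' % device_class, lines[j])
            if m2:
                real_id, shadow_id = m.group(1), m2.group(1)
                lines[i] = lines[i].replace('id ' + real_id, 'id ' + shadow_id, 1)
                lines[j] = lines[j].replace('id ' + shadow_id, 'id ' + real_id, 1)
                break
    return ''.join(lines)

if __name__ == '__main__':
    path = sys.argv[1]                    # e.g. crushmap.txt
    with open(path) as f:
        swapped = swap_ids(f.read())
    with open(path + '-new', 'w') as f:   # writes crushmap.txt-new
        f.write(swapped)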
Switch the pool to the new hdd rule, but don't move data yet:
1. ceph osd set nobackfill; ceph osd set norecover; ceph osd set norebalance
2. ceph osd pool set test crush_rule replicated_hdd
3. Wait for re-peering -- there should be lots of misplaced or degraded objects.
Inject the new crushmap, making the cluster HEALTH_OK again:
1. ceph osd setcrushmap -i crushmapnew
2. ceph osd unset nobackfill; ceph osd unset norecover; ceph osd unset norebalance
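Once the flags are unset nothing should actually backfill; to confirm
(pool "test" as in the example above):

ceph -s                              # should settle back to HEALTH_OK, no misplaced objects
ceph osd pool get test crush_rule    # should now report replicated_hdd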