On Fri, 13 Apr 2018, Dan Van Der Ster wrote:
On 12 Apr 2018, at 17:44, Sage Weil <sage@xxxxxxxxxxxx> wrote:
On Thu, 12 Apr 2018, Dan Van Der Ster wrote:
Hi all,
We noticed that when changing existing pools from being class-agnostic to
choosing a class (e.g. hdd [1]), a lot, perhaps all, of the data will move.
I think this is because the bucket IDs in the shadow tree are quite
different from, or at least ordered differently than, the original bucket
IDs.
But I wonder if we could use `ceph osd crush swap-bucket` (or a more
elaborate version of swap bucket) to move the shadow buckets into the
correct place in the tree, thereby preventing data movement.
Did anyone already ponder on this topic?
We could do a one-time transition that swaps the "real" ids with the
shadow ids for one of the classes, and at the same time change the crush
rule for the pool.
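In crushtool -d terms each bucket carries both ids, so the transition would
just exchange the two numbers in every bucket and the hdd shadow bucket
would inherit the original id (host name, ids, and weights below are made
up for illustration):

host cephdata01 {
    id -2                 # "real" bucket id
    id -10 class hdd      # shadow bucket id used by class-hdd rules
    alg straw2
    hash 0                # rjenkins1
    item osd.0 weight 5.458
    item osd.1 weight 5.458
}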
This will only really work/help if the majority of the devices are one
class. That's probably usually the case?
That's indeed our use-case. We have an all-hdd cluster for rgw -- we want to *add* some ssd osds to each host in the cluster, then use the device class rules to map the bucket_data/index pools accordingly.
Do you expect no data movement if we use crushtool to decompile, swap
the IDs, then recompile and setcrushmap?
Correct. You can test it out with osdmaptool --test-map-pgs on the
before and after maps. I think you'll want to edit the CRUSH rule in
place to use the device class as well, and you'll probably need to
create a rule using the class ahead of time so that all the shadow ids
are there to swap with.
Let me know if there are problems!
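One way to do that before/after comparison entirely offline is crushtool's
test mode rather than osdmaptool; the file names, the rule ids 0 and 1, and
--num-rep 3 below are assumptions, and the awk keeps only the x value and
the resulting osd list because the rule ids differ between the two runs:

ceph osd getcrushmap -o crushmap.before       # current crushmap
# crushmap.after = the recompiled crushmap with the ids swapped
crushtool -i crushmap.before --test --rule 0 --num-rep 3 --show-mappings \
    | awk '{print $5, $6}' > mappings.before
crushtool -i crushmap.after --test --rule 1 --num-rep 3 --show-mappings \
    | awk '{print $5, $6}' > mappings.after
diff mappings.before mappings.after           # empty output = identical placement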
(Replying to this old thread.)
It works!
-- Dan
P.S. We wrote this quick script to help out [1]. The basic procedure is:
Prepare the new ruleset and create a crushmap with swapped IDs:
1. ceph osd crush rule create-replicated replicated_hdd default host hdd
2. ceph osd getcrushmap | crushtool -d - -o crushmap.txt
3. ceph-scripts/tools/device-class-id-swapper.py crushmap.txt # creates crushmap.txt-new with bucket IDs swapped (the idea is sketched just after this list).
4. diff crushmap.txt crushmap.txt-new #compare, sanity check
5. crushtool -c crushmap.txt-new -o crushmapnew
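The swapper in step 3 boils down to roughly the following idea (an
illustrative sketch only, not the actual script from [1]; it assumes the
usual crushtool -d output where each bucket lists its real id followed by
one "id N class <class>" line per device class):

#!/usr/bin/env python3
# Sketch: swap each bucket's real id with its class-hdd shadow id in a
# decompiled crushmap, so the hdd shadow tree takes over the original ids
# and a class-hdd rule places PGs exactly where the old rule did.
import re
import sys

DEVICE_CLASS = 'hdd'

def swap_ids(text, device_class=DEVICE_CLASS):
    lines = text.splitlines(True)
    for i, line in enumerate(lines):
        # a bucket's "real" id line, e.g. "    id -2   # do not change ..."
        m = re.match(r'\s*id (-\d+)\s*(#.*)?$', line.rstrip('\n'))
        if not m:
            continue
        # look a few lines ahead for the matching "id X class hdd" line
        for j in range(i + 1, min(i + 8, len(lines))):
            m2 = re.match(r'\s*id (-\d+) class %s\b' % device_class, lines[j])
            if m2:
                real_id, shadow_id = m.group(1), m2.group(1)
                lines[i] = lines[i].replace('id ' + real_id, 'id ' + shadow_id, 1)
                lines[j] = lines[j].replace('id ' + shadow_id, 'id ' + real_id, 1)
                break
    return ''.join(lines)

if __name__ == '__main__':
    path = sys.argv[1]                    # e.g. crushmap.txt
    with open(path) as f:
        swapped = swap_ids(f.read())
    with open(path + '-new', 'w') as f:   # writes crushmap.txt-new
        f.write(swapped)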
Switch the pool to the new hdd rule, but don't move data yet:
1. ceph osd set nobackfill; ceph osd set norecover; ceph osd set norebalance
2. ceph osd pool set test crush_rule replicated_hdd
3. Wait for re-peering -- there should be lots of misplaced or degraded objects.
Inject the new crushmap, making the cluster HEALTH_OK again:
1. ceph osd setcrushmap -i crushmapnew
2. ceph osd unset nobackfill; ceph osd unset norecover; ceph osd unset norebalance
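Once the flags are unset nothing should actually backfill; to confirm
(pool "test" as in the example above):

ceph -s                              # should settle back to HEALTH_OK, no misplaced objects
ceph osd pool get test crush_rule    # should now report replicated_hdd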