Re: Best way to change bucket hierarchy

Hi George,

yes, I believe your interpretation is correct. Because the chassis buckets have new bucket IDs, the distribution hashing will change. I also believe the trick to avoid data movement in your situation is to export the new crush map, swap the IDs between each corresponding host and chassis bucket in *all* (!!!) occurrences, and import it again. This is possible because you currently have the special case of a one-to-one correspondence between hosts and chassis.

This would be the procedure Wido explained; there is no other way to make this edit.
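A toy illustration of the swap on a made-up fragment (file path and fragment are invented for the example; a real decompiled map has more occurrences of each ID, including per-device-class lines like "id -10 class hdd", and every one must be swapped consistently):

```shell
# Toy illustration only: swap bucket IDs -9 (host) and -5 (chassis) in a
# made-up crush-map fragment.  A direct swap would clobber one of the two
# IDs, so the edit goes through a placeholder.
cat > /tmp/cm.txt <<'EOF'
host vis-hsw-01 {
    id -9
    item osd.3 weight 10.913
}
chassis chassis-hsw1 {
    id -5
    item vis-hsw-01 weight 40.017
}
EOF
sed -i -e 's/id -9$/id -TMP/' \
       -e 's/id -5$/id -9/' \
       -e 's/id -TMP$/id -5/' /tmp/cm.txt
grep 'id -' /tmp/cm.txt   # host bucket now holds -5, chassis bucket -9
```

In practice the map is exported and re-imported around the edit, roughly: `ceph osd getcrushmap -o cm.bin; crushtool -d cm.bin -o cm.txt; <edit>; crushtool -c cm.txt -o cm.new; ceph osd setcrushmap -i cm.new`.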

Whether you want to do that depends on how far into the data movement you are. If it's almost done, I wouldn't bother. If it will take another month, it might be worth trying. As far as I can see, your crush map is going to be a short text file, so it should be feasible to edit.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Kyriazis, George <george.kyriazis@xxxxxxxxx>
Sent: 05 June 2020 01:36
To: Frank Schilder
Cc: ceph-users
Subject: Re: Best way to change bucket hierarchy

Understand that it’s difficult to debug remotely. :-)

In my current scenario I have 5 machines (1 host per chassis), but planning on adding some additional chassis with 4 hosts per chassis in the near future.  Currently I am going through the first stage of adding “stub” chassis for the 5 hosts/chassis that I have, basically reparenting each host to its own chassis, as shown below:

ID  CLASS WEIGHT    TYPE NAME                STATUS REWEIGHT PRI-AFF
 -1       203.72598 root default
 -5        40.01700     chassis chassis-hsw1
 -9        40.01700         host vis-hsw-01
  3   hdd  10.91299             osd.3            up  1.00000 1.00000
  6   hdd  14.55199             osd.6            up  1.00000 1.00000
 10   hdd  14.55199             osd.10           up  1.00000 1.00000
 -6        40.01700     chassis chassis-hsw2
-13        40.01700         host vis-hsw-02
  0   hdd  10.91299             osd.0            up  1.00000 1.00000
  7   hdd  14.55199             osd.7            up  1.00000 1.00000
 11   hdd  14.55199             osd.11           up  1.00000 1.00000
 -7        40.01700     chassis chassis-hsw3
-11        40.01700         host vis-hsw-03
  4   hdd  10.91299             osd.4            up  1.00000 1.00000
  8   hdd  14.55199             osd.8            up  1.00000 1.00000
 12   hdd  14.55199             osd.12           up  1.00000 1.00000
 -8        40.01700     chassis chassis-hsw4
 -3        40.01700         host vis-hsw-04
  5   hdd  10.91299             osd.5            up  1.00000 1.00000
  9   hdd  14.55199             osd.9            up  1.00000 1.00000
 13   hdd  14.55199             osd.13           up  1.00000 1.00000
-17        43.65799     chassis chassis-hsw5
-15        43.65799         host vis-hsw-05
  1   hdd  14.55299             osd.1            up  1.00000 1.00000
  2   hdd  14.55299             osd.2            up  1.00000 1.00000
 14   hdd  14.55299             osd.14           up  1.00000 1.00000

There is no additional constraint that is being added, so ideally there would be no data movement.  However, I can imagine that the CRUSH algorithm could hash the PGs into different OSDs now because there is a new thing to consider (namely the chassis).  Does it do that?

Thanks,

George


On Jun 4, 2020, at 6:22 PM, Frank Schilder <frans@xxxxxx> wrote:

It's hard to tell without knowing what the diff is, but from your description I take it that you changed the failure domain for every(?) pool from host to chassis. I don't know what a chassis is in your architecture, but if each chassis contains several host buckets, then yes, I would expect almost every PG to be affected.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Kyriazis, George <george.kyriazis@xxxxxxxxx>
Sent: 05 June 2020 00:28:43
To: Frank Schilder
Cc: ceph-users
Subject: Re: Best way to change bucket hierarchy

Hmm,

So I tried all that, and almost all of my PGs got remapped.  The crush map looks correct.  Is that normal?

Thanks,

George


On Jun 4, 2020, at 2:33 PM, Frank Schilder <frans@xxxxxx> wrote:

Hi George,

you don't need to worry about that too much. The EC profile contains two types of information: one part describes the actual EC encoding and the other holds crush parameters. Unfortunately, part of this information is mutable after pool creation while the rest is not, where "mutable" means changeable outside of the profile. You can change the failure domain in the crush map without issues, but the profile won't reflect that change. That's an inconsistency we currently have to live with; it would have been better to separate mutable data (like the failure domain) from immutable data (like k and m), or to provide a meaningful interface that keeps the mutable information consistent.

In short, don't believe everything the EC profile tells you. Some information might be out of date, like the failure domain or the device class (basically everything starting with crush-). If you remember that, you are out of trouble. Always dump the crush rule of an EC pool explicitly to see the true parameters in action.

Having said that, to change the failure domain for an EC pool, change the crush rule of the pool - I did this too and it works just fine. By default, the crush rule has the same name as the pool. I'm afraid that here you will have to edit the crush rule by hand, as Wido explained. There is no other way - at least currently not.
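From memory, a decompiled EC rule looks roughly like the sketch below (the rule id and min/max sizes are placeholders); the chooseleaf step is the only line that needs to change:

```
rule ec22 {
	id 1
	type erasure
	min_size 3
	max_size 4
	step set_chooseleaf_tries 5
	step set_choose_tries 100
	step take default class hdd
	step chooseleaf indep 0 type host    # change "host" to "chassis"
	step emit
}
```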

You can ask in this list for confirmation that your change is doing what you want.

Do not try to touch an EC profile; they are read-only anyway. The crush parameters are only used at pool creation and never looked at again. You can override them by editing the crush rule as explained above.

Best regards and good luck,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Kyriazis, George <george.kyriazis@xxxxxxxxx>
Sent: 04 June 2020 20:56:38
To: Frank Schilder
Cc: ceph-users
Subject: Re: Best way to change bucket hierarchy

Thanks Frank,

Interesting info about the EC profile.  I do have an EC pool, but I noticed the following when I dumped the profile:

# ceph osd erasure-code-profile get ec22
crush-device-class=hdd
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=2
m=2
plugin=jerasure
technique=reed_sol_van
w=8
#

Which says that the failure domain of the EC profile is also set to host.  Looks like I need to change the EC profile too, but since it is associated with the pool, maybe I can't do that after pool creation?  Or, since the property is named "crush-failure-domain", is it automatically inherited from the crush map, so I don't have to do anything?

Thanks,

George


On Jun 4, 2020, at 1:51 AM, Frank Schilder <frans@xxxxxx> wrote:

Hi George,

for replicated rules you can simply create a new crush rule with the failure domain set to chassis and change any pool's crush rule to this new one. If you have EC pools, then the chooseleaf step needs to be edited by hand; I have done this before as well. (A really unfortunate side effect is that the EC profile attached to the pool goes out of sync with the crush map and there is nothing one can do about that. This is annoying yet harmless.)
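For the replicated case, the command-line route could look like this sketch (rule and pool names are placeholders; it obviously needs a live cluster):

```shell
# Create a replicated rule with failure domain "chassis" under the
# default root, then point an existing pool at it.
ceph osd crush rule create-replicated replicated_chassis default chassis
ceph osd pool set mypool crush_rule replicated_chassis
```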

The intent of doing these changes while norebalance is set is

- to avoid unnecessary data movement due to successive changes happening step by step, and
- to make sure peering is successful before starting to move data.

I believe OSDs peer a bit faster with norebalance set, and there is then a shorter interruption to ongoing I/O (no I/O happens to a PG during peering).

Yes, if you save the old crush map, you can undo everything. It is a good idea to have a backup anyway, just for reference and to compare before and after.
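Taking and restoring such a backup is two commands (the file name is arbitrary):

```shell
# Save the current (binary) crush map before editing:
ceph osd getcrushmap -o crushmap.backup
# If the change misbehaves, the old map can be injected back:
ceph osd setcrushmap -i crushmap.backup
```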

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Kyriazis, George <george.kyriazis@xxxxxxxxx>
Sent: 04 June 2020 00:58:20
To: Frank Schilder
Cc: ceph-users
Subject: Re: Best way to change bucket hierarchy

Thanks Frank,

I don’t have too much experience editing crush rules, but I assume the chooseleaf step would also have to change to:

     step chooseleaf firstn 0 type chassis

Correct?  Is that the only other change that is needed?  It looks like the rule change can happen either inside or outside the "norebalance" window (again with CLI commands), but is it safer to do it inside (i.e. while rebalancing is disabled)?

If I keep a backup of the crush rule map (with “ceph osd getcrushmap”), I assume I can restore the old map if something goes bad?

Thanks again!

George



On Jun 3, 2020, at 5:24 PM, Frank Schilder <frans@xxxxxx> wrote:

You can use the command-line without editing the crush map. Look at the documentation of commands like

ceph osd crush add-bucket ...
ceph osd crush move ...

Before starting this, run "ceph osd set norebalance" and unset it after you are happy with the crush tree. Let everything peer. You should see misplaced objects and remapped PGs, but no degraded objects or PGs.
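For the reparenting discussed in this thread, the sequence could look roughly like this (bucket and host names follow George's naming; adjust to your own and repeat per host):

```shell
ceph osd set norebalance
# Create a chassis bucket, attach it under the root, then move the host in:
ceph osd crush add-bucket chassis-hsw1 chassis
ceph osd crush move chassis-hsw1 root=default
ceph osd crush move vis-hsw-01 chassis=chassis-hsw1
# ...repeat for the remaining hosts, check the tree, let PGs peer...
ceph osd tree
ceph osd unset norebalance
```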

Do this only when the cluster is HEALTH_OK; otherwise things can get really complicated.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Kyriazis, George <george.kyriazis@xxxxxxxxx>
Sent: 03 June 2020 22:45:11
To: ceph-users
Subject:  Best way to change bucket hierarchy

Hello,

I have a live ceph cluster, and I need to modify the bucket hierarchy.  I am currently using the default crush rule (i.e. keep each replica on a different host).  I need to add a "chassis" level and keep replicas on different chassis.

From what I read in the documentation, I would have to edit the crush map manually; however, this sounds kind of scary on a live cluster.

Are there any “best known methods” to achieve that goal without messing things up?

In my current scenario I have one host per chassis, and I am planning on later adding nodes where there would be more than one host per chassis.  It looks like, in theory, there wouldn't be a need for any data movement after the crush map changes.  Will reality match theory?  Anything else I need to watch out for?

Thank you!

George

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx




