I have never changed IDs before; I'm just being extra cautious. If they do not show up explicitly anywhere other than inside the bucket definitions, then it is probably an easy edit: just swap them.

If you try this, could you please report back to the list whether it works as expected, ideally with example crush maps/items included to illustrate the edits for documentation purposes?

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Kyriazis, George <george.kyriazis@xxxxxxxxx>
Sent: 05 June 2020 17:46
To: Frank Schilder
Cc: ceph-users; Wido den Hollander
Subject: Re: Best way to change bucket hierarchy

Hmm,

From what I see in the crush map, “nodes” refer to other “nodes” by name, not by ID. In fact, I don’t see anything in the crush map referred to by ID. As we said before, though, the crush algorithm computes its hashes based on the IDs. I am not sure what else refers to them outside the crush map, so I can't tell whether all references would still be correct.

Thanks,

George

> On Jun 5, 2020, at 10:32 AM, Frank Schilder <frans@xxxxxx> wrote:
>
> Wido replied to you, check this thread.
>
> You really need to understand exactly the file you get. The IDs are used to refer to items from within other items. You need to make sure that any such cross-reference is updated as well. It is not just a matter of changing the ID tag in a bucket item; you also need to update all places that refer to a bucket by ID. The crush map defines a tree structure, and a wrong reference can get you into serious trouble.
>
> Before attempting anything like this, make sure you have a backup of the original crush map (in several places).
>
> Generally speaking, your tweaking of the crush map is maybe a bit premature. You wrote that you want to add quite a number of servers. Why don't you do the crush map change together with that? All the data will be reshuffled then anyway.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Kyriazis, George <george.kyriazis@xxxxxxxxx>
> Sent: 05 June 2020 17:21
> To: Frank Schilder
> Cc: ceph-users; Wido den Hollander
> Subject: Re: Best way to change bucket hierarchy
>
> Hmm,
>
> Sounds quite dangerous. On the other hand, and from prior experience, it could take weeks/months for the cluster to rebalance, so I'll give it a try.
>
> From the looks of it, there is no other reference to IDs, is that correct? Just swap IDs between chassis and host and I should be OK? (Sorry, I’m not following the list closely, so I am not aware of Wido’s procedure).
>
> Thanks,
>
> George
>
>
>> On Jun 5, 2020, at 1:29 AM, Frank Schilder <frans@xxxxxx> wrote:
>>
>> Hi George,
>>
>> yes, I believe your interpretation is correct. Because the chassis buckets have new bucket IDs, the distribution hashing will change. I also believe that the trick to avoid data movement in your situation is to export the new crush map, swap the IDs between each corresponding host and chassis bucket in *all* (!!!) occurrences, and import it again. This is possible because you currently have the special case of a one-to-one correspondence between hosts and chassis.
>>
>> This would be the procedure Wido explained, and there is no other choice for this edit.
>>
>> Whether you want to do that depends on how far you are into the data movement. If it's almost done, I wouldn't bother. If it's another month, it might be worth trying. As far as I can see, your crush map is going to be a short text file, so it should be feasible to edit.
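
A minimal sketch of the ID swap described above, assuming the crush map has already been exported and decompiled to text (for example with `ceph osd getcrushmap -o` followed by `crushtool -d`). The sample map is a hypothetical excerpt, not the poster's real file, and the helper name `swap_bucket_ids` is made up for illustration. It only rewrites the IDs you list explicitly, so any further ID lines a real map carries (such as per-device-class IDs) would have to be listed as additional pairs; whether that is required is not settled in this thread.

```python
import re

# Hypothetical excerpt in the text format written by `crushtool -d`.
# Bucket names and IDs follow the tree quoted in this thread; the exact
# layout of a real map (tabs, comments, per-class "id ... class ..." lines)
# may differ, so check it against your own file.
SAMPLE = """\
host vis-hsw-01 {
    id -9
    alg straw2
    hash 0
    item osd.3 weight 10.913
    item osd.6 weight 14.552
    item osd.10 weight 14.552
}
chassis chassis-hsw1 {
    id -5
    alg straw2
    hash 0
    item vis-hsw-01 weight 40.017
}
"""

# Bucket IDs are negative integers on lines of the form "id -N ...".
ID_LINE = re.compile(r"^([ \t]*id[ \t]+)(-\d+)\b", re.MULTILINE)


def swap_bucket_ids(crush_text, pairs):
    """Return crush_text with the two IDs of each (a, b) pair exchanged."""
    mapping = {}
    for a, b in pairs:
        mapping[a] = b
        mapping[b] = a

    def replace(match):
        old = int(match.group(2))
        # IDs that are not part of any pair are left untouched.
        return match.group(1) + str(mapping.get(old, old))

    # A single pass over the text, so a swapped ID is never swapped back.
    return ID_LINE.sub(replace, crush_text)


if __name__ == "__main__":
    print(swap_bucket_ids(SAMPLE, [(-9, -5)])
    )
```

For the tree quoted in this thread the full list of pairs would presumably be (-9, -5), (-13, -6), (-11, -7), (-3, -8) and (-15, -17). The edited text would then be recompiled with `crushtool -c` and injected with `ceph osd setcrushmap -i`, keeping the original map as a backup.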
>>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Kyriazis, George <george.kyriazis@xxxxxxxxx>
>> Sent: 05 June 2020 01:36
>> To: Frank Schilder
>> Cc: ceph-users
>> Subject: Re: Best way to change bucket hierarchy
>>
>> Understand that it’s difficult to debug remotely. :-)
>>
>> In my current scenario I have 5 machines (1 host per chassis), but planning on adding some additional chassis with 4 hosts per chassis in the near future. Currently I am going through the first stage of adding “stub” chassis for the 5 hosts/chassis that I have, basically reparenting each host to its own chassis, as shown below:
>>
>> ID   CLASS  WEIGHT     TYPE NAME                 STATUS  REWEIGHT  PRI-AFF
>>  -1         203.72598  root default
>>  -5          40.01700      chassis chassis-hsw1
>>  -9          40.01700          host vis-hsw-01
>>   3  hdd     10.91299              osd.3         up      1.00000   1.00000
>>   6  hdd     14.55199              osd.6         up      1.00000   1.00000
>>  10  hdd     14.55199              osd.10        up      1.00000   1.00000
>>  -6          40.01700      chassis chassis-hsw2
>> -13          40.01700          host vis-hsw-02
>>   0  hdd     10.91299              osd.0         up      1.00000   1.00000
>>   7  hdd     14.55199              osd.7         up      1.00000   1.00000
>>  11  hdd     14.55199              osd.11        up      1.00000   1.00000
>>  -7          40.01700      chassis chassis-hsw3
>> -11          40.01700          host vis-hsw-03
>>   4  hdd     10.91299              osd.4         up      1.00000   1.00000
>>   8  hdd     14.55199              osd.8         up      1.00000   1.00000
>>  12  hdd     14.55199              osd.12        up      1.00000   1.00000
>>  -8          40.01700      chassis chassis-hsw4
>>  -3          40.01700          host vis-hsw-04
>>   5  hdd     10.91299              osd.5         up      1.00000   1.00000
>>   9  hdd     14.55199              osd.9         up      1.00000   1.00000
>>  13  hdd     14.55199              osd.13        up      1.00000   1.00000
>> -17          43.65799      chassis chassis-hsw5
>> -15          43.65799          host vis-hsw-05
>>   1  hdd     14.55299              osd.1         up      1.00000   1.00000
>>   2  hdd     14.55299              osd.2         up      1.00000   1.00000
>>  14  hdd     14.55299              osd.14        up      1.00000   1.00000
>>
>> There is no additional constraint that is being added, so ideally there would be no data movement.
>> However, I can imagine that the CRUSH algorithm could hash the PGs onto different OSDs now because there is a new thing to consider (namely the chassis). Does it do that?
>>
>> Thanks,
>>
>> George
>>
>>
>> On Jun 4, 2020, at 6:22 PM, Frank Schilder <frans@xxxxxx> wrote:
>>
>> It's hard to tell without knowing what the diff is, but from your description I take it that you changed the failure domain for every(?) pool from host to chassis. I don't know what a chassis is in your architecture, but if each chassis contains several host buckets, then yes, I would expect almost every PG to be affected.
>>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Kyriazis, George <george.kyriazis@xxxxxxxxx>
>> Sent: 05 June 2020 00:28:43
>> To: Frank Schilder
>> Cc: ceph-users
>> Subject: Re: Best way to change bucket hierarchy
>>
>> Hmm,
>>
>> So I tried all that, and I got almost all of my PGs being remapped. Crush map looks correct. Is that normal?
>>
>> Thanks,
>>
>> George
>>
>>
>> On Jun 4, 2020, at 2:33 PM, Frank Schilder <frans@xxxxxx> wrote:
>>
>> Hi George,
>>
>> you don't need to worry about that too much. Unfortunately, the EC profile contains two types of information: one part about the actual EC encoding and another part about crush parameters. Part of this information is mutable after pool creation while the rest is not. Mutable here means outside of the profile. You can change the failure domain in the crush map without issues, but the profile won't reflect that change. That's an inconsistency we currently have to live with; it would have been better to separate mutable data (like failure domain) from immutable data (like k and m) or to provide a meaningful interface to maintain consistency of mutable information.
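
To make the profile/rule mismatch described above concrete, here is a small sketch that compares the failure domain recorded in an EC profile with the one a crush rule actually uses. The profile text is a trimmed copy of the `ceph osd erasure-code-profile get ec22` output quoted elsewhere in this thread. The rule JSON is a hypothetical, simplified example of what `ceph osd crush rule dump <rule>` might print after the rule was edited to use chassis; the rule name `ecpool` and the helper names are assumptions for illustration.

```python
import json

# Trimmed copy of the profile output quoted in this thread.
PROFILE_TEXT = """\
crush-device-class=hdd
crush-failure-domain=host
crush-root=default
k=2
m=2
"""

# Hypothetical, simplified rule dump after a manual edit to "chassis".
# Field names follow the JSON layout of `ceph osd crush rule dump`, but
# verify them against the output of your own cluster.
RULE_JSON = """
{
  "rule_name": "ecpool",
  "steps": [
    {"op": "take", "item": -2, "item_name": "default~hdd"},
    {"op": "chooseleaf_indep", "num": 0, "type": "chassis"},
    {"op": "emit"}
  ]
}
"""


def parse_profile(text):
    """Turn key=value lines of an EC profile into a dict."""
    return dict(line.split("=", 1) for line in text.splitlines() if "=" in line)


def rule_failure_domain(rule):
    """Return the bucket type of the first choose/chooseleaf step, if any."""
    for step in rule["steps"]:
        if step["op"].startswith("choose"):
            return step["type"]
    return None


def profile_matches_rule(profile, rule):
    """True if the profile's crush-failure-domain equals the rule's."""
    return profile.get("crush-failure-domain") == rule_failure_domain(rule)


if __name__ == "__main__":
    profile = parse_profile(PROFILE_TEXT)
    rule = json.loads(RULE_JSON)
    print("profile says:", profile["crush-failure-domain"])
    print("rule uses:   ", rule_failure_domain(rule))
    print("consistent:  ", profile_matches_rule(profile, rule))
```

In this constructed example the profile still reports `host` while the rule places chunks per `chassis`, which is the harmless but confusing inconsistency the message describes. The rule is what counts.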
>>
>> In short, don't believe everything the EC profile tells you. Some information might be out of date, like the failure domain or the device class (basically everything starting with crush-). If you remember that, you are out of trouble. Always dump the crush rule of an EC pool explicitly to see the true parameters in action.
>>
>> Having said that, to change the failure domain for an EC pool, change the crush rule of that pool - I did this too and it works just fine. By default, the crush rule has the same name as the pool. I'm afraid that here you will have to do a manual edit of the crush rule, as Wido explained. There is no other way - at least not currently.
>>
>> You can ask on this list for confirmation that your change is doing what you want.
>>
>> Do not try to touch an EC profile; they are read-only anyway. The crush parameters are only used at pool creation and never looked at again. You can override these by editing the crush rule as explained above.
>>
>> Best regards and good luck,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Kyriazis, George <george.kyriazis@xxxxxxxxx>
>> Sent: 04 June 2020 20:56:38
>> To: Frank Schilder
>> Cc: ceph-users
>> Subject: Re: Best way to change bucket hierarchy
>>
>> Thanks Frank,
>>
>> Interesting info about the EC profile. I do have an EC pool, but I noticed the following when I dumped the profile:
>>
>> # ceph osd erasure-code-profile get ec22
>> crush-device-class=hdd
>> crush-failure-domain=host
>> crush-root=default
>> jerasure-per-chunk-alignment=false
>> k=2
>> m=2
>> plugin=jerasure
>> technique=reed_sol_van
>> w=8
>> #
>>
>> Which says that the failure domain of the EC profile is also set to host. Looks like I need to change the EC profile, too, but since it is associated with the pool, maybe I can’t do that after pool creation? Or….
>> Since the property is named “crush-failure-domain”, it’s automatically inherited from the crush profile, so I don’t have to do anything?
>>
>> Thanks,
>>
>> George
>>
>>
>> On Jun 4, 2020, at 1:51 AM, Frank Schilder <frans@xxxxxx> wrote:
>>
>> Hi George,
>>
>> for replicated rules you can simply create a new crush rule with the new failure domain set to chassis and change any pool's crush rule to this new one. If you have EC pools, then the chooseleaf step needs to be edited by hand. I did this before as well. (A really unfortunate side effect is that the EC profile attached to the pool goes out of sync with the crush map, and there is nothing one can do about that. This is annoying yet harmless.)
>>
>> The intent of doing these changes while norebalance is set is
>>
>> - to avoid unnecessary data movement due to successive changes happening step by step and
>> - to make sure peering is successful before starting to move data.
>>
>> I believe OSDs peer a bit faster with norebalance set, and there is then a shorter interruption to ongoing I/O (no I/O happens to a PG during peering).
>>
>> Yes, if you save the old crush map, you can undo everything. It is a good idea to have a backup also just for reference and to compare before and after.
>>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Kyriazis, George <george.kyriazis@xxxxxxxxx>
>> Sent: 04 June 2020 00:58:20
>> To: Frank Schilder
>> Cc: ceph-users
>> Subject: Re: Best way to change bucket hierarchy
>>
>> Thanks Frank,
>>
>> I don’t have too much experience editing crush rules, but I assume the chooseleaf step would also have to change to:
>>
>>   step chooseleaf firstn 0 type chassis
>>
>> Correct? Is that the only other change that is needed?
>> It looks like the rule change can happen both inside and outside the “norebalance” setting (again with CLI commands), but is it safer to do it inside (i.e. while not rebalancing)?
>>
>> If I keep a backup of the crush map (with “ceph osd getcrushmap”), I assume I can restore the old map if something goes bad?
>>
>> Thanks again!
>>
>> George
>>
>>
>>
>> On Jun 3, 2020, at 5:24 PM, Frank Schilder <frans@xxxxxx> wrote:
>>
>> You can use the command line without editing the crush map. Look at the documentation of commands like
>>
>>   ceph osd crush add-bucket ...
>>   ceph osd crush move ...
>>
>> Before starting this, run "ceph osd set norebalance", and unset it once you are happy with the crush tree. Let everything peer. You should see misplaced objects and remapped PGs, but no degraded objects or PGs.
>>
>> Do this only when the cluster is HEALTH_OK, otherwise things can get really complicated.
>>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Kyriazis, George <george.kyriazis@xxxxxxxxx>
>> Sent: 03 June 2020 22:45:11
>> To: ceph-users
>> Subject: Best way to change bucket hierarchy
>>
>> Hello,
>>
>> I have a live ceph cluster, and I need to modify the bucket hierarchy. I am currently using the default crush rule (i.e. keep each replica on a different host). My need is to add a “chassis” level, and keep replicas on a per-chassis level.
>>
>> From what I read in the documentation, I would have to edit the crush file manually; however, this sounds kinda scary for a live cluster.
>>
>> Are there any “best known methods” to achieve that goal without messing things up?
>>
>> In my current scenario, I have one host per chassis, and planning on later adding nodes where there would be >1 hosts per chassis. It looks like “in theory” there wouldn’t be a need for any data movement after the crush map changes. Will reality match theory? Anything else I need to watch out for?
>>
>> Thank you!
>>
>> George
>>
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
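
The command-line route suggested in this thread (back up the map, set norebalance, add chassis buckets, move hosts under them, then unset norebalance) can be sketched as a small generator of command strings. This is only an illustration under stated assumptions: it prints commands and does not run anything, the backup file name `crushmap.backup` and the helper name `reparent_commands` are made up, and the host-to-chassis mapping is read off the `ceph osd tree` output quoted in the thread.

```python
# Host -> chassis pairs as they appear in the tree quoted in this thread.
HOST_TO_CHASSIS = {
    "vis-hsw-01": "chassis-hsw1",
    "vis-hsw-02": "chassis-hsw2",
    "vis-hsw-03": "chassis-hsw3",
    "vis-hsw-04": "chassis-hsw4",
    "vis-hsw-05": "chassis-hsw5",
}


def reparent_commands(host_to_chassis, root="default",
                      backup_file="crushmap.backup"):
    """Build the list of ceph CLI commands for the reparenting, in order."""
    cmds = [
        # Keep a copy of the current (binary) crush map for rollback.
        f"ceph osd getcrushmap -o {backup_file}",
        # Hold back data movement while the tree is being rearranged.
        "ceph osd set norebalance",
    ]
    # Create every chassis bucket and hang it under the root first ...
    for chassis in sorted(set(host_to_chassis.values())):
        cmds.append(f"ceph osd crush add-bucket {chassis} chassis")
        cmds.append(f"ceph osd crush move {chassis} root={root}")
    # ... then move each host under its chassis.
    for host, chassis in sorted(host_to_chassis.items()):
        cmds.append(f"ceph osd crush move {host} chassis={chassis}")
    # Meant to be run by hand, only after the tree looks right and all
    # PGs have peered.
    cmds.append("ceph osd unset norebalance")
    return cmds


if __name__ == "__main__":
    for cmd in reparent_commands(HOST_TO_CHASSIS):
        print(cmd)
```

Printing the commands rather than executing them leaves room to review the list and to check `ceph osd tree` between steps, in line with the advice to do this only while the cluster is healthy. As the thread shows, this route still gives the new chassis buckets fresh IDs, so remapped PGs are to be expected.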