Dear Wesley,

yes, the left-over PGs are something I need to avoid. Unfortunately, we have old clients connected and I cannot enable upmap. Therefore, I'm looking at the out-OSD approach.

What you describe matches my interpretation of this piece of documentation: https://docs.ceph.com/en/reef/rados/operations/monitoring-osd-pg/?highlight=osd+states+out#monitoring-osds. OUT OSDs continue participating in IO, but PGs are migrated away.

The only problem is that setting an OSD OUT might not be sticky. If the OSD reboots for some reason, it might mark itself IN again. I believe there was a config option that can prevent that (or a different osd out command that uses an ID). If you remember something like that, please let me know. If I find it, I will post it here.

Thanks for your review!

=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Wesley Dillingham <wes@xxxxxxxxxxxxxxxxx>
Sent: Thursday, October 10, 2024 8:04 PM
To: Frank Schilder
Cc: Anthony D'Atri; ceph-users@xxxxxxx
Subject: Re: Re: Procedure for temporary evacuation and replacement

If you are replacing the OSDs with the same size/weight device, I agree with your reweight approach. I've been doing some similar work myself that does require crush reweighting to 0 and have been in that headspace.

I did a bit of testing around this:

- Even with the lowest possible reweight an OSD would take, 1 PG was left on my up+in OSD: "ceph osd reweight osd.1 0.00002" results in a reweight of 0.00002, whereas "ceph osd reweight osd.1 0.00001" results in a reweight of 0 (out).
- With my OSD in a state of UP + IN and a reweight of 0.00002, I used upmap to move that 1 PG off of the OSD, to be left with 0 PGs there.
- I attempted to destroy the OSD in this state, but it complained it was not down, so I marked it down and set the noup flag.
- With the OSD in a down + in state, the OSD could be destroyed. This surprised me; I assumed it would need to be marked (or transition to) out as well.

So in summary: I think you will be left with 1 or more PGs on the OSD with your approach of reweighting to a very low value. You will then either need to later mark it fully out / reweight it to 0, or use the upmap approach so as not to degrade that remaining PG when getting it marked down. I don't think there is any danger in reweighting to 0 (or marking it out) vs. reweighting it to a very low value, and now that I have more clarity on what you want to do, that is exactly the approach I would take (mark it out).

Respectfully,

Wes Dillingham
LinkedIn<http://www.linkedin.com/in/wesleydillingham>
wes@xxxxxxxxxxxxxxxxx<mailto:wes@xxxxxxxxxxxxxxxxx>

On Thu, Oct 10, 2024 at 9:58 AM Frank Schilder <frans@xxxxxx<mailto:frans@xxxxxx>> wrote:

Thanks Anthony and Wesley for your input. Let me explain in more detail why I'm interested in the somewhat obscure-looking procedure in step 1.

What's the difference between "ceph osd reweight" and "ceph osd crush reweight"? The difference is that command 1 only remaps shards within the same failure domain (as Anthony noted), while command 2 implies global changes to the crush map with redundant data movement. In other words, "ceph osd reweight osd.X 0" will only move the shards from osd.X to other OSDs (in the same failure domain), while "ceph osd crush reweight osd.X 0" has a global effect and will move a lot more around.

This "a lot more" is what I want to avoid. There is necessary data movement, namely the data on the OSDs I want to evacuate, and there is redundant data movement, which is everything else. So, for evacuation, the first command is the command of choice if one wants to move exactly the shards that need to move.
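As an aside, Wesley's observation above (reweight 0.00002 sticks, but 0.00001 drops the OSD to 0, i.e. out) is consistent with the reweight being stored internally as a fixed-point fraction of 0x10000, the value Ceph uses for a fully IN OSD. A minimal Python sketch of that idea; the exact truncation behaviour here is my assumption, not something verified against the Ceph source:

```python
# Sketch (assumption): "ceph osd reweight" values are encoded internally as
# an integer fraction of 0x10000 (the weight of a fully "in" OSD), so any
# reweight below 1/65536 truncates to 0, which is the same as marking it out.
CEPH_OSD_IN = 0x10000  # internal weight of a fully "in" OSD

def encode_reweight(w: float) -> int:
    """Convert a [0.0, 1.0] reweight to its assumed internal integer form."""
    return int(w * CEPH_OSD_IN)

def decode_reweight(raw: int) -> float:
    """Convert the internal integer form back to a float reweight."""
    return raw / CEPH_OSD_IN

print(encode_reweight(0.00002))       # 1 -> smallest non-zero reweight, OSD stays IN
print(encode_reweight(0.00001))       # 0 -> same as "ceph osd reweight osd.X 0", OSD goes OUT
print(round(decode_reweight(1), 5))   # 2e-05 -> matches the 0.00002 Wesley saw reported back
```

This would explain why 0.00002 is the lowest reweight an OSD "would take": it is the smallest float that still encodes to a non-zero internal weight.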
If one re-creates OSDs with exactly the same IDs and weights that the evacuated OSDs had, which is the default when using "ceph osd destroy" as it preserves the crush weights, then, after adding the new OSDs, exactly the shards that were evacuated in step 1 will move back. That's the minimum possible data movement: data moved = data that needs to move.

I don't have the balancer or anything else enabled that could interfere with that procedure. Please don't bother commenting about things like that.

My actual question is: how dangerous is it to use

  ceph osd reweight osd.X 0

instead of

  ceph osd reweight osd.X 0.001

The first command will mark the OSD OUT while the second won't. The second command might leave 1-2 PGs on the OSDs, while the first one won't. Does the OSD being formally UP+OUT make any difference compared with UP+IN for evacuation? My initial simplistic test says no, but I would like to be a bit more sure than that.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
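For reference, the minimal-data-movement evacuation and replacement procedure discussed in this thread might be sketched as the following command sequence. This is a sketch, not a tested runbook: osd.X and /dev/sdY are placeholders, and the mon_osd_auto_mark_auto_out_in setting named in the comments is only a guess at the "sticky out" option mentioned above, so verify it against the documentation for your release before relying on it.

```shell
# Evacuate osd.X with minimal data movement: "ceph osd reweight" leaves the
# crush weight intact, so only the shards on osd.X are remapped (within the
# same failure domain), unlike "ceph osd crush reweight".
ceph osd reweight osd.X 0    # marks osd.X OUT; PGs migrate away while it stays UP

# Possibly make the OUT state survive an OSD restart. This option name is an
# assumption about the setting Frank is looking for; check your release.
ceph config set mon mon_osd_auto_mark_auto_out_in false

# Once the OSD reports 0 PGs, stop it and destroy it. "ceph osd destroy"
# preserves the OSD ID and crush weight for reuse.
systemctl stop ceph-osd@X    # or the equivalent for your deployment tooling
ceph osd destroy osd.X --yes-i-really-mean-it

# Re-create the OSD on the replacement device with the same ID; only the
# shards evacuated above should move back.
ceph-volume lvm create --osd-id X --data /dev/sdY
```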