Janne explained the reason for the double movement. There are two scenarios for phasing out OSDs, and they have different procedures:

1) Replacing a broken disk/host with a new one

If you just want to replace a disk or host, the idea is to keep the OSD IDs in the CRUSH map so that data movement is reduced to a minimum. There is even a special command for that, "ceph osd destroy <osdname>", which keeps the ID intact. You can then re-create the OSD with the same ID on the same host and rebuild the disk/host without much data movement. There is not really a need to wait for full recovery before replacing the disk/host.

2) Phasing out OSDs altogether

I'm actually not sure this ever really happens. I haven't heard of clusters shrinking in size; hardware is usually replaced. If you really need to remove OSDs and will never add OSDs again, you can purge them from the cluster. However, this leaves holes in the sequence of OSD IDs, and a bug related to that was discovered and fixed only recently, so it is something worth avoiding. If you need to re-purpose OSD IDs (the new hosts have a different disk count or are otherwise different), a better way is to create a special CRUSH root as a parking lot and move the OSD IDs to this root until you re-deploy new OSDs. This way you avoid holes in the OSD ID sequence while also having only one data movement, because the CRUSH map for the original root is re-computed only once.

On our cluster I never delete OSD IDs; there is really never a reason to. I either replace disks and assign the same IDs in the same CRUSH location again, or move the IDs to a parking space (a second root) and re-use them later in new CRUSH locations. All of this is handled by ceph.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>
Sent: 05 December 2021 00:32
To: Janne Johansson
Cc: ceph-users
Subject: Re: Removing an OSD node the right way

Maybe if you want to reprovision the OSDs on that node, it's even faster to set noout/norebalance, purge all the OSDs, recreate them, and then unset noout/norebalance, so the rebalance happens only once. Not sure if there is any issue with this.

Istvan Szabo
Senior Infrastructure Engineer
---------------------------------------------------
Agoda Services Co., Ltd.
e: istvan.szabo@xxxxxxxxx<mailto:istvan.szabo@xxxxxxxxx>
---------------------------------------------------

On 2021. Dec 3., at 13:46, Janne Johansson <icepic.dz@xxxxxxxxx> wrote:

On Fri, 3 Dec 2021 at 13:08, huxiaoyu@xxxxxxxxxxxx <huxiaoyu@xxxxxxxxxxxx> wrote:

Dear Cephers,

I had to remove a failed OSD server node, and what I did was the following:

1) First, I marked all OSDs on that (to-be-removed) server down and out.
2) Second, I let Ceph do the backfilling and rebalancing and waited for it to complete.
3) Now I had full redundancy, so I deleted those removed OSDs from the cluster, e.g. "ceph osd crush remove osd.${OSD_NUM}".
4) To my surprise, after removing those already-out OSDs from the cluster, I saw tons of PGs remapped and, once again, BACKFILLING/REBALANCING.

What is the major problem with the above procedure that caused the double BACKFILLING/REBALANCING? Could the root cause be those "already-out" OSDs that were not yet removed from CRUSH? I previously thought "out" OSDs would not impact CRUSH, but it seems I was wrong.
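For concreteness, a minimal sketch of the low-movement procedures described above (the keep-the-ID disk replacement, the parking-lot root, and the noout/norebalance bracket Istvan suggests). The OSD id 17, the device /dev/sdX, the bucket name "parking" and the CRUSH weight are placeholders only, so treat this as an illustration rather than a complete runbook:

  # 1) Replace a failed disk while keeping the OSD ID (minimal data movement)
  ceph osd destroy 17 --yes-i-really-mean-it         # osd.17 is marked destroyed; its ID and CRUSH entry remain
  # swap the physical disk, then re-create the OSD with the same ID, e.g. via ceph-volume:
  ceph-volume lvm create --osd-id 17 --data /dev/sdX

  # 2) Park an OSD ID in a second CRUSH root instead of purging it
  ceph osd crush add-bucket parking root             # one-time: create the parking-lot root
  ceph osd crush create-or-move osd.17 1.0 root=parking   # the weight 1.0 is illustrative

  # 3) Re-provision a whole node with a single rebalance
  ceph osd set noout
  ceph osd set norebalance
  # ... purge and re-create the OSDs on that node ...
  ceph osd unset norebalance
  ceph osd unset noout

The flags in step 3 only postpone the data movement; the one large rebalance still happens after they are unset, which is the point of doing it in a single pass.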
If it is still in the CRUSH map with a CRUSH weight, it will "claim" that it brings space to the OSD host, even if there are no active PGs on it. That also means that the first set of movements you did more or less placed the PGs in temporary locations (as if waiting for the host to come back so they could return there later). When the host finally goes away for real, the CRUSH map is recalculated and PGs will move, not just the previously moved PGs but others too, because there now are fewer hosts and PGs should relocate anyhow.

Consider having 4 hosts and 12 PGs:

  H1    H2    H3    H4
  PG1   PG2   PG3   PG4
  PG5   PG6   PG7   PG8
  PG9   PG10  PG11  PG12

then you out the OSDs on Host4 (carrying PG4, PG8 and PG12) in order to remove it. In this first step, when you out the OSDs, PG4, PG8 and PG12 end up on H1, H2 and H3 in some order, just to "recover" them into full redundancy as a repair operation. H4 is still thought to exist, but doesn't carry any active PGs:

  H1    H2    H3    H4
  PG1   PG2   PG3   -
  PG5   PG6   PG7   -
  PG9   PG10  PG11  -
  PG4   PG8   PG12  -

When you finally make H4 go away, the new "normal" placement should probably be:

  H1    H2    H3
  PG1   PG2   PG3
  PG4   PG5   PG6
  PG7   PG8   PG9
  PG10  PG11  PG12

which means that PG5 now needs to move from H1 to H2, PG6 from H2 to H3, PG7 from H3 to H1, and so on. This is of course my simplified view of the algorithm, but in general this is how I experience it, and it shows why PG5 moves after you completely remove H4, even though PG5 wasn't involved in H4 or the first movement round at all.

So the CRUSH result while the host is still present differs from the result when it is completely gone, which is why you see two moves and why the second one can be much larger. In my example, PG4, PG8 and PG12 happened to place themselves on the "correct" hosts, but they could just as well have ended up as 12, 8, 4 instead of 4, 8, 12, and then they would move along with PG5, PG6 and so on.

On top of this, you don't have just "a PG" but replication=X copies, or K+M EC shards, of each PG, and not 12 PGs but hundreds or thousands. You often have two or 20 OSDs on every OSD host, so in the final recalculation of the CRUSH map when Host4 is gone, you might also see movement of PGs within a host, between OSDs, just because some PG is now meant to be on another drive on the same host.

--
May the most significant bit of your life be positive.
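If you want to get a feel for this second round of movement before triggering it, you can compare the placements CRUSH computes with and without the host. A rough sketch only: the file names are arbitrary, rule id 0 and 3 replicas are assumptions, and crushtool feeds synthetic inputs through the map rather than your real PGs, so the diff indicates roughly what fraction of placements would change, not an exact PG list:

  ceph osd getcrushmap -o crush.bin            # current compiled CRUSH map
  crushtool -d crush.bin -o crush.txt          # decompile to editable text
  # edit crush.txt: remove the host bucket you plan to retire (and its item in the root), then recompile
  crushtool -c crush.txt -o crush-new.bin
  crushtool -i crush.bin     --test --show-mappings --rule 0 --num-rep 3 > before.txt
  crushtool -i crush-new.bin --test --show-mappings --rule 0 --num-rep 3 > after.txt
  diff before.txt after.txt                    # differing lines are placements that would change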