Re: What to expect on rejoining a host to cluster?

Frank Schilder <frans@xxxxxx> · Sun, 27 Nov 2022 12:27:13 +0000

Hi Matt,

If you didn't touch the OSDs on that host, they will rejoin and only objects that have been modified in the meantime will actually be updated. Ceph keeps some basic history information and can detect changes. Two weeks is not a very long time. If you have a lot of cold data, re-integration will go fast.

Initially, you will see a huge number of misplaced objects. However, this count will drop much faster than the objects/s recovery rate suggests, since objects that were not modified do not need to be copied again.
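
To watch the misplaced count shrink, something like the following could work. It is only a sketch: misplaced_pct is a made-up helper name, and it assumes the status output contains a line of the form "N/M objects misplaced (X%)", which may differ between releases.

```shell
#!/bin/sh
# Hypothetical helper: reads `ceph status` text on stdin and prints the
# misplaced percentage, or nothing if no such line is present.
# Assumption: the line looks like "4521/1203344 objects misplaced (0.376%)".
misplaced_pct() {
    sed -n 's/.*objects misplaced (\([0-9.]*\)%).*/\1/p'
}

# Example use on a live cluster (not executed here):
#   ceph status | misplaced_pct
```

Running it every minute or so should show the percentage falling quickly even while the objects/s figure looks modest.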

Before you rejoin the host, I would fix its issues though. Now that you have it out of the cluster, do the maintenance first. There is no rush. In fact, you can buy a new host, install the OSDs in the new one and join that to the cluster with the host-name of the old host.

If you are considering replacing the host and all disks, then get a new host first and give it the host name that the old host has in the crush map. Just before you deploy the new host, set norebalance, purge all down OSDs in its bucket, and deploy. Then, the data movement is restricted to re-balancing to the new host.
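
A rough sketch of that purge step, in case it helps. The function names and the DRY_RUN switch are my own invention for illustration; by default the script only prints the commands, so nothing is touched until you set DRY_RUN=0. The OSD ids are placeholders.

```shell
#!/bin/sh
# Dry-run wrapper (hypothetical): prints commands unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# purge_down_osds ID...: pause rebalancing, then purge each listed OSD id.
purge_down_osds() {
    run ceph osd set norebalance
    for id in "$@"; do
        run ceph osd purge "$id" --yes-i-really-mean-it
    done
}

# resume_rebalance: clear the flag once the new host's OSDs are deployed.
resume_rebalance() {
    run ceph osd unset norebalance
}

# Example (placeholder ids, not executed here):
#   purge_down_osds 12 13 14
#   ... deploy the new host ...
#   resume_rebalance
```

The ids of the OSDs under the old host's bucket can be taken from ceph osd tree. Check the list carefully before running with DRY_RUN=0, because a purge cannot be undone.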

If you just want to throw out the old host, destroy the OSDs but keep the IDs intact (ceph osd destroy). Then, no further re-balancing will happen and you can re-use the OSD ids later when adding a new host. That's a stable situation from an operations point of view.
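
For the destroy variant, a minimal sketch along the same lines. Again the helper names and DRY_RUN are assumptions of mine, the ids are placeholders, and the default is to print rather than execute.

```shell
#!/bin/sh
# Dry-run wrapper (hypothetical): prints commands unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# destroy_keep_ids ID...: mark each OSD destroyed while keeping its id and
# crush entry, so the id can be re-used by a replacement OSD later.
destroy_keep_ids() {
    for id in "$@"; do
        run ceph osd destroy "$id" --yes-i-really-mean-it
    done
}

# Example (placeholder ids, not executed here):
#   destroy_keep_ids 12 13 14
```

Afterwards the OSDs should show up as destroyed in ceph osd tree, still sitting in the old host's bucket.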

Hope that helps.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Matt Larson <larsonmattr@xxxxxxxxx>
Sent: 26 November 2022 21:07:41
To: ceph-users
Subject:  What to expect on rejoining a host to cluster?

Hi all,

 I have a host with 16 OSDs, each 14TB in capacity, that started having
hardware issues causing it to crash.  I took this host down 2 weeks ago,
and the data rebalanced to the remaining 11 server hosts in the Ceph
cluster over this time period.

 My initial goal was to then remove the host completely from the cluster
with `ceph osd rm XX` and `ceph osd purge XX` (Adding/Removing OSDs — Ceph
Documentation
<https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/>).
However, I found that after the large amount of data migration from the
recovery, purging an OSD and removing it from the crush map still
required another large data move.  It appears that it would have been a
better strategy to assign a weight of 0 to each OSD first, so that there
is only a single larger data move instead of two.

 I'd like to join the downed server back into the Ceph cluster.  It still
has 14 OSDs that are listed as out/down that would be brought back online.
My question is what can I expect if I bring this host online?  Will the
OSDs of a host that has been offline for an extended period of time and out
of the cluster have PGs that are now quite different or inconsistent?  Will
this be problematic?

 Thanks for any advice,
   Matt

--
Matt Larson, PhD
Madison, WI  53705 U.S.A.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx