Thanks all for the advice, very helpful!

The node also had a mon, which happily slotted right back into the cluster.

The node's been up and running for a number of days now, but the systemd OSD processes don't seem to be trying continuously; they never progress or get a newer map.

As mentioned, the cluster is otherwise healthy (apart from these OSDs, which are down and out), and I have spare capacity and no issue with min_size. They've also been out for a long time (months), so it's reasonable to guess that most PGs have been touched since.

So, based on the advice, my plan is the following:

1. Set norebalance
2. One by one, do this for each OSD:
   * Purge the OSD from the dashboard
   * cephadm ceph-volume lvm zap
   * cephadm may automatically find and add the OSD, otherwise I'll add it manually
3. Use pgremapper <https://github.com/digitalocean/pgremapper> to prevent the OSDs from being filled right away
4. Unset norebalance
5. Let the balancer gently flow data back into the OSDs over the next hours, days, weeks.

Thanks all!

________________________________
From: Richard Bade 'hitrich at gmail.com' <ceph-mail@xxxxxxxxxxxxxxxx>
Sent: Thursday, September 7, 2023 01:25
To: ceph-mail@xxxxxxxxxxxxxxxx <ceph-mail@xxxxxxxxxxxxxxxx>
Subject: Re: Re: Is it possible (or meaningful) to revive old OSDs?

Yes, I agree with Anthony. If your cluster is healthy and you don't *need* to bring them back in, it's going to be less work and time to just deploy them as new.
I usually set norebalance, purge the osds in ceph, remove the vg from the disks and re-deploy. Then unset norebalance at the end once everything is peered and happy. This is so that it doesn't start moving stuff around when you purge.

Rich

On Thu, 7 Sept 2023 at 02:21, Anthony D'Atri <anthony.datri@xxxxxxxxx> wrote:
>
> Resurrection usually only makes sense if fate or a certain someone resulted in enough overlapping removed OSDs that you can't meet min_size.
> I've had to a couple of times :-/
>
> If an OSD is down for more than a short while, backfilling a redeployed OSD will likely be faster than waiting for it to peer and do deltas -- if it can at all.
>
> > On Sep 6, 2023, at 10:16, Malte Stroem <malte.stroem@xxxxxxxxx> wrote:
> >
> > Hi ceph-mail@xxxxxxxxxxxxxxxx,
> >
> > you could squeeze the OSDs back in but it does not make sense.
> >
> > Just clean the disks with dd for example and add them as new disks to your cluster.
> >
> > Best,
> > Malte
> >
> > On 04.09.23 at 09:39, ceph-mail@xxxxxxxxxxxxxxxx wrote:
> >> Hello,
> >> I have a ten node cluster with about 150 OSDs. One node went down a while back, several months. The OSDs on the node have been marked as down and out since.
> >> I am now in the position to return the node to the cluster, with all the OS and OSD disks. When I boot up the now working node, the OSDs do not start.
> >> Essentially, it seems to complain with "fail[ing] to load OSD map for [various epoch]s, got 0 bytes".
> >> I'm guessing the OSDs' on-disk maps are so old, they can't get back into the cluster?
> >> My questions are whether it's possible or worth it to try to squeeze these OSDs back in or to just replace them. And if I should just replace them, what's the best way? Manually remove [1] and recreate? Replace [2]? Purge in dashboard?
> >> [1] https://docs.ceph.com/en/quincy/rados/operations/add-or-rm-osds/#removing-osds-manual
> >> [2] https://docs.ceph.com/en/quincy/rados/operations/add-or-rm-osds/#replacing-an-osd
> >> Many thanks!
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
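
For anyone finding this thread later: the redeploy plan discussed above (set norebalance, purge and zap each OSD, run pgremapper, unset norebalance) could be sketched roughly as the dry run below. It only builds and prints the commands, nothing is executed. Several things here are assumptions and not from the thread: the helper name `build_redeploy_plan`, the example OSD ids and device paths, using `ceph osd purge` on the CLI in place of the dashboard purge, the `--destroy` flag on the zap, and `pgremapper cancel-backfill --yes` as the pgremapper step. Treat it as a checklist to adapt, not a tested procedure.

```python
import shlex


def build_redeploy_plan(osds):
    """Return the commands for the redeploy plan as argument lists.

    `osds` maps an OSD id to the device path backing it. Both are
    hypothetical example values; look up the real ones on your cluster.
    """
    # Stop data from moving around while OSDs are purged.
    plan = [["ceph", "osd", "set", "norebalance"]]

    for osd_id, device in sorted(osds.items()):
        # Assumption: CLI stand-in for "purge the OSD from the dashboard".
        plan.append(["ceph", "osd", "purge", str(osd_id),
                     "--yes-i-really-mean-it"])
        # Wipe the old LVM data so the disk can be deployed as new.
        # Must run on the host that owns the device.
        plan.append(["cephadm", "ceph-volume", "--", "lvm", "zap",
                     "--destroy", device])

    # Assumption: run this only once the OSDs have been re-created
    # (automatically or manually) and everything has peered.
    plan.append(["pgremapper", "cancel-backfill", "--yes"])

    # Hand control back; the balancer can then refill the OSDs gradually.
    plan.append(["ceph", "osd", "unset", "norebalance"])
    return plan


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for cmd in build_redeploy_plan({12: "/dev/sdb", 13: "/dev/sdc"}):
        print(shlex.join(cmd))
```

Keeping the plan as plain data makes it easy to review the full command list before touching the cluster, and to pause between steps (for example, to wait for the new OSDs to appear) rather than running everything in one go.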