Hi Eugen,
On 26.08.2020 at 11:47, Eugen Block wrote:
> I don't know if the ceph version is relevant here but I could undo that
> quite quickly in my small test cluster (Octopus native, no docker).
> After the OSD was marked as "destroyed" I recreated the auth caps for
> that OSD_ID (marking as destroyed removes cephx keys etc.), changed the
> keyring in /var/lib/ceph/osd/ceph-1/keyring to reflect that and
> restarted the OSD, now it's up and in again. Is the OSD in your case
> actually up and running?
My cluster is running Octopus too, and your hint regarding the auth caps
put me on the right track to get the OSD back online.
For anyone who ends up in the same situation, here is what I did
(assuming osd.95 is the destroyed OSD and only `ceph osd destroy ...`
was invoked, not `ceph-volume lvm zap ...`). Commands should be run on
the node where the destroyed OSD resides:
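As a quick sanity check before step 1 (same osd.95 as above; adjust the
id for your cluster), the OSD should still be listed, with "destroyed"
in the STATUS column:
ceph osd tree | grep -w osd.95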
1. Make a copy of the OSD's keyring file:
cp /var/lib/ceph/osd/ceph-95/keyring ~/keyring.osd.95
2. Add the following "caps" lines to the copy ~/keyring.osd.95, so that
the file looks like this in the end (leave your key intact):
[osd.95]
        key = <osd-key>
        caps mgr = "allow profile osd"
        caps mon = "allow profile osd"
        caps osd = "allow *"
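If you are not sure what the caps should look like, you can compare
against the auth entry of any healthy OSD in the cluster (osd.94 below
is just a placeholder for some intact OSD id):
ceph auth get osd.94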
3. Now reimport the OSD keyring file:
ceph auth import -i ~/keyring.osd.95
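A quick check that the import took effect; the entry for osd.95 should
now show the three caps lines from above:
ceph auth get osd.95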
4. Create a new OSD, replacing the destroyed one:
ceph osd new $(cat /var/lib/ceph/osd/ceph-95/fsid) 95
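After this the OSD should no longer carry the "destroyed" flag in the
OSD map; you can verify with:
ceph osd dump | grep -w osd.95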
5. Start the OSD again:
systemctl start ceph-osd@95.service
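To confirm it came back up:
systemctl status ceph-osd@95.service
ceph osd tree | grep -w osd.95
The OSD should be reported as "up" again shortly after the start.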
Now the OSD should rejoin the cluster and everything should be back to
normal. At least that is what happened for me, and it fixed my
"incomplete PG" issue.
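If you want to keep an eye on the recovery afterwards, the usual status
commands are enough, e.g.:
ceph -s
ceph health detail
ceph pg ls incomplete
The last command lists any PGs that are still in the "incomplete" state.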
Regards,
Michael