We encountered the following problems while trying to perform
maintenance on a Ceph cluster:
The cluster consists of 7 nodes with 10 OSDs each.
There are 4 pools on it: 3 of them are replicated pools with 3/2
size/min_size and one is an erasure-coded pool with k=5 and m=2.
The following global flags were set:
* noout
* norebalance
* nobackfill
* norecover
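For reference, these were set the usual way:

ceph osd set noout
ceph osd set norebalance
ceph osd set nobackfill
ceph osd set norecover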
Then, with those flags in place, all OSDs were stopped via the command
ceph osd stop, which seems to have caused the issue.
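Stopping every OSD with this command looks roughly like the following
(exact shell form reconstructed; ceph osd ls lists all OSD ids):

ceph osd stop $(ceph osd ls)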
After maintenance was done, all OSDs were started again via systemctl.
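That is, something like the following (unit names assume a
package-based, non-cephadm deployment; osd id 51 as an example):

systemctl start ceph-osd.target    # all OSDs on a host
systemctl start ceph-osd@51        # or a single OSD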
Only about half of the 70 OSDs came up at first; the other half
started, but killed themselves after a few seconds with the following
log messages:
ceph-osd[197270]: 2023-01-24T13:39:12.103+0100 7ff3fcf8d700 -1 osd.51
12161 map says i am stopped by admin. shutting down.
ceph-osd[197270]: 2023-01-24T13:39:12.103+0100 7ff40da55700 -1 received
signal: Interrupt from Kernel ( Could be generated by pthread_kill(),
raise(), abort(), alarm() ) UID: 0
ceph-osd[197270]: 2023-01-24T13:39:12.103+0100 7ff40da55700 -1 osd.51
12161 *** Got signal Interrupt ***
ceph-osd[197270]: 2023-01-24T13:39:12.103+0100 7ff40da55700 -1 osd.51
12161 *** Immediate shutdown (osd_fast_shutdown=true) ***
And indeed, looking at the OSD map via ceph osd dump, the affected
OSDs are marked as stopped:
osd.50 down out weight 0 up_from 9213 up_thru 9416 down_at 9760
last_clean_interval [9106,9207)
[v2:10.0.1.61:6813/6211,v1:10.0.1.61:6818/6211]
[v2:10.0.0.61:6814/6211,v1:10.0.0.61:6816/6211] exists,stop
9a2590c4-f50b-4550-bfd1-5aafb543cb59
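The affected OSDs can be listed by filtering the dump for that state
string:

ceph osd dump | grep 'exists,stop'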
We were able to restore some of the affected OSDs by running

ceph osd out XX
ceph osd in XX

and then starting the service again (via systemctl start). This worked
for most OSDs, except for those located on one specific host. Some OSDs
required several restarts before they stopped killing themselves a few
seconds after starting.
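Since some OSDs only stayed up after several attempts, the restarts can
be scripted; a rough sketch (osd.51 as an example, retry count
arbitrary):

for attempt in 1 2 3; do
    systemctl start ceph-osd@51
    sleep 10    # give the OSD time to shut itself down again
    systemctl is-active --quiet ceph-osd@51 && break
done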
This whole issue seems to be caused by the OSDs being marked as stopped
in the OSD map [1]. Apparently this state should be reset when the OSD
boots again [2], but for some reason this does not happen for some of
the OSDs. This behavior seems to have been introduced by the following
pull request [3]. We also found the commit where the logic regarding
stop was introduced [4].
We looked for commands that reset the stopped status of an OSD in the
OSD map, but did not find any way of forcing this.
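We had searched the CLI help for anything stop-related, roughly like
this, and found no counterpart that clears the state:

ceph --help | grep -i stop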
Since we are out of ideas on how to proceed with the remaining 10 OSDs
that cannot be brought up: how does one recover from this situation? It
seems that running ceph osd stop put the cluster into a state that is
irrecoverable with the normal CLI commands available. We even looked
into manually manipulating the osdmap via osdmaptool, but there does
not seem to be a way to edit the stopped status, and it would be a very
invasive procedure in any case. The only way out we can see is
rebuilding all the affected OSDs, which we have refrained from for now.
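For completeness, this is roughly how we inspected the map offline
(standard tooling; /tmp/osdmap is just an example path):

ceph osd getmap -o /tmp/osdmap    # export the current osdmap
osdmaptool /tmp/osdmap --print | grep 'exists,stop'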
Kind Regards
Hanreich Stefan
[1] https://github.com/ceph/ceph/blob/63a77b2c5b683cb241f865daec92c046152175b4/src/osd/OSD.cc#L8240
[2] https://github.com/ceph/ceph/blob/63a77b2c5b683cb241f865daec92c046152175b4/src/osd/OSDMap.cc#L2353
[3] https://github.com/ceph/ceph/pull/43664
[4] https://github.com/ceph/ceph/commit/5dbae13ce0f5b0104ab43e0ccfe94f832d0e1268