OSDs won't come back after upgrade

Jorge Garcia <jgarcia@xxxxxxxxxxxx> · Wed, 8 Jan 2025 10:16:17 -0800

Hello,

I'm going down the long and winding road of upgrading our Ceph
clusters from Mimic to the latest version. This has involved slowly
going up one release at a time. I'm now going from Octopus to Pacific,
which also involves upgrading the OS on the host systems from CentOS 7
to Rocky 9.

I first upgraded the monitors and managers, and those upgraded with no
problems. Now I'm upgrading the OSD servers, and I ran into some
issues that kept the first system down for a couple of days. I
finally got it back up and got all the OSDs ready to come back
online. But whenever I try to bring the OSDs up, they run
for a bit and the cluster looks like it is recovering and
catching up, and then the OSDs all go down again. The logs show
messages like:

received  signal: Interrupt from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
osd.10 254568 *** Got signal Interrupt ***
osd.10 254568 *** Immediate shutdown (osd_fast_shutdown=true) ***
osd.10 254568 prepare_to_stop starting shutdown

I found this thread, which seems to describe something similar:
https://www.spinics.net/lists/ceph-users/msg75628.html
The claim there is that the cluster needs to be restarted many times
in order for the OSDs to catch up to the current epoch. I have
restarted the OSDs many times, and now it has reached a point where
there doesn't seem to be any progress. My questions are:

Is this the right solution?
Is there a way of seeing if some progress is happening with the OSDs?
Is there something else I should be trying?
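
To make the "is there progress" question concrete, here is a rough
sketch of the kind of check I have in mind. It assumes that
`ceph daemon osd.N status` (run on the OSD's own host) returns JSON
with a `newest_map` field, and that `ceph osd dump --format json`
reports the cluster's current `epoch`. That is my understanding of the
output, so please verify the field names on your release. The function
names are my own, not part of any Ceph API.

```python
import json
import subprocess


def epochs_behind(osd_status: dict, cluster_epoch: int) -> int:
    """Number of osdmap epochs the OSD still has to catch up on (0 if current)."""
    return max(0, cluster_epoch - int(osd_status["newest_map"]))


def cluster_osdmap_epoch() -> int:
    # Assumption: the top-level "epoch" key is the current osdmap epoch.
    out = subprocess.check_output(["ceph", "osd", "dump", "--format", "json"])
    return int(json.loads(out)["epoch"])


def osd_status(osd_id: int) -> dict:
    # Talks to the OSD's admin socket, so it only works on that OSD's host
    # and only while the daemon is running.
    out = subprocess.check_output(["ceph", "daemon", f"osd.{osd_id}", "status"])
    return json.loads(out)


def report(osd_ids):
    """Print how far each listed OSD lags behind the cluster's osdmap epoch."""
    current = cluster_osdmap_epoch()
    for osd_id in osd_ids:
        status = osd_status(osd_id)
        lag = epochs_behind(status, current)
        print(f"osd.{osd_id}: state={status.get('state')} "
              f"newest_map={status['newest_map']} behind={lag}")
```

Calling `report([10])` after each restart and seeing the "behind"
number shrink would suggest the OSD is catching up; a number that
stays the same would suggest it is stuck.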

Thanks for any help!

Jorge