I have an update on the topic "OSD's fail to start after power loss". We
have fixed the issue. During our last "apt upgrade" procedure about 90
days ago, the package python-pkg-resources was removed by "apt
autoremove" after the OSD host was rebooted. Running ceph-volume
manually showed that the pkg_resources module was missing:
root@osd3:/root/# ceph-volume lvm activate --all
Traceback (most recent call last):
  File "/usr/sbin/ceph-volume", line 6, in <module>
    from pkg_resources import load_entry_point
ImportError: No module named pkg_resources
After installing python-pkg-resources, the above command succeeded, and
all 12 OSDs are now active in the cluster.
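For anyone who hits this later, here is roughly what we ran, plus a
guard so a future "apt autoremove" doesn't take the package out again
(the apt-mark step is a suggestion, not something we needed for the
fix; the package name is as shipped on Ubuntu 18.04):

# Reinstall the module that autoremove removed
apt install python-pkg-resources
# Mark it manually installed so autoremove leaves it alone
apt-mark manual python-pkg-resources
# Re-run activation for all OSDs on this host
ceph-volume lvm activate --all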
And Dominic, to answer your questions, I am running Ceph 14.2.2
(Nautilus) on Ubuntu 18.04. I used ceph-deploy to install the cluster.
The tmpfs directories /var/lib/ceph/osd/ceph-* were not mounted due to
the missing pkg_resources module, which caused the keyrings to be
unavailable.
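A quick sanity check after activation, in case it helps anyone else,
using OSD 31 from the logs below as an example (adjust the id for your
host):

# tmpfs mounts should reappear under /var/lib/ceph/osd/
findmnt -t tmpfs | grep /var/lib/ceph/osd
# the keyring should exist again, owned by ceph:ceph
ls -l /var/lib/ceph/osd/ceph-31/keyring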
Thanks Everyone,
Todd
On 10/13/21 4:21 PM, DHilsbos@xxxxxxxxxxxxxx wrote:
Todd;
What version of ceph are you running? Are you running containers or
packages? Was the cluster installed manually, or using a deployment tool?
Logs provided are for osd ID 31, is ID 31 appropriate for that server?
Have you verified that the ceph.conf on that server is intact, and
correct?
Your log snippet references /var/lib/ceph/osd/ceph-31/keyring; does
this file exist? Does the /var/lib/ceph/osd/ceph-31/ folder exist? If
both exist, are the ownership and permissions correct / appropriate?
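For example, something like this should show whether both exist and
whether they are owned by ceph:ceph (the default for package installs):

ls -ld /var/lib/ceph/osd/ceph-31
ls -l /var/lib/ceph/osd/ceph-31/keyring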
Thank you,
Dominic L. Hilsbos, MBA
Vice President - Information Technology
Perform Air International Inc.
DHilsbos@xxxxxxxxxxxxxx
www.PerformAir.com
-----Original Message-----
From: Orbiting Code, Inc. [mailto:support@xxxxxxxxxxxxxxxx]
Sent: Wednesday, October 13, 2021 7:21 AM
To: ceph-users@xxxxxxx
Subject: OSD's fail to start after power loss
Hello Everyone,
I have 3 OSD hosts with 12 OSDs each. After a power failure on one
host, all 12 OSDs on that host fail to start. The other two hosts did
not lose power and are functioning. Obviously I don't want to restart
the working hosts at this time. Syslog shows:
Oct 12 17:24:07 osd3 systemd[1]: ceph-volume@lvm-31-cae13d9a-1d3d-4003-a57f-6ffac21a682e.service: Main process exited, code=exited, status=1/FAILURE
Oct 12 17:24:07 osd3 systemd[1]: ceph-volume@lvm-31-cae13d9a-1d3d-4003-a57f-6ffac21a682e.service: Failed with result 'exit-code'.
Oct 12 17:24:07 osd3 systemd[1]: Failed to start Ceph Volume activation: lvm-31-cae13d9a-1d3d-4003-a57f-6ffac21a682e.
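If it helps, the full output for one of the failed units can be pulled
with journalctl (unit name taken from the syslog above):

journalctl -u ceph-volume@lvm-31-cae13d9a-1d3d-4003-a57f-6ffac21a682e.service --no-pager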
This is repeated for all 12 OSDs on the failed host. Running the
following command shows additional errors:
root@osd3:/var/log# /usr/bin/ceph-osd -f --cluster ceph --id 31
--setuser ceph --setgroup ceph
2021-10-12 17:50:23.117 7fce92e6ac00 -1 auth: unable to find a keyring
on /var/lib/ceph/osd/ceph-31/keyring: (2) No such file or directory
2021-10-12 17:50:23.117 7fce92e6ac00 -1 AuthRegistry(0x55c4ec50aa40) no
keyring found at /var/lib/ceph/osd/ceph-31/keyring, disabling cephx
2021-10-12 17:50:23.117 7fce92e6ac00 -1 auth: unable to find a keyring
on /var/lib/ceph/osd/ceph-31/keyring: (2) No such file or directory
2021-10-12 17:50:23.117 7fce92e6ac00 -1 AuthRegistry(0x7ffe9b64eb08) no
keyring found at /var/lib/ceph/osd/ceph-31/keyring, disabling cephx
failed to fetch mon config (--no-mon-config to skip)
No tmpfs mounts exist for any directories in /var/lib/ceph/osd/ceph-*
Any assistance with this situation would be greatly appreciated.
Thank you,
Todd
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx