I have an update on the topic "OSD's fail to start after power loss". We
have fixed the issue. During our last "apt upgrade" procedure about 90
days ago, the package python-pkg-resources was removed by "apt
autoremove" after the OSD host was rebooted. Running ceph-volume
manually showed that the pkg_resources module was missing:
root@osd3:/root/# ceph-volume lvm activate --all
Traceback (most recent call last):
  File "/usr/sbin/ceph-volume", line 6, in <module>
    from pkg_resources import load_entry_point
ImportError: No module named pkg_resources
After installing python-pkg-resources, the above command succeeded, and
all 12 OSDs are now active in the cluster.
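For anyone who hits this later, here is roughly what we ran, plus a
guard so a future "apt autoremove" doesn't take the package out again
(the apt-mark step is a suggestion, not something we needed for the
fix; the package name is as shipped on Ubuntu 18.04):

# Reinstall the module that autoremove removed
apt install python-pkg-resources
# Mark it manually installed so autoremove leaves it alone
apt-mark manual python-pkg-resources
# Re-run activation for all OSDs on this host
ceph-volume lvm activate --all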
And Dominic, to answer your questions, I am running Ceph 14.2.2
(Nautilus) on Ubuntu 18.04. I used ceph-deploy to install the cluster.
The tmpfs directories /var/lib/ceph/osd/ceph-* were not mounted due to
the missing pkg_resources module, which caused the keyrings to be
unavailable.
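A quick sanity check after activation, in case it helps anyone else,
using OSD 31 from the logs below as an example (adjust the id for your
host):

# tmpfs mounts should reappear under /var/lib/ceph/osd/
findmnt -t tmpfs | grep /var/lib/ceph/osd
# the keyring should exist again, owned by ceph:ceph
ls -l /var/lib/ceph/osd/ceph-31/keyring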
Thanks Everyone,
Todd
On 10/13/21 4:21 PM, DHilsbos@xxxxxxxxxxxxxx wrote:
Todd;
What version of ceph are you running? Are you running containers or
packages? Was the cluster installed manually, or using a deployment tool?
Logs provided are for osd ID 31, is ID 31 appropriate for that server?
Have you verified that the ceph.conf on that server is intact, and
correct?
Your log snippet references /var/lib/ceph/osd/ceph-31/keyring; does
this file exist? Does the /var/lib/ceph/osd/ceph-31/ folder exist? If
both exist, are the ownership and permissions correct / appropriate?
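For example, something like this should show whether both exist and
whether they are owned by ceph:ceph (the default for package installs):

ls -ld /var/lib/ceph/osd/ceph-31
ls -l /var/lib/ceph/osd/ceph-31/keyring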
Thank you,
Dominic L. Hilsbos, MBA
Vice President - Information Technology
Perform Air International Inc.
DHilsbos@xxxxxxxxxxxxxx
www.PerformAir.com
-----Original Message-----
From: Orbiting Code, Inc. [mailto:support@xxxxxxxxxxxxxxxx]
Sent: Wednesday, October 13, 2021 7:21 AM
To: ceph-users@xxxxxxx
Subject: OSD's fail to start after power loss
Hello Everyone,
I have 3 OSD hosts with 12 OSDs each. After a power failure on one
host, all 12 OSDs on that host fail to start. The other two hosts did
not lose power and are functioning. Obviously I don't want to restart
the working hosts at this time. Syslog shows:
Oct 12 17:24:07 osd3 systemd[1]: ceph-volume@lvm-31-cae13d9a-1d3d-4003-a57f-6ffac21a682e.service: Main process exited, code=exited, status=1/FAILURE
Oct 12 17:24:07 osd3 systemd[1]: ceph-volume@lvm-31-cae13d9a-1d3d-4003-a57f-6ffac21a682e.service: Failed with result 'exit-code'.
Oct 12 17:24:07 osd3 systemd[1]: Failed to start Ceph Volume activation: lvm-31-cae13d9a-1d3d-4003-a57f-6ffac21a682e.
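If it helps, the full output for one of the failed units can be pulled
with journalctl (unit name taken from the syslog above):

journalctl -u ceph-volume@lvm-31-cae13d9a-1d3d-4003-a57f-6ffac21a682e.service --no-pager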
This is repeated for all 12 OSDs on the failed host. Running the
following command shows additional errors:
root@osd3:/var/log# /usr/bin/ceph-osd -f --cluster ceph --id 31
--setuser ceph --setgroup ceph
2021-10-12 17:50:23.117 7fce92e6ac00 -1 auth: unable to find a keyring
on /var/lib/ceph/osd/ceph-31/keyring: (2) No such file or directory
2021-10-12 17:50:23.117 7fce92e6ac00 -1 AuthRegistry(0x55c4ec50aa40) no
keyring found at /var/lib/ceph/osd/ceph-31/keyring, disabling cephx
2021-10-12 17:50:23.117 7fce92e6ac00 -1 auth: unable to find a keyring
on /var/lib/ceph/osd/ceph-31/keyring: (2) No such file or directory
2021-10-12 17:50:23.117 7fce92e6ac00 -1 AuthRegistry(0x7ffe9b64eb08) no
keyring found at /var/lib/ceph/osd/ceph-31/keyring, disabling cephx
failed to fetch mon config (--no-mon-config to skip)
No tmpfs mounts exist for any directories in /var/lib/ceph/osd/ceph-*
Any assistance with this situation would be greatly appreciated.
Thank you,
Todd
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx