Re: Problem with OSDs

On Jan 21, 2019, at 6:47 AM, Alfredo Deza <adeza@xxxxxxxxxx> wrote:

> When creating an OSD, ceph-volume will capture the ID and the FSID and
> use these to create a systemd unit. When the system boots, it queries
> LVM for devices that match that ID/FSID information.

Thanks Alfredo, I see that now. The unit name comes from the symlink and is passed into the script as %i. I should have spotted that earlier, but at best I would have done a hacky job of recreating the units by hand, so in hindsight I’m glad I did not see it sooner.
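For anyone who finds this thread later: the ID/FSID pair that ceph-volume records is stored as LVM tags (ceph.osd_id / ceph.osd_fsid) on the logical volumes, so you can line those up against the unit instance names (lvm-<id>-<fsid>). A rough sketch of the checks, with no cluster-specific values:

   # Show the ceph.osd_id / ceph.osd_fsid tags ceph-volume stored on each LV
   lvs -o lv_name,vg_name,lv_tags

   # Same information, formatted per OSD
   ceph-volume lvm list

   # Compare against the unit instances systemd knows about
   systemctl list-units --all 'ceph-volume@*'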

> Is it possible you've attempted to create an OSD and then failed, and
> tried again? That would explain why there would be a systemd unit with
> an FSID that doesn't match. By the output, it does look like
> you have an OSD 1, but with a different FSID (467... instead of
> e3b...). You could try to disable the failing systemd unit with:
>
>    systemctl disable ceph-volume@lvm-1-e3bfc69e-a145-4e19-aac2-5f888e1ed2ce.service
>
> (Follow up with OSD 3) and then run:
>
>    ceph-volume lvm activate --all

That worked and recovered startup of all four OSDs on the second node. Out of an abundance of caution, I disabled only one of the units with systemctl disable and then ran ceph-volume lvm activate --all. That cleaned up all of them, though, so there was nothing left to do.
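For completeness, this is roughly how I checked the result afterwards (OSD 1 here is just an example id):

   # The actual daemons run as ceph-osd@<id>; confirm they came up
   systemctl status ceph-osd@1

   # Confirm the cluster agrees they are up/in
   ceph osd tree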

https://bugzilla.redhat.com/show_bug.cgi?id=1567346#c21 helped resolve the final issue in getting to HEALTH_OK. After rebuilding the mon/mgr node, I had not properly cleared/restored the firewall. It’s odd that `ceph osd tree` was reporting that two of the OSDs were up and in when the ports for the mon/mgr/mds were all inaccessible.
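For reference, the fix on the rebuilt node amounted to something like the following. This assumes firewalld (which ships predefined ceph and ceph-mon services); if you manage iptables directly, open the same ports there instead:

   # Monitor port (6789 for msgr1; msgr2 adds 3300 in Nautilus and later)
   firewall-cmd --permanent --add-service=ceph-mon

   # OSD/mgr/mds daemon port range (6800-7300)
   firewall-cmd --permanent --add-service=ceph

   firewall-cmd --reload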

I don’t believe there were any failed creation attempts. Cardinal process rule with filesystems: always maintain a known-good state that can be rolled back to, and if an error comes up that can’t be fully explained, roll back and restart. Sometimes a command gets missed by even the best of fingers and fully caffeinated minds... :) I do see that I didn’t run `ceph osd purge` on the empty/downed OSDs that were gracefully marked `out`; that explains the tree showing the even-numbered OSDs on the rebuilt node. After purging the references to the empty OSDs and re-adding the volumes (rough commands below), I am back to full health with all devices and OSDs up/in.
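In case it saves someone a search, the cleanup was along these lines (the id and device path are examples, not my exact values):

   # Remove all traces of a stale OSD from the cluster maps
   ceph osd purge 2 --yes-i-really-mean-it

   # Recreate it on the freed device; ceph-volume tags the new LV
   # with the id/fsid and enables the matching systemd unit
   ceph-volume lvm create --data /dev/sdb

   # Everything should now be up/in and heading back to HEALTH_OK
   ceph osd tree
   ceph -s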

THANK YOU!!! :D
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
