I had a working Ceph cluster running Nautilus in a test lab just a few
months ago. Now that I'm trying to take Ceph live on production
hardware, I can't keep the cluster up and available even though all
three OSDs are UP and IN.
I believe the problem is that the OSDs don't mount their volumes after a
reboot. The ceph-deploy routine can install an OSD node, format the
disk, and bring it online, and it gets all the OSDs UP and IN with the
cluster reaching quorum. But once an OSD gets rebooted, all the PGs
related to that OSD go "stuck inactive ... current state unknown, last
acting".
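For reference, these are the commands I've been using to see which PGs
and OSDs are affected; I can post the output if that would help:
----
ceph -s
ceph health detail
ceph osd tree
ceph pg dump_stuck inactive
----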
I've found and resolved all my hostname and firewall errors, and I'm
comfortable that I've ruled out network issues. For grins and giggles, I
reconfigured the OSDs to be on the same 'public' network as the MON
servers, and the OSDs still drop their disks from the cluster after a
reboot.
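In case it matters, the network checks were mostly simple things along
these lines (hostnames are placeholders):
----
ping mon1                       # from each OSD node
ss -tlnp | grep ceph-mon        # MONs listening on 3300/6789
ss -tlnp | grep ceph-osd        # OSDs listening in the 6800-7300 range
----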
What do I need to do next?
A pastebin link with the full log data is at the end of this message;
here's a short excerpt showing one of the traceback errors:
----
[2019-10-30 14:52:10,201][ceph_volume][ERROR ] exception caught by decorator
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ceph_volume/decorators.py", line 59, in newfunc
    return f(*a, **kw)
----
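For what it's worth, these are the checks I can run on an OSD node after
a reboot to see whether its volumes came back (device names and paths
are just whatever ceph-deploy created); happy to post real output from
any of them:
----
ceph-volume lvm list              # what ceph-volume knows about the OSD volumes
lsblk                             # are the LVs/partitions visible?
mount | grep /var/lib/ceph/osd    # are the OSD data dirs mounted?
systemctl --failed | grep ceph    # any failed ceph units after boot?
ceph-volume lvm activate --all    # manual attempt to re-activate the OSDs
----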
Some of these errors might be due to leftovers from three earlier setup
attempts that are no longer present. 'ceph-deploy purge' and
'ceph-deploy purgedata' don't seem to get rid of EVERYTHING. I've since
learned that /var/lib/ceph retains some data, so I'll be sure to remove
that directory the next time I start fresh.
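The cleanup sequence I'm planning for that next attempt looks roughly
like this (node and device names are placeholders; corrections welcome):
----
# on each OSD node, wipe the old OSD volumes first
ceph-volume lvm zap /dev/sdX --destroy
# then from the admin node
ceph-deploy purge node1 node2 node3
ceph-deploy purgedata node1 node2 node3
ceph-deploy forgetkeys
# and back on each node, clear out anything left behind
rm -rf /var/lib/ceph
----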
What do I need to be looking at to correct this "OSD not remounting its
disk" issue?
Full log data: https://pastebin.com/NMXvYBcZ