I had a working Ceph cluster running Nautilus in a test lab just a few
months ago. Now that I'm trying to take Ceph live on production
hardware, I can't keep the cluster up and available even though all
three OSDs are UP and IN.
I believe the problem is that the OSDs don't mount their volumes after a
reboot. The ceph-deploy routine can install an OSD node, format the
disk, and bring it online, and it gets all the OSDs UP and IN with the
cluster reaching quorum. But once an OSD gets rebooted, all the PGs
related to that OSD go "stuck inactive ... current state unknown, last
acting".
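For reference, these are the commands I've been using to see which PGs
and OSDs are affected; I can post the output if that would help:
----
ceph -s
ceph health detail
ceph osd tree
ceph pg dump_stuck inactive
----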
I've found and resolved all my hostname and firewall errors, and I'm
comfortable that I've ruled out network issues. For grins and giggles, I
reconfigured the OSDs to be on the same 'public' network as the MON
servers, and the OSDs still drop their disks from the cluster after a
reboot.
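In case it matters, the network checks were mostly simple things along
these lines (hostnames are placeholders):
----
ping mon1                       # from each OSD node
ss -tlnp | grep ceph-mon        # MONs listening on 3300/6789
ss -tlnp | grep ceph-osd        # OSDs listening in the 6800-7300 range
----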
What do I need to do next?
A pastebin link with the full log data is at the end of this message;
here's a short excerpt showing one of the traceback errors:
----
[2019-10-30 14:52:10,201][ceph_volume][ERROR ] exception caught by decorator
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ceph_volume/decorators.py", line 59, in newfunc
    return f(*a, **kw)
----
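For what it's worth, these are the checks I can run on an OSD node after
a reboot to see whether its volumes came back (device names and paths
are just whatever ceph-deploy created); happy to post real output from
any of them:
----
ceph-volume lvm list              # what ceph-volume knows about the OSD volumes
lsblk                             # are the LVs/partitions visible?
mount | grep /var/lib/ceph/osd    # are the OSD data dirs mounted?
systemctl --failed | grep ceph    # any failed ceph units after boot?
ceph-volume lvm activate --all    # manual attempt to re-activate the OSDs
----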
Some of these errors might be due to leftovers from three earlier setup
attempts that are no longer present. 'ceph-deploy purge' and
'ceph-deploy purgedata' don't seem to get rid of EVERYTHING. I've since
learned that /var/lib/ceph retains some data, so I'll be sure to remove
that directory the next time I start fresh.
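The cleanup sequence I'm planning for that next attempt looks roughly
like this (node and device names are placeholders; corrections welcome):
----
# on each OSD node, wipe the old OSD volumes first
ceph-volume lvm zap /dev/sdX --destroy
# then from the admin node
ceph-deploy purge node1 node2 node3
ceph-deploy purgedata node1 node2 node3
ceph-deploy forgetkeys
# and back on each node, clear out anything left behind
rm -rf /var/lib/ceph
----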
What do I need to be looking at to correct this "OSD not remounting its
disk" issue?
Full log data: https://pastebin.com/NMXvYBcZ