OSDs missing from cluster all from one node

Andre Goree <agoree@xxxxxxxxxxxxxxxxxx> · Thu, 25 Jan 2018 19:03:05 +0000

Yesterday I noticed some OSDs were missing from our cluster (96 OSDs total, but the status showed only 84 up / 84 in).

After drilling down to determine which node was affected and why, I found that all 12 OSDs on that node were in fact down.
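
For reference, a minimal sketch of one way to pick the down OSDs out of 'ceph osd tree' output. The helper name down_osds is hypothetical, and it assumes the status word ("up" or "down") appears in the field right after the osd.N name, which can vary between Ceph releases:

```shell
# Hypothetical helper: print the names of OSDs whose status reads "down".
# Reads `ceph osd tree` style text on stdin. Assumes the status field
# directly follows the "osd.N" name field.
down_osds() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i ~ /^osd\.[0-9]+$/ && $(i + 1) == "down")
                print $i
    }'
}

# Usage on a live cluster (not run here):
#   ceph osd tree | down_osds
```

Because 'ceph osd tree' groups OSDs under their host bucket, reading the full output around the matches should also show which node they belong to.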

I ran 'systemctl status ceph-osd@$osd_number' to determine exactly why they were down, and found the following in the output:
Fail to open '/proc/0/cmdline' error = (2) No such file or directory
received  signal: Interrupt from  PID: 0 task name: <unknown> UID: 0
osd.72 1067 *** Got signal Interrupt ***
osd.72 1067 shutdown

This happened on all twelve OSDs (osd.72-osd.83). Four went down the previous evening around 9pm EST, and the other eight went down at roughly 2am EST on the morning I discovered the issue (around 9am EST).
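
A rough sketch of how the same check could be repeated across all twelve units to line up the timestamps. The helper name signal_lines is hypothetical, the --since date is an assumed example, and the match patterns are taken only from the log lines quoted above:

```shell
# Hypothetical helper: keep only the lines that record a signal being
# delivered to an OSD. Reads log text on stdin. The patterns come from
# the excerpt in this post and may not cover other message formats.
signal_lines() {
    grep -E 'received +signal|Got signal'
}

# Usage on the affected node (not run here). IDs 72-83 are the OSDs
# from this report; the --since date is an assumption:
#   for id in $(seq 72 83); do
#       journalctl -u "ceph-osd@${id}" --since "2018-01-23" --no-pager | signal_lines
#   done
```

With the default journalctl output, each matching line carries its own timestamp, so the 9pm and 2am groups should be easy to tell apart.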

Has anyone come across something like this, or does anyone know of a fix? It hasn't happened since, but with this being a newly built-out cluster, it was a bit concerning.

Thanks in advance.
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com