Re: OSDs missing from cluster all from one node

On 2018/01/25 2:03 pm, Andre Goree wrote:
Yesterday I noticed some OSDs were missing from our cluster (96 OSDs
total, but only 84 up / 84 in were shown).

After drilling down to determine which node and the cause, I found
that all the OSDs on that node (12 total) were in fact down.

I entered 'systemctl status ceph-osd@$osd_number' to determine exactly
why they were down, and came up with:
Fail to open '/proc/0/cmdline' error = (2) No such file or directory
received  signal: Interrupt from  PID: 0 task name: <unknown> UID: 0
osd.72 1067 *** Got signal Interrupt ***
osd.72 1067 shutdown

This happened on all twelve OSDs (osd.72-osd.83).  On four of them it
happened the previous evening around 9pm EST, and on the other eight at
roughly 2am EST on the morning I discovered the issue (discovered
around 9am EST).

Has anyone ever come across something like this, or perhaps know of a
fix?  It hasn't happened again since, but as this is a newly built-out
cluster it was a bit concerning.

Thanks in advance.


Responding from a different email address because Outlook is a PITA.

So it appears this issue _has_ indeed happened again. Looking at the OSD log, I'm seeing the following (which I've Googled, and which may or may not be a bug):

2018-01-24 23:54:45.889803 7f4416201700 0 -- 172.16.239.21:6808/22069164 >> 172.16.239.19:6806/2031213 conn(0x5586c8bef000 :6808 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 0 vs existing csq=0 existing_state=STATE_CONNECTING
2018-01-24 23:55:32.709816 7f44071fe700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.72 down, but it is still running
2018-01-24 23:55:32.709829 7f44071fe700 0 log_channel(cluster) log [DBG] : map e1809 wrongly marked me down at e1809
2018-01-24 23:55:32.709832 7f44071fe700 0 osd.72 1809 _committed_osd_maps marked down 6 > osd_max_markdown_count 5 in last 600.000000 seconds, shutting down
2018-01-24 23:55:32.709838 7f44071fe700 1 osd.72 1809 start_waiting_for_healthy
2018-01-24 23:55:32.723353 7f44011f2700 1 osd.72 pg_epoch: 1809 pg[1.a58( empty local-lis/les=1803/1804 n=0 ec=675/675 lis/c 1803/1803 les/c/f 1804/1804/0 1809/1809/1653) [70,85] r=-1 lpr=1809 pi=[1803,1809)/1 crt=0'0 active] start_peering_interval up [70,85,72] -> [70,85], acting [70,85,72] -> [70,85], acting_primary 70 -> 70, up_primary 70 -> 70, role 2 -> -1, features acting 2305244844532236283 upacting 2305244844532236283
...
...
...
2018-01-24 23:55:32.746608 7f44009f1700 1 osd.72 pg_epoch: 1809 pg[1.9b( empty local-lis/les=1803/1804 n=0 ec=671/671 lis/c 1803/1803 les/c/f 1804/1804/0 1809/1809/671) [62,95] r=-1 lpr=1809 pi=[1803,1809)/1 crt=0'0 unknown NOTIFY] state<Start>: transitioning to Stray
2018-01-24 23:55:32.746708 7f44071fe700 0 osd.72 1809 _committed_osd_maps shutdown OSD via async signal
2018-01-24 23:55:32.746794 7f43f89e1700 -1 Fail to open '/proc/0/cmdline' error = (2) No such file or directory
2018-01-24 23:55:32.746814 7f43f89e1700 -1 received signal: Interrupt from PID: 0 task name: <unknown> UID: 0
2018-01-24 23:55:32.746818 7f43f89e1700 -1 osd.72 1809 *** Got signal Interrupt ***
2018-01-24 23:55:32.746824 7f43f89e1700 0 osd.72 1809 prepare_to_stop starting shutdown
2018-01-24 23:55:32.746827 7f43f89e1700 -1 osd.72 1809 shutdown
2018-01-24 23:55:34.753347 7f43f89e1700 1 bluestore(/var/lib/ceph/osd/ceph-72) umount
2018-01-24 23:55:34.871899 7f43f89e1700  1 stupidalloc shutdown
2018-01-24 23:55:34.871913 7f43f89e1700  1 freelist shutdown
2018-01-24 23:55:34.871956 7f43f89e1700 4 rocksdb: [/build/ceph-12.2.2/src/rocksdb/db/db_impl.cc:217] Shutdown: canceling all background work
2018-01-24 23:55:34.877019 7f43f89e1700 4 rocksdb: [/build/ceph-12.2.2/src/rocksdb/db/db_impl.cc:343] Shutdown complete
2018-01-24 23:55:34.877245 7f43f89e1700  1 bluefs umount
2018-01-24 23:55:34.877254 7f43f89e1700  1 stupidalloc shutdown
2018-01-24 23:55:34.877256 7f43f89e1700  1 stupidalloc shutdown
2018-01-24 23:55:34.877257 7f43f89e1700  1 stupidalloc shutdown
2018-01-24 23:55:34.877296 7f43f89e1700 1 bdev(0x5586c3e73440 /var/lib/ceph/osd/ceph-72/block.wal) close
2018-01-24 23:55:35.148199 7f43f89e1700 1 bdev(0x5586c3e72fc0 /var/lib/ceph/osd/ceph-72/block.db) close
2018-01-24 23:55:35.376184 7f43f89e1700 1 bdev(0x5586c3e73200 /var/lib/ceph/osd/ceph-72/block) close
2018-01-24 23:55:35.556147 7f43f89e1700 1 bdev(0x5586c3e72d80 /var/lib/ceph/osd/ceph-72/block) close
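
For what it's worth, the line that actually takes the daemon down is the "_committed_osd_maps marked down 6 > osd_max_markdown_count 5 in last 600.000000 seconds" one, i.e. the OSD shuts itself down after being marked down more than five times inside a ten-minute window. A rough sketch of how one could inspect (and temporarily relax) that limit on the affected node while debugging; the option names are the ones from the log, and the value 10 below is just an example, not a recommendation:

  # run on the OSD's own host, against its admin socket
  ceph daemon osd.72 config get osd_max_markdown_count
  ceph daemon osd.72 config get osd_max_markdown_period

  # runtime-only change, not persisted across restarts
  ceph tell osd.72 injectargs '--osd_max_markdown_count 10'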


I found this mailing list post (http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021275.html) which pointed to what may be a bug (http://tracker.ceph.com/issues/20174).

But it's odd that this is happening on ONLY one of our 8 OSD nodes (each has 12 disks/OSDs). And I can issue a 'systemctl start ceph-osd@$osd_number' and the OSD starts back up without issue.
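
Since it's always the whole node, the quick way to bring them all back is something like this (the 72-83 range obviously only applies to this particular node):

  # start all twelve OSDs on the affected host
  for n in $(seq 72 83); do systemctl start ceph-osd@$n; done

  # should report 96 osds: 96 up, 96 in again once they rejoin
  ceph osd stat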

Thoughts?
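
Also, for what it's worth: as far as I understand, the "Monitor daemon marked osd.72 down, but it is still running" / "wrongly marked me down" pair usually means the monitors were told the OSD was dead because its peers stopped getting heartbeat replies from it, so the first thing I plan to sanity-check is the cluster network on that node, along these lines (the peer address is just one taken from the log above, and the NIC check is generic):

  # can this host still reach a peer OSD host on the cluster network?
  ping -c 3 172.16.239.19

  # any drops/errors accumulating on the cluster-network interface?
  ip -s link show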

--
Andre Goree
-=-=-=-=-=-
Email     - andre at drenet.net
Website   - http://blog.drenet.net
PGP key   - http://www.drenet.net/pubkey.html
-=-=-=-=-=-
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


