On 2018/01/25 2:03 pm, Andre Goree wrote:
Yesterday I noticed some OSDs were missing from our cluster (96 OSDs
total, but only 84 up/84 in were showing).
After drilling down to determine which node was affected and what the
cause was, I found that all the OSDs on that node (12 total) were in
fact down.
I ran 'systemctl status ceph-osd@$osd_number' to determine exactly
why they were down, and came up with:
Fail to open '/proc/0/cmdline' error = (2) No such file or directory
received signal: Interrupt from PID: 0 task name: <unknown> UID: 0
osd.72 1067 *** Got signal Interrupt ***
osd.72 1067 shutdown
This happened on all twelve OSDs (osd.72-osd.83). On four of them it
happened the previous evening around 9pm EST, and on the other eight
at roughly 2am EST the morning I discovered the issue (discovered
around 9am EST).
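In case anyone wants to do the same check, something like the
following is the quickest way I know to pull the times and reasons for
all twelve OSDs at once -- just a rough sketch, assuming the stock
systemd units (ceph-osd@N) and journald, with 72-83 being this node's
OSD range:

    # status of a single OSD unit
    systemctl status ceph-osd@72

    # grab the shutdown/interrupt lines for every OSD on the node
    for i in $(seq 72 83); do
        echo "=== osd.$i ==="
        journalctl -u ceph-osd@$i --since "2 days ago" | grep -iE 'signal|shutdown'
    done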
Has anyone ever come across something like this, or does anyone
perhaps know of a fix? This hasn't happened since, but with this being
a newly built-out cluster it was a bit concerning.
Thanks in advance.
Responding from a different email address because Outlook is a PITA.
So it appears this issue _has_ indeed happened again. Viewing the OSD
log, I'm seeing the following (which I've Googled, and which may or
may not be a bug):
2018-01-24 23:54:45.889803 7f4416201700 0 --
172.16.239.21:6808/22069164 >> 172.16.239.19:6806/2031213
conn(0x5586c8bef000 :6808 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0
cs=0 l=0).handle_connect_msg accept connect_seq 0 vs existing csq=0
existing_state=STATE_CONNECTING
2018-01-24 23:55:32.709816 7f44071fe700 0 log_channel(cluster) log
[WRN] : Monitor daemon marked osd.72 down, but it is still running
2018-01-24 23:55:32.709829 7f44071fe700 0 log_channel(cluster) log
[DBG] : map e1809 wrongly marked me down at e1809
2018-01-24 23:55:32.709832 7f44071fe700 0 osd.72 1809
_committed_osd_maps marked down 6 > osd_max_markdown_count 5 in last
600.000000 seconds, shutting down
2018-01-24 23:55:32.709838 7f44071fe700 1 osd.72 1809
start_waiting_for_healthy
2018-01-24 23:55:32.723353 7f44011f2700 1 osd.72 pg_epoch: 1809
pg[1.a58( empty local-lis/les=1803/1804 n=0 ec=675/675 lis/c 1803/1803
les/c/f 1804/1804/0 1809/1809/1653) [70,85] r=-1 lpr=1809
pi=[1803,1809)/1 crt=0'0 active] start_peering_interval up [70,85,72] ->
[70,85], acting [70,85,72] -> [70,85], acting_primary 70 -> 70,
up_primary 70 -> 70, role 2 -> -1,
features acting 2305244844532236283 upacting 2305244844532236283
...
...
...
2018-01-24 23:55:32.746608 7f44009f1700 1 osd.72 pg_epoch: 1809
pg[1.9b( empty local-lis/les=1803/1804 n=0 ec=671/671 lis/c 1803/1803
les/c/f 1804/1804/0 1809/1809/671) [62,95] r=-1 lpr=1809
pi=[1803,1809)/1 crt=0'0 unknown NOTIFY] state<Start>: transitioning to
Stray
2018-01-24 23:55:32.746708 7f44071fe700 0 osd.72 1809
_committed_osd_maps shutdown OSD via async signal
2018-01-24 23:55:32.746794 7f43f89e1700 -1 Fail to open
'/proc/0/cmdline' error = (2) No such file or directory
2018-01-24 23:55:32.746814 7f43f89e1700 -1 received signal: Interrupt
from PID: 0 task name: <unknown> UID: 0
2018-01-24 23:55:32.746818 7f43f89e1700 -1 osd.72 1809 *** Got signal
Interrupt ***
2018-01-24 23:55:32.746824 7f43f89e1700 0 osd.72 1809 prepare_to_stop
starting shutdown
2018-01-24 23:55:32.746827 7f43f89e1700 -1 osd.72 1809 shutdown
2018-01-24 23:55:34.753347 7f43f89e1700 1
bluestore(/var/lib/ceph/osd/ceph-72) umount
2018-01-24 23:55:34.871899 7f43f89e1700 1 stupidalloc shutdown
2018-01-24 23:55:34.871913 7f43f89e1700 1 freelist shutdown
2018-01-24 23:55:34.871956 7f43f89e1700 4 rocksdb:
[/build/ceph-12.2.2/src/rocksdb/db/db_impl.cc:217] Shutdown: canceling
all background work
2018-01-24 23:55:34.877019 7f43f89e1700 4 rocksdb:
[/build/ceph-12.2.2/src/rocksdb/db/db_impl.cc:343] Shutdown complete
2018-01-24 23:55:34.877245 7f43f89e1700 1 bluefs umount
2018-01-24 23:55:34.877254 7f43f89e1700 1 stupidalloc shutdown
2018-01-24 23:55:34.877256 7f43f89e1700 1 stupidalloc shutdown
2018-01-24 23:55:34.877257 7f43f89e1700 1 stupidalloc shutdown
2018-01-24 23:55:34.877296 7f43f89e1700 1 bdev(0x5586c3e73440
/var/lib/ceph/osd/ceph-72/block.wal) close
2018-01-24 23:55:35.148199 7f43f89e1700 1 bdev(0x5586c3e72fc0
/var/lib/ceph/osd/ceph-72/block.db) close
2018-01-24 23:55:35.376184 7f43f89e1700 1 bdev(0x5586c3e73200
/var/lib/ceph/osd/ceph-72/block) close
2018-01-24 23:55:35.556147 7f43f89e1700 1 bdev(0x5586c3e72d80
/var/lib/ceph/osd/ceph-72/block) close
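The "_committed_osd_maps marked down 6 > osd_max_markdown_count 5"
line appears to be why the daemon shut itself down: the monitors
marked the OSD down more than osd_max_markdown_count times within
osd_max_markdown_period, so it gave up and exited. For reference,
here's a rough sketch of how to look at (and, as a band-aid,
temporarily raise) those thresholds on Luminous -- the value 10 below
is only illustrative, not a recommendation:

    # current values, via the OSD's admin socket (run on the OSD host)
    ceph daemon osd.72 config get osd_max_markdown_count
    ceph daemon osd.72 config get osd_max_markdown_period

    # raise the limit at runtime on that OSD (does not persist across restarts)
    ceph tell osd.72 injectargs '--osd_max_markdown_count 10'

    # or persist it in ceph.conf under [osd] and restart the daemons:
    #   [osd]
    #   osd_max_markdown_count = 10

Of course that only hides the symptom -- the real question is why the
monitors keep marking these OSDs down in the first place (heartbeat or
network trouble between this node and the rest of the cluster would be
the usual suspect).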
I found this mailing list post
(http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021275.html)
which pointed to what may be a bug
(http://tracker.ceph.com/issues/20174).
But it's odd that this is happening on ONLY one of our 8 OSD nodes
(each has 12 disks/OSDs). And I can issue a 'systemctl start
ceph-osd@$osd_number' and the OSD starts back up without issue.
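For what it's worth, recovery really is that simple -- roughly what I
do after it happens (again assuming the stock units and this node's
72-83 range):

    # confirm which OSDs the cluster currently sees as down
    ceph osd tree | grep -w down

    # start this node's OSDs again and watch them come back in
    for i in $(seq 72 83); do systemctl start ceph-osd@$i; done
    ceph -w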
Thoughts?
--
Andre Goree
-=-=-=-=-=-
Email - andre at drenet.net
Website - http://blog.drenet.net
PGP key - http://www.drenet.net/pubkey.html
-=-=-=-=-=-
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com