Re: OSDs missing from cluster all from one node

On 2018/01/25 2:03 pm, Andre Goree wrote:
Yesterday I noticed some OSDs were missing from our cluster (96 OSDs
total, but only 84 up / 84 in were shown).

After drilling down to determine which node and the cause, I found
that all the OSDs on that node (12 total) were in fact down.

I entered 'systemctl status ceph-osd@$osd_number' to determine exactly
why they were down, and came up with:
Fail to open '/proc/0/cmdline' error = (2) No such file or directory
received  signal: Interrupt from  PID: 0 task name: <unknown> UID: 0
osd.72 1067 *** Got signal Interrupt ***
osd.72 1067 shutdown

This happened on all twelve OSDs (osd.72-osd.83).  On four of them it
happened the previous evening around 9pm EST, and on the other eight at
roughly 2am EST on the morning I discovered the issue (discovered
around 9am EST).

Has anyone ever come across something like this, or perhaps know of a
fix?  It hasn't happened again since, but as this is a newly built-out
cluster it was a bit concerning.

Thanks in advance.


Responding from a different email address because Outlook is a PITA.

So it appears this issue _has_ indeed happened again. Looking at the OSD log, I'm seeing the following (which I've Googled, and which may or may not be a bug):

2018-01-24 23:54:45.889803 7f4416201700 0 -- 172.16.239.21:6808/22069164 >> 172.16.239.19:6806/2031213 conn(0x5586c8bef000 :6808 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 0 vs existing csq=0 existing_state=STATE_CONNECTING
2018-01-24 23:55:32.709816 7f44071fe700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.72 down, but it is still running
2018-01-24 23:55:32.709829 7f44071fe700 0 log_channel(cluster) log [DBG] : map e1809 wrongly marked me down at e1809
2018-01-24 23:55:32.709832 7f44071fe700 0 osd.72 1809 _committed_osd_maps marked down 6 > osd_max_markdown_count 5 in last 600.000000 seconds, shutting down
2018-01-24 23:55:32.709838 7f44071fe700 1 osd.72 1809 start_waiting_for_healthy
2018-01-24 23:55:32.723353 7f44011f2700 1 osd.72 pg_epoch: 1809 pg[1.a58( empty local-lis/les=1803/1804 n=0 ec=675/675 lis/c 1803/1803 les/c/f 1804/1804/0 1809/1809/1653) [70,85] r=-1 lpr=1809 pi=[1803,1809)/1 crt=0'0 active] start_peering_interval up [70,85,72] -> [70,85], acting [70,85,72] -> [70,85], acting_primary 70 -> 70, up_primary 70 -> 70, role 2 -> -1, features acting 2305244844532236283 upacting 2305244844532236283
...
...
...
2018-01-24 23:55:32.746608 7f44009f1700 1 osd.72 pg_epoch: 1809 pg[1.9b( empty local-lis/les=1803/1804 n=0 ec=671/671 lis/c 1803/1803 les/c/f 1804/1804/0 1809/1809/671) [62,95] r=-1 lpr=1809 pi=[1803,1809)/1 crt=0'0 unknown NOTIFY] state<Start>: transitioning to Stray
2018-01-24 23:55:32.746708 7f44071fe700 0 osd.72 1809 _committed_osd_maps shutdown OSD via async signal
2018-01-24 23:55:32.746794 7f43f89e1700 -1 Fail to open '/proc/0/cmdline' error = (2) No such file or directory
2018-01-24 23:55:32.746814 7f43f89e1700 -1 received signal: Interrupt from PID: 0 task name: <unknown> UID: 0
2018-01-24 23:55:32.746818 7f43f89e1700 -1 osd.72 1809 *** Got signal Interrupt ***
2018-01-24 23:55:32.746824 7f43f89e1700 0 osd.72 1809 prepare_to_stop starting shutdown
2018-01-24 23:55:32.746827 7f43f89e1700 -1 osd.72 1809 shutdown
2018-01-24 23:55:34.753347 7f43f89e1700 1 bluestore(/var/lib/ceph/osd/ceph-72) umount
2018-01-24 23:55:34.871899 7f43f89e1700  1 stupidalloc shutdown
2018-01-24 23:55:34.871913 7f43f89e1700  1 freelist shutdown
2018-01-24 23:55:34.871956 7f43f89e1700 4 rocksdb: [/build/ceph-12.2.2/src/rocksdb/db/db_impl.cc:217] Shutdown: canceling all background work
2018-01-24 23:55:34.877019 7f43f89e1700 4 rocksdb: [/build/ceph-12.2.2/src/rocksdb/db/db_impl.cc:343] Shutdown complete
2018-01-24 23:55:34.877245 7f43f89e1700  1 bluefs umount
2018-01-24 23:55:34.877254 7f43f89e1700  1 stupidalloc shutdown
2018-01-24 23:55:34.877256 7f43f89e1700  1 stupidalloc shutdown
2018-01-24 23:55:34.877257 7f43f89e1700  1 stupidalloc shutdown
2018-01-24 23:55:34.877296 7f43f89e1700 1 bdev(0x5586c3e73440 /var/lib/ceph/osd/ceph-72/block.wal) close
2018-01-24 23:55:35.148199 7f43f89e1700 1 bdev(0x5586c3e72fc0 /var/lib/ceph/osd/ceph-72/block.db) close
2018-01-24 23:55:35.376184 7f43f89e1700 1 bdev(0x5586c3e73200 /var/lib/ceph/osd/ceph-72/block) close
2018-01-24 23:55:35.556147 7f43f89e1700 1 bdev(0x5586c3e72d80 /var/lib/ceph/osd/ceph-72/block) close
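
For what it's worth, the line that actually takes the daemon down is the "_committed_osd_maps marked down 6 > osd_max_markdown_count 5 in last 600.000000 seconds" one, i.e. the OSD shuts itself down after being marked down more than five times inside a ten-minute window. A rough sketch of how one could inspect (and temporarily relax) that limit on the affected node while debugging; the option names are the ones from the log, and the value 10 below is just an example, not a recommendation:

  # run on the OSD's own host, against its admin socket
  ceph daemon osd.72 config get osd_max_markdown_count
  ceph daemon osd.72 config get osd_max_markdown_period

  # runtime-only change, not persisted across restarts
  ceph tell osd.72 injectargs '--osd_max_markdown_count 10'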


I found this mailing list post (http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021275.html) which pointed to what may be a bug (http://tracker.ceph.com/issues/20174).

But it's odd that this is happening on ONLY one of our 8 OSD nodes (each has 12 disks/OSDs). And I can issue a 'systemctl start ceph-osd@$osd_number' and the OSD starts back up without issue.
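
Since it's always the whole node, the quick way to bring them all back is something like this (the 72-83 range obviously only applies to this particular node):

  # start all twelve OSDs on the affected host
  for n in $(seq 72 83); do systemctl start ceph-osd@$n; done

  # should report 96 osds: 96 up, 96 in again once they rejoin
  ceph osd stat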

Thoughts?
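
Also, for what it's worth: as far as I understand, the "Monitor daemon marked osd.72 down, but it is still running" / "wrongly marked me down" pair usually means the monitors were told the OSD was dead because its peers stopped getting heartbeat replies from it, so the first thing I plan to sanity-check is the cluster network on that node, along these lines (the peer address is just one taken from the log above, and the NIC check is generic):

  # can this host still reach a peer OSD host on the cluster network?
  ping -c 3 172.16.239.19

  # any drops/errors accumulating on the cluster-network interface?
  ip -s link show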

--
Andre Goree
-=-=-=-=-=-
Email     - andre at drenet.net
Website   - http://blog.drenet.net
PGP key   - http://www.drenet.net/pubkey.html
-=-=-=-=-=-
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


