On Fri, Jan 26, 2018 at 5:47 AM, Andre Goree <andre@xxxxxxxxxx> wrote:
> On 2018/01/25 2:03 pm, Andre Goree wrote:
>>
>> Yesterday I noticed some OSDs were missing from our cluster (96 OSDs
>> total; 84 up / 84 in is what showed).
>>
>> After drilling down to determine which node and the cause, I found
>> that all the OSDs on that node (12 total) were in fact down.
>>
>> I ran 'systemctl status ceph-osd@$osd_number' to determine exactly
>> why they were down, and came up with:
>>
>> Fail to open '/proc/0/cmdline' error = (2) No such file or directory
>> received signal: Interrupt from PID: 0 task name: <unknown> UID: 0
>> osd.72 1067 *** Got signal Interrupt ***
>> osd.72 1067 shutdown
>>
>> This happened on all twelve OSDs (osd.72-osd.83). On four of them it
>> happened the previous evening around 9pm EST; on the other eight it
>> happened at roughly 2am EST the morning I discovered the issue
>> (discovered around 9am EST).
>>
>> Has anyone ever come across something like this, or perhaps know of a
>> fix? This hasn't happened since, but this being a newly built-out
>> cluster it was a bit concerning.
>>
>> Thanks in advance.
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> Responding from a different email address because Outlook is a PITA.
>
> So it appears this issue _has_ indeed happened again.
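[A quick way to confirm the 9pm/2am grouping Andre describes is to pull the timestamps of the "Got signal Interrupt" events out of each OSD's log. A minimal sketch, with one sample line from the report above inlined so it is self-contained; on a real node you would point LOG at /var/log/ceph/ceph-osd.$id.log for each id 72-83 instead:]

```shell
# Extract the timestamp of each "Got signal Interrupt" event from an OSD log.
# LOG would normally be /var/log/ceph/ceph-osd.72.log; here a sample line
# from the report above is inlined for illustration.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
2018-01-24 23:55:32.746818 7f43f89e1700 -1 osd.72 1809 *** Got signal Interrupt ***
EOF
grep 'Got signal Interrupt' "$LOG" | awk '{print $1, $2}'
# prints: 2018-01-24 23:55:32.746818
rm -f "$LOG"
```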
> Viewing the OSD log, I'm seeing the following (which I've Googled, and
> which may or may not be a bug):
>
> 2018-01-24 23:54:45.889803 7f4416201700  0 -- 172.16.239.21:6808/22069164 >>
> 172.16.239.19:6806/2031213 conn(0x5586c8bef000 :6808
> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg
> accept connect_seq 0 vs existing csq=0 existing_state=STATE_CONNECTING
> 2018-01-24 23:55:32.709816 7f44071fe700  0 log_channel(cluster) log [WRN] :
> Monitor daemon marked osd.72 down, but it is still running
> 2018-01-24 23:55:32.709829 7f44071fe700  0 log_channel(cluster) log [DBG] :
> map e1809 wrongly marked me down at e1809

It's highly likely this is a network connectivity issue (or your machine is
struggling under load, but that should be obvious to detect).

> 2018-01-24 23:55:32.709832 7f44071fe700  0 osd.72 1809 _committed_osd_maps
> marked down 6 > osd_max_markdown_count 5 in last 600.000000 seconds,
> shutting down
> 2018-01-24 23:55:32.709838 7f44071fe700  1 osd.72 1809
> start_waiting_for_healthy
> 2018-01-24 23:55:32.723353 7f44011f2700  1 osd.72 pg_epoch: 1809 pg[1.a58(
> empty local-lis/les=1803/1804 n=0 ec=675/675 lis/c 1803/1803 les/c/f
> 1804/1804/0 1809/1809/1653) [70,85] r=-1 lpr=1809
> pi=[1803,1809)/1 crt=0'0 active] start_peering_interval up [70,85,72] ->
> [70,85], acting [70,85,72] -> [70,85], acting_primary 70 -> 70, up_primary
> 70 -> 70, role 2 -> -1,
> features acting 2305244844532236283 upacting 2305244844532236283
> ...
> ...
> ...
> 2018-01-24 23:55:32.746608 7f44009f1700  1 osd.72 pg_epoch: 1809 pg[1.9b(
> empty local-lis/les=1803/1804 n=0 ec=671/671 lis/c 1803/1803 les/c/f
> 1804/1804/0 1809/1809/671) [62,95] r=-1 lpr=1809 pi=[1803,1809)/1 crt=0'0
> unknown NOTIFY] state<Start>: transitioning to Stray
> 2018-01-24 23:55:32.746708 7f44071fe700  0 osd.72 1809 _committed_osd_maps
> shutdown OSD via async signal
> 2018-01-24 23:55:32.746794 7f43f89e1700 -1 Fail to open '/proc/0/cmdline'
> error = (2) No such file or directory
> 2018-01-24 23:55:32.746814 7f43f89e1700 -1 received signal: Interrupt from
> PID: 0 task name: <unknown> UID: 0
> 2018-01-24 23:55:32.746818 7f43f89e1700 -1 osd.72 1809 *** Got signal
> Interrupt ***
> 2018-01-24 23:55:32.746824 7f43f89e1700  0 osd.72 1809 prepare_to_stop
> starting shutdown
> 2018-01-24 23:55:32.746827 7f43f89e1700 -1 osd.72 1809 shutdown
> 2018-01-24 23:55:34.753347 7f43f89e1700  1
> bluestore(/var/lib/ceph/osd/ceph-72) umount
> 2018-01-24 23:55:34.871899 7f43f89e1700  1 stupidalloc shutdown
> 2018-01-24 23:55:34.871913 7f43f89e1700  1 freelist shutdown
> 2018-01-24 23:55:34.871956 7f43f89e1700  4 rocksdb:
> [/build/ceph-12.2.2/src/rocksdb/db/db_impl.cc:217] Shutdown: canceling all
> background work
> 2018-01-24 23:55:34.877019 7f43f89e1700  4 rocksdb:
> [/build/ceph-12.2.2/src/rocksdb/db/db_impl.cc:343] Shutdown complete
> 2018-01-24 23:55:34.877245 7f43f89e1700  1 bluefs umount
> 2018-01-24 23:55:34.877254 7f43f89e1700  1 stupidalloc shutdown
> 2018-01-24 23:55:34.877256 7f43f89e1700  1 stupidalloc shutdown
> 2018-01-24 23:55:34.877257 7f43f89e1700  1 stupidalloc shutdown
> 2018-01-24 23:55:34.877296 7f43f89e1700  1 bdev(0x5586c3e73440
> /var/lib/ceph/osd/ceph-72/block.wal) close
> 2018-01-24 23:55:35.148199 7f43f89e1700  1 bdev(0x5586c3e72fc0
> /var/lib/ceph/osd/ceph-72/block.db) close
> 2018-01-24 23:55:35.376184 7f43f89e1700  1 bdev(0x5586c3e73200
> /var/lib/ceph/osd/ceph-72/block) close
> 2018-01-24 23:55:35.556147 7f43f89e1700  1 bdev(0x5586c3e72d80
> /var/lib/ceph/osd/ceph-72/block) close
>
> I found this mailing list post
> (http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021275.html),
> which pointed to what may be a bug (http://tracker.ceph.com/issues/20174).
>
> But it's odd that this is happening on ONLY one of our 8 OSD nodes (each
> has 12 disks/OSDs). And I can issue a 'systemctl start
> ceph-osd@$osd_number' and it starts back up without issue.
>
> Thoughts?
>
> --
> Andre Goree
> -=-=-=-=-=-
> Email   - andre at drenet.net
> Website - http://blog.drenet.net
> PGP key - http://www.drenet.net/pubkey.html
> -=-=-=-=-=-

-- 
Cheers,
Brad
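[To follow up on the network theory: a few checks worth running from the affected node. This is an illustrative sketch only; the interface name is an assumption, and the peer address is taken from the handle_connect_msg line in the log excerpt above, not verified:]

```shell
# Per-NIC error/drop counters on the cluster-network interface
# ("eth1" is an assumed name; substitute the node's actual interface).
ip -s link show dev eth1

# Basic reachability of a peer OSD's cluster address
# (172.16.239.19 appears in the connection log above).
ping -c 5 172.16.239.19

# If flapping keeps shutting OSDs down while investigating, the markdown
# threshold (5 in 600s by default, per the log above) can be raised at
# runtime; revert once the root cause is found.
ceph tell osd.\* injectargs '--osd_max_markdown_count 10'
```

These commands run against a live cluster, so treat them as an operational sketch rather than something to paste blindly; sustained error counters on the cluster-network NIC, or packet loss to peer OSD hosts, would support Brad's connectivity diagnosis.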