On Fri, Jan 26, 2018 at 5:47 AM, Andre Goree <andre@xxxxxxxxxx> wrote:
> On 2018/01/25 2:03 pm, Andre Goree wrote:
>>
>> Yesterday I noticed some OSDs were missing from our cluster (96 OSDs
>> total; 84 up / 84 in is what showed).
>>
>> After drilling down to determine which node and the cause, I found
>> that all the OSDs on that node (12 total) were in fact down.
>>
>> I ran 'systemctl status ceph-osd@$osd_number' to determine exactly
>> why they were down, and came up with:
>>
>> Fail to open '/proc/0/cmdline' error = (2) No such file or directory
>> received signal: Interrupt from PID: 0 task name: <unknown> UID: 0
>> osd.72 1067 *** Got signal Interrupt ***
>> osd.72 1067 shutdown
>>
>> This happened on all twelve OSDs (osd.72-osd.83). On four of them it
>> happened the previous evening around 9pm EST; on the other eight it
>> happened at roughly 2am EST the morning I discovered the issue
>> (discovered around 9am EST).
>>
>> Has anyone ever come across something like this, or perhaps know of a
>> fix? This hasn't happened since, but this being a newly built-out
>> cluster it was a bit concerning.
>>
>> Thanks in advance.
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> Responding from a different email address because Outlook is a PITA.
>
> So it appears this issue _has_ indeed happened again.
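[A quick way to confirm the 9pm/2am grouping Andre describes is to pull the timestamps of the "Got signal Interrupt" events out of each OSD's log. A minimal sketch, with one sample line from the report above inlined so it is self-contained; on a real node you would point LOG at /var/log/ceph/ceph-osd.$id.log for each id 72-83 instead:]

```shell
# Extract the timestamp of each "Got signal Interrupt" event from an OSD log.
# LOG would normally be /var/log/ceph/ceph-osd.72.log; here a sample line
# from the report above is inlined for illustration.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
2018-01-24 23:55:32.746818 7f43f89e1700 -1 osd.72 1809 *** Got signal Interrupt ***
EOF
grep 'Got signal Interrupt' "$LOG" | awk '{print $1, $2}'
# prints: 2018-01-24 23:55:32.746818
rm -f "$LOG"
```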
> Viewing the OSD log, I'm seeing the following (which I've Googled, and
> which may or may not be a bug):
>
> 2018-01-24 23:54:45.889803 7f4416201700  0 -- 172.16.239.21:6808/22069164 >>
> 172.16.239.19:6806/2031213 conn(0x5586c8bef000 :6808
> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg
> accept connect_seq 0 vs existing csq=0 existing_state=STATE_CONNECTING
> 2018-01-24 23:55:32.709816 7f44071fe700  0 log_channel(cluster) log [WRN] :
> Monitor daemon marked osd.72 down, but it is still running
> 2018-01-24 23:55:32.709829 7f44071fe700  0 log_channel(cluster) log [DBG] :
> map e1809 wrongly marked me down at e1809

It's highly likely this is a network connectivity issue (or your machine is
struggling under load, but that should be obvious to detect).

> 2018-01-24 23:55:32.709832 7f44071fe700  0 osd.72 1809 _committed_osd_maps
> marked down 6 > osd_max_markdown_count 5 in last 600.000000 seconds,
> shutting down
> 2018-01-24 23:55:32.709838 7f44071fe700  1 osd.72 1809
> start_waiting_for_healthy
> 2018-01-24 23:55:32.723353 7f44011f2700  1 osd.72 pg_epoch: 1809 pg[1.a58(
> empty local-lis/les=1803/1804 n=0 ec=675/675 lis/c 1803/1803 les/c/f
> 1804/1804/0 1809/1809/1653) [70,85] r=-1 lpr=1809
> pi=[1803,1809)/1 crt=0'0 active] start_peering_interval up [70,85,72] ->
> [70,85], acting [70,85,72] -> [70,85], acting_primary 70 -> 70, up_primary
> 70 -> 70, role 2 -> -1,
> features acting 2305244844532236283 upacting 2305244844532236283
> ...
> ...
> ...
> 2018-01-24 23:55:32.746608 7f44009f1700  1 osd.72 pg_epoch: 1809 pg[1.9b(
> empty local-lis/les=1803/1804 n=0 ec=671/671 lis/c 1803/1803 les/c/f
> 1804/1804/0 1809/1809/671) [62,95] r=-1 lpr=1809 pi=[1803,1809)/1 crt=0'0
> unknown NOTIFY] state<Start>: transitioning to Stray
> 2018-01-24 23:55:32.746708 7f44071fe700  0 osd.72 1809 _committed_osd_maps
> shutdown OSD via async signal
> 2018-01-24 23:55:32.746794 7f43f89e1700 -1 Fail to open '/proc/0/cmdline'
> error = (2) No such file or directory
> 2018-01-24 23:55:32.746814 7f43f89e1700 -1 received signal: Interrupt from
> PID: 0 task name: <unknown> UID: 0
> 2018-01-24 23:55:32.746818 7f43f89e1700 -1 osd.72 1809 *** Got signal
> Interrupt ***
> 2018-01-24 23:55:32.746824 7f43f89e1700  0 osd.72 1809 prepare_to_stop
> starting shutdown
> 2018-01-24 23:55:32.746827 7f43f89e1700 -1 osd.72 1809 shutdown
> 2018-01-24 23:55:34.753347 7f43f89e1700  1
> bluestore(/var/lib/ceph/osd/ceph-72) umount
> 2018-01-24 23:55:34.871899 7f43f89e1700  1 stupidalloc shutdown
> 2018-01-24 23:55:34.871913 7f43f89e1700  1 freelist shutdown
> 2018-01-24 23:55:34.871956 7f43f89e1700  4 rocksdb:
> [/build/ceph-12.2.2/src/rocksdb/db/db_impl.cc:217] Shutdown: canceling all
> background work
> 2018-01-24 23:55:34.877019 7f43f89e1700  4 rocksdb:
> [/build/ceph-12.2.2/src/rocksdb/db/db_impl.cc:343] Shutdown complete
> 2018-01-24 23:55:34.877245 7f43f89e1700  1 bluefs umount
> 2018-01-24 23:55:34.877254 7f43f89e1700  1 stupidalloc shutdown
> 2018-01-24 23:55:34.877256 7f43f89e1700  1 stupidalloc shutdown
> 2018-01-24 23:55:34.877257 7f43f89e1700  1 stupidalloc shutdown
> 2018-01-24 23:55:34.877296 7f43f89e1700  1 bdev(0x5586c3e73440
> /var/lib/ceph/osd/ceph-72/block.wal) close
> 2018-01-24 23:55:35.148199 7f43f89e1700  1 bdev(0x5586c3e72fc0
> /var/lib/ceph/osd/ceph-72/block.db) close
> 2018-01-24 23:55:35.376184 7f43f89e1700  1 bdev(0x5586c3e73200
> /var/lib/ceph/osd/ceph-72/block) close
> 2018-01-24 23:55:35.556147 7f43f89e1700  1 bdev(0x5586c3e72d80
> /var/lib/ceph/osd/ceph-72/block) close
>
> I found this mailing list post
> (http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021275.html),
> which pointed to what may be a bug (http://tracker.ceph.com/issues/20174).
>
> But it's odd that this is happening on ONLY one of our 8 OSD nodes (each
> has 12 disks/OSDs). And I can issue a 'systemctl start
> ceph-osd@$osd_number' and it starts back up without issue.
>
> Thoughts?
>
> --
> Andre Goree
> -=-=-=-=-=-
> Email   - andre at drenet.net
> Website - http://blog.drenet.net
> PGP key - http://www.drenet.net/pubkey.html
> -=-=-=-=-=-

-- 
Cheers,
Brad
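[To follow up on the network theory: a few checks worth running from the affected node. This is an illustrative sketch only; the interface name is an assumption, and the peer address is taken from the handle_connect_msg line in the log excerpt above, not verified:]

```shell
# Per-NIC error/drop counters on the cluster-network interface
# ("eth1" is an assumed name; substitute the node's actual interface).
ip -s link show dev eth1

# Basic reachability of a peer OSD's cluster address
# (172.16.239.19 appears in the connection log above).
ping -c 5 172.16.239.19

# If flapping keeps shutting OSDs down while investigating, the markdown
# threshold (5 in 600s by default, per the log above) can be raised at
# runtime; revert once the root cause is found.
ceph tell osd.\* injectargs '--osd_max_markdown_count 10'
```

These commands run against a live cluster, so treat them as an operational sketch rather than something to paste blindly; sustained error counters on the cluster-network NIC, or packet loss to peer OSD hosts, would support Brad's connectivity diagnosis.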