Do you see "internal heartbeat not healthy" messages in the log of the osd
that suicides?

On Wed, Aug 8, 2018 at 5:45 PM, Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:
> What is the load like on the osd host at the time and what does the
> disk utilization look like?
>
> Also, what does the transaction look like from one of the osds that
> sends the "you died" message with debugging osd 20 and ms 1 enabled?
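>
> (Those debug levels can usually be bumped at runtime without restarting
> anything - a minimal sketch, with osd.N standing in for whichever of the
> reporting osds you pick:
>
>   ceph tell osd.N injectargs '--debug_osd 20/20 --debug_ms 1/1'
>
> and the same command with the previous values (0/0, per the logging dump
> further down) once one of these events has been captured. For the
> load/disk side, "iostat -x 1" on the osd host and "ceph osd perf" output
> around the time of a crash would also be worth keeping.)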
>
> On Wed, Aug 8, 2018 at 5:34 PM, Josef Zelenka
> <josef.zelenka@xxxxxxxxxxxxxxxx> wrote:
>> Thank you for your suggestion - I tried it, and it really seems like the
>> other osds think the osd is dead (if I understand this right); however,
>> the networking seems absolutely fine between the nodes (no issues in
>> graphs etc.).
>>
>> -13> 2018-08-08 09:13:58.466119 7fe053d41700 1 -- 10.12.3.17:0/706864 <== osd.12 10.12.3.17:6807/4624236 81 ==== osd_ping(ping_reply e13452 stamp 2018-08-08 09:13:58.464608) v4 ==== 2004+0+0 (687351303 0 0) 0x55731eb73e00 con 0x55731e7d4800
>> -12> 2018-08-08 09:13:58.466140 7fe054542700 1 -- 10.12.3.17:0/706864 <== osd.11 10.12.3.16:6812/19232 81 ==== osd_ping(ping_reply e13452 stamp 2018-08-08 09:13:58.464608) v4 ==== 2004+0+0 (687351303 0 0) 0x55733c391200 con 0x55731e7a5800
>> -11> 2018-08-08 09:13:58.466147 7fe053540700 1 -- 10.12.125.3:0/706864 <== osd.11 10.12.125.2:6811/19232 82 ==== osd_ping(you_died e13452 stamp 2018-08-08 09:13:58.464608) v4 ==== 2004+0+0 (3111562112 0 0) 0x55731eb66800 con 0x55731e7a4000
>> -10> 2018-08-08 09:13:58.466164 7fe054542700 1 -- 10.12.3.17:0/706864 <== osd.11 10.12.3.16:6812/19232 82 ==== osd_ping(you_died e13452 stamp 2018-08-08 09:13:58.464608) v4 ==== 2004+0+0 (3111562112 0 0) 0x55733c391200 con 0x55731e7a5800
>> -9> 2018-08-08 09:13:58.466164 7fe053d41700 1 -- 10.12.3.17:0/706864 <== osd.12 10.12.3.17:6807/4624236 82 ==== osd_ping(you_died e13452 stamp 2018-08-08 09:13:58.464608) v4 ==== 2004+0+0 (3111562112 0 0) 0x55731eb73e00 con 0x55731e7d4800
>> -8> 2018-08-08 09:13:58.466176 7fe053540700 1 -- 10.12.3.17:0/706864 <== osd.9 10.12.3.16:6813/10016600 81 ==== osd_ping(ping_reply e13452 stamp 2018-08-08 09:13:58.464608) v4 ==== 2004+0+0 (687351303 0 0) 0x55731eb66800 con 0x55731e732000
>> -7> 2018-08-08 09:13:58.466200 7fe053d41700 1 -- 10.12.3.17:0/706864 <== osd.10 10.12.3.16:6810/2017908 81 ==== osd_ping(ping_reply e13452 stamp 2018-08-08 09:13:58.464608) v4 ==== 2004+0+0 (687351303 0 0) 0x55731eb73e00 con 0x55731e796800
>> -6> 2018-08-08 09:13:58.466208 7fe053540700 1 -- 10.12.3.17:0/706864 <== osd.9 10.12.3.16:6813/10016600 82 ==== osd_ping(you_died e13452 stamp 2018-08-08 09:13:58.464608) v4 ==== 2004+0+0 (3111562112 0 0) 0x55731eb66800 con 0x55731e732000
>> -5> 2018-08-08 09:13:58.466222 7fe053d41700 1 -- 10.12.3.17:0/706864 <== osd.10 10.12.3.16:6810/2017908 82 ==== osd_ping(you_died e13452 stamp 2018-08-08 09:13:58.464608) v4 ==== 2004+0+0 (3111562112 0 0) 0x55731eb73e00 con 0x55731e796800
>> -4> 2018-08-08 09:13:59.748336 7fe040531700 1 -- 10.12.3.17:6802/706864 --> 10.12.3.16:6800/1677830 -- mgrreport(unknown.13 +0-0 packed 742 osd_metrics=1) v5 -- 0x55731fa4af00 con 0
>> -3> 2018-08-08 09:13:59.748538 7fe040531700 1 -- 10.12.3.17:6802/706864 --> 10.12.3.16:6800/1677830 -- pg_stats(64 pgs tid 0 v 0) v1 -- 0x55733cbf4c00 con 0
>> -2> 2018-08-08 09:14:00.953804 7fe0525a1700 1 heartbeat_map is_healthy 'OSD::peering_tp thread 0x7fe03f52f700' had timed out after 15
>> -1> 2018-08-08 09:14:00.953857 7fe0525a1700 1 heartbeat_map is_healthy 'OSD::peering_tp thread 0x7fe03f52f700' had suicide timed out after 150
>> 0> 2018-08-08 09:14:00.970742 7fe03f52f700 -1 *** Caught signal (Aborted) **
>>
>> Could it be that the suiciding OSDs are rejecting the ping somehow? I'm
>> quite confused as to what's really going on here; it seems completely
>> random to me.
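>
> (As far as I know, a "you_died" reply just means the peer already has an
> osdmap in which your osd is marked down - it isn't the ping being rejected
> as such. The failure reports that got it marked down end up in the cluster
> log on the mons; something along these lines - assuming the default log
> location, and guessing osd.13 from the mgrreport line above - should show
> who reported it and when:
>
>   grep 'osd.13' /var/log/ceph/ceph.log | grep -iE 'fail|down'
>
> )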
>>
>> On 08/08/18 01:51, Brad Hubbard wrote:
>>>
>>> Try to work out why the other osds are saying this one is down. Is it
>>> because this osd is too busy to respond, or something else?
>>>
>>> debug_ms = 1 will show you some message debugging, which may help.
>>>
>>> On Tue, Aug 7, 2018 at 10:34 PM, Josef Zelenka
>>> <josef.zelenka@xxxxxxxxxxxxxxxx> wrote:
>>>>
>>>> To follow up, I did some further digging with debug_osd=20/20 and it
>>>> appears as if there's no traffic to the OSD, even though it comes UP
>>>> for the cluster (this started happening on another OSD in the cluster
>>>> today, same stuff):
>>>>
>>>> -27> 2018-08-07 14:10:55.146531 7f9fce3cd700 10 osd.0 12560 handle_osd_ping osd.17 10.12.3.17:6811/19661 says i am down in 12566
>>>> -26> 2018-08-07 14:10:55.146542 7f9fcebce700 10 osd.0 12560 handle_osd_ping osd.12 10.12.125.3:6807/4624236 says i am down in 12566
>>>> -25> 2018-08-07 14:10:55.146551 7f9fcf3cf700 10 osd.0 12560 handle_osd_ping osd.13 10.12.3.17:6805/186262 says i am down in 12566
>>>> -24> 2018-08-07 14:10:55.146564 7f9fce3cd700 20 osd.0 12559 share_map_peer 0x56308a9d0000 already has epoch 12566
>>>> -23> 2018-08-07 14:10:55.146576 7f9fcebce700 20 osd.0 12559 share_map_peer 0x56308abb9800 already has epoch 12566
>>>> -22> 2018-08-07 14:10:55.146590 7f9fcf3cf700 20 osd.0 12559 share_map_peer 0x56308abb1000 already has epoch 12566
>>>> -21> 2018-08-07 14:10:55.146600 7f9fce3cd700 10 osd.0 12560 handle_osd_ping osd.15 10.12.125.3:6813/49064793 says i am down in 12566
>>>> -20> 2018-08-07 14:10:55.146609 7f9fcebce700 10 osd.0 12560 handle_osd_ping osd.16 10.12.3.17:6801/1018363 says i am down in 12566
>>>> -19> 2018-08-07 14:10:55.146619 7f9fcf3cf700 10 osd.0 12560 handle_osd_ping osd.11 10.12.3.16:6812/19232 says i am down in 12566
>>>> -18> 2018-08-07 14:10:55.146643 7f9fcf3cf700 20 osd.0 12559 share_map_peer 0x56308a9d0000 already has epoch 12566
>>>> -17> 2018-08-07 14:10:55.146653 7f9fcf3cf700 10 osd.0 12560 handle_osd_ping osd.15 10.12.3.17:6812/49064793 says i am down in 12566
>>>> -16> 2018-08-07 14:10:55.448468 7f9fcabdd700 10 osd.0 12560 tick_without_osd_lock
>>>> -15> 2018-08-07 14:10:55.448491 7f9fcabdd700 20 osd.0 12559 can_inc_scrubs_pending 0 -> 1 (max 1, active 0)
>>>> -14> 2018-08-07 14:10:55.448497 7f9fcabdd700 20 osd.0 12560 scrub_time_permit should run between 0 - 24 now 14 = yes
>>>> -13> 2018-08-07 14:10:55.448525 7f9fcabdd700 20 osd.0 12560 scrub_load_below_threshold loadavg 2.31 < daily_loadavg 2.68855 and < 15m avg 2.63 = yes
>>>> -12> 2018-08-07 14:10:55.448535 7f9fcabdd700 20 osd.0 12560 sched_scrub load_is_low=1
>>>> -11> 2018-08-07 14:10:55.448555 7f9fcabdd700 10 osd.0 12560 sched_scrub 15.112 scheduled at 2018-08-07 15:03:15.052952 > 2018-08-07 14:10:55.448494
>>>> -10> 2018-08-07 14:10:55.448563 7f9fcabdd700 20 osd.0 12560 sched_scrub done
>>>> -9> 2018-08-07 14:10:55.448565 7f9fcabdd700 10 osd.0 12559 promote_throttle_recalibrate 0 attempts, promoted 0 objects and 0 bytes; target 25 obj/sec or 5120 k bytes/sec
>>>> -8> 2018-08-07 14:10:55.448568 7f9fcabdd700 20 osd.0 12559 promote_throttle_recalibrate new_prob 1000
>>>> -7> 2018-08-07 14:10:55.448569 7f9fcabdd700 10 osd.0 12559 promote_throttle_recalibrate actual 0, actual/prob ratio 1, adjusted new_prob 1000, prob 1000 -> 1000
>>>> -6> 2018-08-07 14:10:55.507159 7f9faab9d700 20 osd.0 op_wq(5) _process empty q, waiting
>>>> -5> 2018-08-07 14:10:55.812434 7f9fb5bb3700 20 osd.0 op_wq(7) _process empty q, waiting
>>>> -4> 2018-08-07 14:10:56.236584 7f9fcd42e700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f9fa7396700' had timed out after 60
>>>> -3> 2018-08-07 14:10:56.236618 7f9fcd42e700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f9fb33ae700' had timed out after 60
>>>> -2> 2018-08-07 14:10:56.236621 7f9fcd42e700 1 heartbeat_map is_healthy 'OSD::peering_tp thread 0x7f9fba3bc700' had timed out after 15
>>>> -1> 2018-08-07 14:10:56.236640 7f9fcd42e700 1 heartbeat_map is_healthy 'OSD::peering_tp thread 0x7f9fba3bc700' had suicide timed out after 150
>>>> 0> 2018-08-07 14:10:56.245420 7f9fba3bc700 -1 *** Caught signal (Aborted) **
>>>> in thread 7f9fba3bc700 thread_name:tp_peering
>>>>
>>>> The osd cyclically crashes and comes back up. I tried modifying the
>>>> recovery etc. timeouts, but no luck - the situation is still the same.
>>>> Regarding the radosgw, across all nodes, after starting the rgw process,
>>>> I only get this:
>>>>
>>>> 2018-08-07 14:32:17.852785 7f482dcaf700 2 RGWDataChangesLog::ChangesRenewThread: start
>>>>
>>>> I found this thread in the ceph mailing list
>>>> (http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-June/018956.html)
>>>> but I'm not sure if this is the same thing (albeit, it's the same error),
>>>> as I don't use s3 acls/expiration in my cluster (if it's set to a
>>>> default, I'm not aware of it).
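>>>>
>>>> (For reference, a quick way to see which of those timeouts are actually
>>>> in effect is to read them back from the osd's admin socket - assuming
>>>> the default socket location, and osd.0 as the crashing osd from the log
>>>> above:
>>>>
>>>>   ceph daemon osd.0 config show | egrep 'thread_timeout|suicide_timeout'
>>>>
>>>> osd_op_thread_timeout / osd_op_thread_suicide_timeout should be among
>>>> the ones behind the "OSD::osd_op_tp ... timed out" messages; raising
>>>> them only delays the abort, it doesn't explain why the threads are
>>>> stuck in the first place.)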
>>>>
>>>> On 06/08/18 16:30, Josef Zelenka wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I'm running a cluster on Luminous (12.2.5), Ubuntu 16.04 - the
>>>>> configuration is 3 nodes, 6 drives each (though I have encountered
>>>>> this on a different cluster with similar hardware, only the drives
>>>>> were HDD instead of SSD - same usage). I have recently seen a bug(?)
>>>>> where one of the OSDs suddenly spikes in iops and constantly restarts
>>>>> (trying to load the journal/filemap, apparently), which renders the
>>>>> radosgw (the primary usage of this cluster) unable to write. The only
>>>>> thing that helps here is stopping the OSD, but that only helps until
>>>>> another one does the same thing. Any clue on the cause of this? Logs
>>>>> of the osd when it crashes are below. Thanks
>>>>>
>>>>> Josef
>>>>>
>>>>> -9920> 2018-08-06 12:12:10.588227 7f8e7afcb700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f8e56f9a700' had timed out after 60
>>>>> -9919> 2018-08-06 12:12:10.607070 7f8e7a7ca700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f8e56f9a700' had timed out after 60
>>>>> --
>>>>> -1> 2018-08-06 14:12:52.428994 7f8e7982b700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f8e56f9a700' had suicide timed out after 150
>>>>> 0> 2018-08-06 14:12:52.432088 7f8e56f9a700 -1 *** Caught signal (Aborted) **
>>>>> in thread 7f8e56f9a700 thread_name:tp_osd_tp
>>>>>
>>>>> ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable)
>>>>> 1: (()+0xa7cab4) [0x55868269aab4]
>>>>> 2: (()+0x11390) [0x7f8e7e51d390]
>>>>> 3: (()+0x1026d) [0x7f8e7e51c26d]
>>>>> 4: (pthread_mutex_lock()+0x7d) [0x7f8e7e515dbd]
>>>>> 5: (Mutex::Lock(bool)+0x49) [0x5586826bb899]
>>>>> 6: (PG::lock(bool) const+0x33) [0x55868216ace3]
>>>>> 7: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x844) [0x558682101044]
>>>>> 8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x884) [0x5586826e27f4]
>>>>> 9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5586826e5830]
>>>>> 10: (()+0x76ba) [0x7f8e7e5136ba]
>>>>> 11: (clone()+0x6d) [0x7f8e7d58a41d]
>>>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
>>>>> to interpret this.
>>>>>
>>>>> --- logging levels ---
>>>>> 0/ 5 none
>>>>> 0/ 0 lockdep
>>>>> 0/ 0 context
>>>>> 0/ 0 crush
>>>>> 1/ 5 mds
>>>>> 1/ 5 mds_balancer
>>>>> 1/ 5 mds_locker
>>>>> 1/ 5 mds_log
>>>>> 1/ 5 mds_log_expire
>>>>> 1/ 5 mds_migrator
>>>>> 0/ 0 buffer
>>>>> 0/ 0 timer
>>>>> 0/ 0 filer
>>>>> 0/ 1 striper
>>>>> 0/ 0 objecter
>>>>> 0/ 0 rados
>>>>> 0/ 0 rbd
>>>>> 0/ 5 rbd_mirror
>>>>> 0/ 5 rbd_replay
>>>>> 0/ 0 journaler
>>>>> 0/ 0 objectcacher
>>>>> 0/ 0 client
>>>>> 0/ 0 osd
>>>>> 0/ 0 optracker
>>>>> 0/ 0 objclass
>>>>> 0/ 0 filestore
>>>>> 0/ 0 journal
>>>>> 0/ 0 ms
>>>>> 0/ 0 mon
>>>>> 0/ 0 monc
>>>>> 0/ 0 paxos
>>>>> 0/ 0 tp
>>>>> 0/ 0 auth
>>>>> 1/ 5 crypto
>>>>> 0/ 0 finisher
>>>>> 1/ 1 reserver
>>>>> 1/ 5 heartbeatmap
>>>>> 0/ 0 perfcounter
>>>>> 0/ 0 rgw
>>>>> 1/10 civetweb
>>>>> 1/ 5 javaclient
>>>>> 0/ 0 asok
>>>>> 0/ 0 throttle
>>>>> 0/ 0 refs
>>>>> 1/ 5 xio
>>>>> 1/ 5 compressor
>>>>> 1/ 5 bluestore
>>>>> 1/ 5 bluefs
>>>>> 1/ 3 bdev
>>>>> 1/ 5 kstore
>>>>> 4/ 5 rocksdb
>>>>> 4/ 5 leveldb
>>>>> 4/ 5 memdb
>>>>> 1/ 5 kinetic
>>>>> 1/ 5 fuse
>>>>> 1/ 5 mgr
>>>>> 1/ 5 mgrc
>>>>> 1/ 5 dpdk
>>>>> 1/ 5 eventtrace
>>>>> -2/-2 (syslog threshold)
>>>>> -1/-1 (stderr threshold)
>>>>> max_recent 10000
>>>>> max_new 1000
>>>>> log_file /var/log/ceph/ceph-osd.7.log
>>>>> --- end dump of recent events ---
>
> --
> Cheers,
> Brad

--
Cheers,
Brad
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com