Do you see "internal heartbeat not healthy" messages in the log of the osd
that suicides?

On Wed, Aug 8, 2018 at 5:45 PM, Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:
> What is the load like on the osd host at the time and what does the
> disk utilization look like?
>
> Also, what does the transaction look like from one of the osds that
> sends the "you died" message with debugging osd 20 and ms 1 enabled?
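>
> (Those debug levels can usually be bumped at runtime without restarting
> anything - a minimal sketch, with osd.N standing in for whichever of the
> reporting osds you pick:
>
>   ceph tell osd.N injectargs '--debug_osd 20/20 --debug_ms 1/1'
>
> and the same command with the previous values (0/0, per the logging dump
> further down) once one of these events has been captured. For the
> load/disk side, "iostat -x 1" on the osd host and "ceph osd perf" output
> around the time of a crash would also be worth keeping.)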
>
> On Wed, Aug 8, 2018 at 5:34 PM, Josef Zelenka
> <josef.zelenka@xxxxxxxxxxxxxxxx> wrote:
>> Thank you for your suggestion - I tried it, and it really seems like the
>> other osds think the osd is dead (if I understand this right); however,
>> the networking seems absolutely fine between the nodes (no issues in
>> graphs etc.).
>>
>> -13> 2018-08-08 09:13:58.466119 7fe053d41700 1 -- 10.12.3.17:0/706864 <== osd.12 10.12.3.17:6807/4624236 81 ==== osd_ping(ping_reply e13452 stamp 2018-08-08 09:13:58.464608) v4 ==== 2004+0+0 (687351303 0 0) 0x55731eb73e00 con 0x55731e7d4800
>> -12> 2018-08-08 09:13:58.466140 7fe054542700 1 -- 10.12.3.17:0/706864 <== osd.11 10.12.3.16:6812/19232 81 ==== osd_ping(ping_reply e13452 stamp 2018-08-08 09:13:58.464608) v4 ==== 2004+0+0 (687351303 0 0) 0x55733c391200 con 0x55731e7a5800
>> -11> 2018-08-08 09:13:58.466147 7fe053540700 1 -- 10.12.125.3:0/706864 <== osd.11 10.12.125.2:6811/19232 82 ==== osd_ping(you_died e13452 stamp 2018-08-08 09:13:58.464608) v4 ==== 2004+0+0 (3111562112 0 0) 0x55731eb66800 con 0x55731e7a4000
>> -10> 2018-08-08 09:13:58.466164 7fe054542700 1 -- 10.12.3.17:0/706864 <== osd.11 10.12.3.16:6812/19232 82 ==== osd_ping(you_died e13452 stamp 2018-08-08 09:13:58.464608) v4 ==== 2004+0+0 (3111562112 0 0) 0x55733c391200 con 0x55731e7a5800
>> -9> 2018-08-08 09:13:58.466164 7fe053d41700 1 -- 10.12.3.17:0/706864 <== osd.12 10.12.3.17:6807/4624236 82 ==== osd_ping(you_died e13452 stamp 2018-08-08 09:13:58.464608) v4 ==== 2004+0+0 (3111562112 0 0) 0x55731eb73e00 con 0x55731e7d4800
>> -8> 2018-08-08 09:13:58.466176 7fe053540700 1 -- 10.12.3.17:0/706864 <== osd.9 10.12.3.16:6813/10016600 81 ==== osd_ping(ping_reply e13452 stamp 2018-08-08 09:13:58.464608) v4 ==== 2004+0+0 (687351303 0 0) 0x55731eb66800 con 0x55731e732000
>> -7> 2018-08-08 09:13:58.466200 7fe053d41700 1 -- 10.12.3.17:0/706864 <== osd.10 10.12.3.16:6810/2017908 81 ==== osd_ping(ping_reply e13452 stamp 2018-08-08 09:13:58.464608) v4 ==== 2004+0+0 (687351303 0 0) 0x55731eb73e00 con 0x55731e796800
>> -6> 2018-08-08 09:13:58.466208 7fe053540700 1 -- 10.12.3.17:0/706864 <== osd.9 10.12.3.16:6813/10016600 82 ==== osd_ping(you_died e13452 stamp 2018-08-08 09:13:58.464608) v4 ==== 2004+0+0 (3111562112 0 0) 0x55731eb66800 con 0x55731e732000
>> -5> 2018-08-08 09:13:58.466222 7fe053d41700 1 -- 10.12.3.17:0/706864 <== osd.10 10.12.3.16:6810/2017908 82 ==== osd_ping(you_died e13452 stamp 2018-08-08 09:13:58.464608) v4 ==== 2004+0+0 (3111562112 0 0) 0x55731eb73e00 con 0x55731e796800
>> -4> 2018-08-08 09:13:59.748336 7fe040531700 1 -- 10.12.3.17:6802/706864 --> 10.12.3.16:6800/1677830 -- mgrreport(unknown.13 +0-0 packed 742 osd_metrics=1) v5 -- 0x55731fa4af00 con 0
>> -3> 2018-08-08 09:13:59.748538 7fe040531700 1 -- 10.12.3.17:6802/706864 --> 10.12.3.16:6800/1677830 -- pg_stats(64 pgs tid 0 v 0) v1 -- 0x55733cbf4c00 con 0
>> -2> 2018-08-08 09:14:00.953804 7fe0525a1700 1 heartbeat_map is_healthy 'OSD::peering_tp thread 0x7fe03f52f700' had timed out after 15
>> -1> 2018-08-08 09:14:00.953857 7fe0525a1700 1 heartbeat_map is_healthy 'OSD::peering_tp thread 0x7fe03f52f700' had suicide timed out after 150
>> 0> 2018-08-08 09:14:00.970742 7fe03f52f700 -1 *** Caught signal (Aborted) **
>>
>> Could it be that the suiciding OSDs are rejecting the ping somehow? I'm
>> quite confused as to what's really going on here; it seems completely
>> random to me.
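>
> (As far as I know, a "you_died" reply just means the peer already has an
> osdmap in which your osd is marked down - it isn't the ping being rejected
> as such. The failure reports that got it marked down end up in the cluster
> log on the mons; something along these lines - assuming the default log
> location, and guessing osd.13 from the mgrreport line above - should show
> who reported it and when:
>
>   grep 'osd.13' /var/log/ceph/ceph.log | grep -iE 'fail|down'
>
> )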
>>
>> On 08/08/18 01:51, Brad Hubbard wrote:
>>>
>>> Try to work out why the other osds are saying this one is down. Is it
>>> because this osd is too busy to respond, or something else?
>>>
>>> debug_ms = 1 will show you some message debugging, which may help.
>>>
>>> On Tue, Aug 7, 2018 at 10:34 PM, Josef Zelenka
>>> <josef.zelenka@xxxxxxxxxxxxxxxx> wrote:
>>>>
>>>> To follow up, I did some further digging with debug_osd=20/20 and it
>>>> appears as if there's no traffic to the OSD, even though it comes UP
>>>> for the cluster (this started happening on another OSD in the cluster
>>>> today, same stuff):
>>>>
>>>> -27> 2018-08-07 14:10:55.146531 7f9fce3cd700 10 osd.0 12560 handle_osd_ping osd.17 10.12.3.17:6811/19661 says i am down in 12566
>>>> -26> 2018-08-07 14:10:55.146542 7f9fcebce700 10 osd.0 12560 handle_osd_ping osd.12 10.12.125.3:6807/4624236 says i am down in 12566
>>>> -25> 2018-08-07 14:10:55.146551 7f9fcf3cf700 10 osd.0 12560 handle_osd_ping osd.13 10.12.3.17:6805/186262 says i am down in 12566
>>>> -24> 2018-08-07 14:10:55.146564 7f9fce3cd700 20 osd.0 12559 share_map_peer 0x56308a9d0000 already has epoch 12566
>>>> -23> 2018-08-07 14:10:55.146576 7f9fcebce700 20 osd.0 12559 share_map_peer 0x56308abb9800 already has epoch 12566
>>>> -22> 2018-08-07 14:10:55.146590 7f9fcf3cf700 20 osd.0 12559 share_map_peer 0x56308abb1000 already has epoch 12566
>>>> -21> 2018-08-07 14:10:55.146600 7f9fce3cd700 10 osd.0 12560 handle_osd_ping osd.15 10.12.125.3:6813/49064793 says i am down in 12566
>>>> -20> 2018-08-07 14:10:55.146609 7f9fcebce700 10 osd.0 12560 handle_osd_ping osd.16 10.12.3.17:6801/1018363 says i am down in 12566
>>>> -19> 2018-08-07 14:10:55.146619 7f9fcf3cf700 10 osd.0 12560 handle_osd_ping osd.11 10.12.3.16:6812/19232 says i am down in 12566
>>>> -18> 2018-08-07 14:10:55.146643 7f9fcf3cf700 20 osd.0 12559 share_map_peer 0x56308a9d0000 already has epoch 12566
>>>> -17> 2018-08-07 14:10:55.146653 7f9fcf3cf700 10 osd.0 12560 handle_osd_ping osd.15 10.12.3.17:6812/49064793 says i am down in 12566
>>>> -16> 2018-08-07 14:10:55.448468 7f9fcabdd700 10 osd.0 12560 tick_without_osd_lock
>>>> -15> 2018-08-07 14:10:55.448491 7f9fcabdd700 20 osd.0 12559 can_inc_scrubs_pending 0 -> 1 (max 1, active 0)
>>>> -14> 2018-08-07 14:10:55.448497 7f9fcabdd700 20 osd.0 12560 scrub_time_permit should run between 0 - 24 now 14 = yes
>>>> -13> 2018-08-07 14:10:55.448525 7f9fcabdd700 20 osd.0 12560 scrub_load_below_threshold loadavg 2.31 < daily_loadavg 2.68855 and < 15m avg 2.63 = yes
>>>> -12> 2018-08-07 14:10:55.448535 7f9fcabdd700 20 osd.0 12560 sched_scrub load_is_low=1
>>>> -11> 2018-08-07 14:10:55.448555 7f9fcabdd700 10 osd.0 12560 sched_scrub 15.112 scheduled at 2018-08-07 15:03:15.052952 > 2018-08-07 14:10:55.448494
>>>> -10> 2018-08-07 14:10:55.448563 7f9fcabdd700 20 osd.0 12560 sched_scrub done
>>>> -9> 2018-08-07 14:10:55.448565 7f9fcabdd700 10 osd.0 12559 promote_throttle_recalibrate 0 attempts, promoted 0 objects and 0 bytes; target 25 obj/sec or 5120 k bytes/sec
>>>> -8> 2018-08-07 14:10:55.448568 7f9fcabdd700 20 osd.0 12559 promote_throttle_recalibrate new_prob 1000
>>>> -7> 2018-08-07 14:10:55.448569 7f9fcabdd700 10 osd.0 12559 promote_throttle_recalibrate actual 0, actual/prob ratio 1, adjusted new_prob 1000, prob 1000 -> 1000
>>>> -6> 2018-08-07 14:10:55.507159 7f9faab9d700 20 osd.0 op_wq(5) _process empty q, waiting
>>>> -5> 2018-08-07 14:10:55.812434 7f9fb5bb3700 20 osd.0 op_wq(7) _process empty q, waiting
>>>> -4> 2018-08-07 14:10:56.236584 7f9fcd42e700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f9fa7396700' had timed out after 60
>>>> -3> 2018-08-07 14:10:56.236618 7f9fcd42e700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f9fb33ae700' had timed out after 60
>>>> -2> 2018-08-07 14:10:56.236621 7f9fcd42e700 1 heartbeat_map is_healthy 'OSD::peering_tp thread 0x7f9fba3bc700' had timed out after 15
>>>> -1> 2018-08-07 14:10:56.236640 7f9fcd42e700 1 heartbeat_map is_healthy 'OSD::peering_tp thread 0x7f9fba3bc700' had suicide timed out after 150
>>>> 0> 2018-08-07 14:10:56.245420 7f9fba3bc700 -1 *** Caught signal (Aborted) **
>>>> in thread 7f9fba3bc700 thread_name:tp_peering
>>>>
>>>> The osd cyclically crashes and comes back up. I tried modifying the
>>>> recovery etc. timeouts, but no luck - the situation is still the same.
>>>> Regarding the radosgw, across all nodes, after starting the rgw process,
>>>> I only get this:
>>>>
>>>> 2018-08-07 14:32:17.852785 7f482dcaf700 2 RGWDataChangesLog::ChangesRenewThread: start
>>>>
>>>> I found this thread in the ceph mailing list
>>>> (http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-June/018956.html)
>>>> but I'm not sure if this is the same thing (albeit, it's the same error),
>>>> as I don't use s3 acls/expiration in my cluster (if it's set to a
>>>> default, I'm not aware of it).
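>>>>
>>>> (For reference, a quick way to see which of those timeouts are actually
>>>> in effect is to read them back from the osd's admin socket - assuming
>>>> the default socket location, and osd.0 as the crashing osd from the log
>>>> above:
>>>>
>>>>   ceph daemon osd.0 config show | egrep 'thread_timeout|suicide_timeout'
>>>>
>>>> osd_op_thread_timeout / osd_op_thread_suicide_timeout should be among
>>>> the ones behind the "OSD::osd_op_tp ... timed out" messages; raising
>>>> them only delays the abort, it doesn't explain why the threads are
>>>> stuck in the first place.)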
>>>>
>>>> On 06/08/18 16:30, Josef Zelenka wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I'm running a cluster on Luminous (12.2.5), Ubuntu 16.04 - the
>>>>> configuration is 3 nodes, 6 drives each (though I have encountered
>>>>> this on a different cluster with similar hardware, only the drives
>>>>> were HDD instead of SSD - same usage). I have recently seen a bug(?)
>>>>> where one of the OSDs suddenly spikes in iops and constantly restarts
>>>>> (trying to load the journal/filemap, apparently), which renders the
>>>>> radosgw (the primary usage of this cluster) unable to write. The only
>>>>> thing that helps here is stopping the OSD, but that only helps until
>>>>> another one does the same thing. Any clue on the cause of this? Logs
>>>>> of the osd when it crashes are below. Thanks
>>>>>
>>>>> Josef
>>>>>
>>>>> -9920> 2018-08-06 12:12:10.588227 7f8e7afcb700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f8e56f9a700' had timed out after 60
>>>>> -9919> 2018-08-06 12:12:10.607070 7f8e7a7ca700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f8e56f9a700' had timed out after 60
>>>>> --
>>>>> -1> 2018-08-06 14:12:52.428994 7f8e7982b700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f8e56f9a700' had suicide timed out after 150
>>>>> 0> 2018-08-06 14:12:52.432088 7f8e56f9a700 -1 *** Caught signal (Aborted) **
>>>>> in thread 7f8e56f9a700 thread_name:tp_osd_tp
>>>>>
>>>>> ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable)
>>>>> 1: (()+0xa7cab4) [0x55868269aab4]
>>>>> 2: (()+0x11390) [0x7f8e7e51d390]
>>>>> 3: (()+0x1026d) [0x7f8e7e51c26d]
>>>>> 4: (pthread_mutex_lock()+0x7d) [0x7f8e7e515dbd]
>>>>> 5: (Mutex::Lock(bool)+0x49) [0x5586826bb899]
>>>>> 6: (PG::lock(bool) const+0x33) [0x55868216ace3]
>>>>> 7: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x844) [0x558682101044]
>>>>> 8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x884) [0x5586826e27f4]
>>>>> 9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5586826e5830]
>>>>> 10: (()+0x76ba) [0x7f8e7e5136ba]
>>>>> 11: (clone()+0x6d) [0x7f8e7d58a41d]
>>>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
>>>>> to interpret this.
>>>>>
>>>>> --- logging levels ---
>>>>> 0/ 5 none
>>>>> 0/ 0 lockdep
>>>>> 0/ 0 context
>>>>> 0/ 0 crush
>>>>> 1/ 5 mds
>>>>> 1/ 5 mds_balancer
>>>>> 1/ 5 mds_locker
>>>>> 1/ 5 mds_log
>>>>> 1/ 5 mds_log_expire
>>>>> 1/ 5 mds_migrator
>>>>> 0/ 0 buffer
>>>>> 0/ 0 timer
>>>>> 0/ 0 filer
>>>>> 0/ 1 striper
>>>>> 0/ 0 objecter
>>>>> 0/ 0 rados
>>>>> 0/ 0 rbd
>>>>> 0/ 5 rbd_mirror
>>>>> 0/ 5 rbd_replay
>>>>> 0/ 0 journaler
>>>>> 0/ 0 objectcacher
>>>>> 0/ 0 client
>>>>> 0/ 0 osd
>>>>> 0/ 0 optracker
>>>>> 0/ 0 objclass
>>>>> 0/ 0 filestore
>>>>> 0/ 0 journal
>>>>> 0/ 0 ms
>>>>> 0/ 0 mon
>>>>> 0/ 0 monc
>>>>> 0/ 0 paxos
>>>>> 0/ 0 tp
>>>>> 0/ 0 auth
>>>>> 1/ 5 crypto
>>>>> 0/ 0 finisher
>>>>> 1/ 1 reserver
>>>>> 1/ 5 heartbeatmap
>>>>> 0/ 0 perfcounter
>>>>> 0/ 0 rgw
>>>>> 1/10 civetweb
>>>>> 1/ 5 javaclient
>>>>> 0/ 0 asok
>>>>> 0/ 0 throttle
>>>>> 0/ 0 refs
>>>>> 1/ 5 xio
>>>>> 1/ 5 compressor
>>>>> 1/ 5 bluestore
>>>>> 1/ 5 bluefs
>>>>> 1/ 3 bdev
>>>>> 1/ 5 kstore
>>>>> 4/ 5 rocksdb
>>>>> 4/ 5 leveldb
>>>>> 4/ 5 memdb
>>>>> 1/ 5 kinetic
>>>>> 1/ 5 fuse
>>>>> 1/ 5 mgr
>>>>> 1/ 5 mgrc
>>>>> 1/ 5 dpdk
>>>>> 1/ 5 eventtrace
>>>>> -2/-2 (syslog threshold)
>>>>> -1/-1 (stderr threshold)
>>>>> max_recent 10000
>>>>> max_new 1000
>>>>> log_file /var/log/ceph/ceph-osd.7.log
>>>>> --- end dump of recent events ---
>
> --
> Cheers,
> Brad

--
Cheers,
Brad
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com