Try to work out why the other osds are saying this one is down. Is it
because this osd is too busy to respond, or something else? debug_ms = 1
will show you some message debugging, which may help.
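Something like this will turn it on at runtime (syntax from memory, so
sanity-check it against your 12.2.5 build; osd.0 below just stands in for
whichever OSD is flapping):

    # bump message debugging on the flapping osd
    ceph tell osd.0 injectargs '--debug_ms 1/1'
    # or via the admin socket on the node that hosts it
    ceph daemon osd.0 config set debug_ms 1/1
    # turn it back down afterwards, the log grows quickly at this level
    ceph tell osd.0 injectargs '--debug_ms 0/0'

or persistently in ceph.conf on that node:

    [osd]
    debug ms = 1/1

At debug_ms 1 the osd log shows each message sent and received, including
the osd_ping traffic, so you can see whether the heartbeats are arriving
and going unanswered, or never arriving at all.
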
On Tue, Aug 7, 2018 at 10:34 PM, Josef Zelenka
<josef.zelenka@xxxxxxxxxxxxxxxx> wrote:
> To follow up, I did some further digging with debug_osd=20/20 and it
> appears as if there's no traffic to the OSD, even though it comes UP for
> the cluster (this started happening on another OSD in the cluster today,
> same stuff):
>
> -27> 2018-08-07 14:10:55.146531 7f9fce3cd700 10 osd.0 12560 handle_osd_ping osd.17 10.12.3.17:6811/19661 says i am down in 12566
> -26> 2018-08-07 14:10:55.146542 7f9fcebce700 10 osd.0 12560 handle_osd_ping osd.12 10.12.125.3:6807/4624236 says i am down in 12566
> -25> 2018-08-07 14:10:55.146551 7f9fcf3cf700 10 osd.0 12560 handle_osd_ping osd.13 10.12.3.17:6805/186262 says i am down in 12566
> -24> 2018-08-07 14:10:55.146564 7f9fce3cd700 20 osd.0 12559 share_map_peer 0x56308a9d0000 already has epoch 12566
> -23> 2018-08-07 14:10:55.146576 7f9fcebce700 20 osd.0 12559 share_map_peer 0x56308abb9800 already has epoch 12566
> -22> 2018-08-07 14:10:55.146590 7f9fcf3cf700 20 osd.0 12559 share_map_peer 0x56308abb1000 already has epoch 12566
> -21> 2018-08-07 14:10:55.146600 7f9fce3cd700 10 osd.0 12560 handle_osd_ping osd.15 10.12.125.3:6813/49064793 says i am down in 12566
> -20> 2018-08-07 14:10:55.146609 7f9fcebce700 10 osd.0 12560 handle_osd_ping osd.16 10.12.3.17:6801/1018363 says i am down in 12566
> -19> 2018-08-07 14:10:55.146619 7f9fcf3cf700 10 osd.0 12560 handle_osd_ping osd.11 10.12.3.16:6812/19232 says i am down in 12566
> -18> 2018-08-07 14:10:55.146643 7f9fcf3cf700 20 osd.0 12559 share_map_peer 0x56308a9d0000 already has epoch 12566
> -17> 2018-08-07 14:10:55.146653 7f9fcf3cf700 10 osd.0 12560 handle_osd_ping osd.15 10.12.3.17:6812/49064793 says i am down in 12566
> -16> 2018-08-07 14:10:55.448468 7f9fcabdd700 10 osd.0 12560 tick_without_osd_lock
> -15> 2018-08-07 14:10:55.448491 7f9fcabdd700 20 osd.0 12559 can_inc_scrubs_pending 0 -> 1 (max 1, active 0)
> -14> 2018-08-07 14:10:55.448497 7f9fcabdd700 20 osd.0 12560 scrub_time_permit should run between 0 - 24 now 14 = yes
> -13> 2018-08-07 14:10:55.448525 7f9fcabdd700 20 osd.0 12560 scrub_load_below_threshold loadavg 2.31 < daily_loadavg 2.68855 and < 15m avg 2.63 = yes
> -12> 2018-08-07 14:10:55.448535 7f9fcabdd700 20 osd.0 12560 sched_scrub load_is_low=1
> -11> 2018-08-07 14:10:55.448555 7f9fcabdd700 10 osd.0 12560 sched_scrub 15.112 scheduled at 2018-08-07 15:03:15.052952 > 2018-08-07 14:10:55.448494
> -10> 2018-08-07 14:10:55.448563 7f9fcabdd700 20 osd.0 12560 sched_scrub done
> -9> 2018-08-07 14:10:55.448565 7f9fcabdd700 10 osd.0 12559 promote_throttle_recalibrate 0 attempts, promoted 0 objects and 0 bytes; target 25 obj/sec or 5120 k bytes/sec
> -8> 2018-08-07 14:10:55.448568 7f9fcabdd700 20 osd.0 12559 promote_throttle_recalibrate new_prob 1000
> -7> 2018-08-07 14:10:55.448569 7f9fcabdd700 10 osd.0 12559 promote_throttle_recalibrate actual 0, actual/prob ratio 1, adjusted new_prob 1000, prob 1000 -> 1000
> -6> 2018-08-07 14:10:55.507159 7f9faab9d700 20 osd.0 op_wq(5) _process empty q, waiting
> -5> 2018-08-07 14:10:55.812434 7f9fb5bb3700 20 osd.0 op_wq(7) _process empty q, waiting
> -4> 2018-08-07 14:10:56.236584 7f9fcd42e700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f9fa7396700' had timed out after 60
> -3> 2018-08-07 14:10:56.236618 7f9fcd42e700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f9fb33ae700' had timed out after 60
> -2> 2018-08-07 14:10:56.236621 7f9fcd42e700 1 heartbeat_map is_healthy 'OSD::peering_tp thread 0x7f9fba3bc700' had timed out after 15
> -1> 2018-08-07 14:10:56.236640 7f9fcd42e700 1 heartbeat_map is_healthy 'OSD::peering_tp thread 0x7f9fba3bc700' had suicide timed out after 150
> 0> 2018-08-07 14:10:56.245420 7f9fba3bc700 -1 *** Caught signal (Aborted) **
> in thread 7f9fba3bc700 thread_name:tp_peering
>
> The OSD cyclically crashes and comes back up. I tried modifying the
> recovery etc. timeouts, but no luck - the situation is still the same.
> Regarding the radosgw, across all nodes, after starting the rgw process,
> I only get this:
>
> 2018-08-07 14:32:17.852785 7f482dcaf700 2 RGWDataChangesLog::ChangesRenewThread: start
>
> I found this thread in the ceph mailing list
> (http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-June/018956.html)
> but I'm not sure if this is the same thing (albeit, it's the same error),
> as I don't use s3 acls/expiration in my cluster (if it's set to a
> default, I'm not aware of it).
>
>
> On 06/08/18 16:30, Josef Zelenka wrote:
>>
>> Hi,
>>
>> I'm running a cluster on Luminous (12.2.5), Ubuntu 16.04 - the
>> configuration is 3 nodes, 6 drives each (though I have encountered this
>> on a different cluster, similar hardware, only the drives were HDD
>> instead of SSD - same usage). I have recently seen a bug(?) where one of
>> the OSDs suddenly spikes in iops and constantly restarts (trying to load
>> the journal/filemap, apparently), which renders the radosgw (the primary
>> usage of this cluster) unable to write. The only thing that helps here
>> is stopping the OSD, but that helps only until another one does a
>> similar thing. Any clue on the cause of this? Logs of the OSD when it
>> crashes are below. Thanks
>>
>> Josef
>>
>> -9920> 2018-08-06 12:12:10.588227 7f8e7afcb700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f8e56f9a700' had timed out after 60
>> -9919> 2018-08-06 12:12:10.607070 7f8e7a7ca700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f8e56f9a700' had timed out after 60
>> --
>> -1> 2018-08-06 14:12:52.428994 7f8e7982b700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f8e56f9a700' had suicide timed out after 150
>> 0> 2018-08-06 14:12:52.432088 7f8e56f9a700 -1 *** Caught signal (Aborted) **
>> in thread 7f8e56f9a700 thread_name:tp_osd_tp
>>
>> ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable)
>> 1: (()+0xa7cab4) [0x55868269aab4]
>> 2: (()+0x11390) [0x7f8e7e51d390]
>> 3: (()+0x1026d) [0x7f8e7e51c26d]
>> 4: (pthread_mutex_lock()+0x7d) [0x7f8e7e515dbd]
>> 5: (Mutex::Lock(bool)+0x49) [0x5586826bb899]
>> 6: (PG::lock(bool) const+0x33) [0x55868216ace3]
>> 7: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x844) [0x558682101044]
>> 8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x884) [0x5586826e27f4]
>> 9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5586826e5830]
>> 10: (()+0x76ba) [0x7f8e7e5136ba]
>> 11: (clone()+0x6d) [0x7f8e7d58a41d]
>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>
>> --- logging levels ---
>> 0/ 5 none
>> 0/ 0 lockdep
>> 0/ 0 context
>> 0/ 0 crush
>> 1/ 5 mds
>> 1/ 5 mds_balancer
>> 1/ 5 mds_locker
>> 1/ 5 mds_log
>> 1/ 5 mds_log_expire
>> 1/ 5 mds_migrator
>> 0/ 0 buffer
>> 0/ 0 timer
>> 0/ 0 filer
>> 0/ 1 striper
>> 0/ 0 objecter
>> 0/ 0 rados
>> 0/ 0 rbd
>> 0/ 5 rbd_mirror
>> 0/ 5 rbd_replay
>> 0/ 0 journaler
>> 0/ 0 objectcacher
>> 0/ 0 client
>> 0/ 0 osd
>> 0/ 0 optracker
>> 0/ 0 objclass
>> 0/ 0 filestore
>> 0/ 0 journal
>> 0/ 0 ms
>> 0/ 0 mon
>> 0/ 0 monc
>> 0/ 0 paxos
>> 0/ 0 tp
>> 0/ 0 auth
>> 1/ 5 crypto
>> 0/ 0 finisher
>> 1/ 1 reserver
>> 1/ 5 heartbeatmap
>> 0/ 0 perfcounter
>> 0/ 0 rgw
>> 1/10 civetweb
>> 1/ 5 javaclient
>> 0/ 0 asok
>> 0/ 0 throttle
>> 0/ 0 refs
>> 1/ 5 xio
>> 1/ 5 compressor
>> 1/ 5 bluestore
>> 1/ 5 bluefs
>> 1/ 3 bdev
>> 1/ 5 kstore
>> 4/ 5 rocksdb
>> 4/ 5 leveldb
>> 4/ 5 memdb
>> 1/ 5 kinetic
>> 1/ 5 fuse
>> 1/ 5 mgr
>> 1/ 5 mgrc
>> 1/ 5 dpdk
>> 1/ 5 eventtrace
>> -2/-2 (syslog threshold)
>> -1/-1 (stderr threshold)
>> max_recent 10000
>> max_new 1000
>> log_file /var/log/ceph/ceph-osd.7.log
>> --- end dump of recent events ---

--
Cheers,
Brad
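
P.S. One more thought on the "timed out after 60" / "suicide timed out
after 150" lines: rather than raising those limits further, I'd first look
at what the stuck tp_osd_tp/tp_peering threads are actually waiting on.
Roughly like this (commands from memory, so double-check them on your
12.2.5 build; osd.0 again stands in for the flapping OSD, run these on its
host while it is up but busy):

    # what is the osd currently chewing on / blocked behind?
    ceph daemon osd.0 dump_ops_in_flight
    ceph daemon osd.0 dump_historic_ops
    # what are the thread timeouts actually set to on this build?
    ceph daemon osd.0 config show | grep -E 'thread_(suicide_)?timeout'

As far as I remember, the osd_op_tp "suicide timed out after 150" abort is
governed by osd_op_thread_suicide_timeout (with osd_op_thread_timeout for
the plain warning). Bumping those only buys time; it doesn't fix whatever
is holding the PG lock in that backtrace.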