To follow up, I did some further digging with debug_osd=20/20, and it appears there is no traffic to the OSD at all, even though it is marked up in the cluster (this started happening on another OSD in the cluster today, same symptoms).
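In case it helps anyone follow along, the debug level can be raised on the live daemon with something along these lines (a sketch; osd.0 is just the affected OSD here):

    ceph tell osd.0 injectargs '--debug_osd 20/20'
    # or, on the OSD's host, via the admin socket:
    ceph daemon osd.0 config set debug_osd 20/20

With that in place, the tail of the OSD log just before the crash looks like this: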
-27> 2018-08-07 14:10:55.146531 7f9fce3cd700 10 osd.0 12560
handle_osd_ping osd.17 10.12.3.17:6811/19661 says i am down in 12566
-26> 2018-08-07 14:10:55.146542 7f9fcebce700 10 osd.0 12560
handle_osd_ping osd.12 10.12.125.3:6807/4624236 says i am down in 12566
-25> 2018-08-07 14:10:55.146551 7f9fcf3cf700 10 osd.0 12560
handle_osd_ping osd.13 10.12.3.17:6805/186262 says i am down in 12566
-24> 2018-08-07 14:10:55.146564 7f9fce3cd700 20 osd.0 12559
share_map_peer 0x56308a9d0000 already has epoch 12566
-23> 2018-08-07 14:10:55.146576 7f9fcebce700 20 osd.0 12559
share_map_peer 0x56308abb9800 already has epoch 12566
-22> 2018-08-07 14:10:55.146590 7f9fcf3cf700 20 osd.0 12559
share_map_peer 0x56308abb1000 already has epoch 12566
-21> 2018-08-07 14:10:55.146600 7f9fce3cd700 10 osd.0 12560
handle_osd_ping osd.15 10.12.125.3:6813/49064793 says i am down in 12566
-20> 2018-08-07 14:10:55.146609 7f9fcebce700 10 osd.0 12560
handle_osd_ping osd.16 10.12.3.17:6801/1018363 says i am down in 12566
-19> 2018-08-07 14:10:55.146619 7f9fcf3cf700 10 osd.0 12560
handle_osd_ping osd.11 10.12.3.16:6812/19232 says i am down in 12566
-18> 2018-08-07 14:10:55.146643 7f9fcf3cf700 20 osd.0 12559
share_map_peer 0x56308a9d0000 already has epoch 12566
-17> 2018-08-07 14:10:55.146653 7f9fcf3cf700 10 osd.0 12560
handle_osd_ping osd.15 10.12.3.17:6812/49064793 says i am down in 12566
-16> 2018-08-07 14:10:55.448468 7f9fcabdd700 10 osd.0 12560
tick_without_osd_lock
-15> 2018-08-07 14:10:55.448491 7f9fcabdd700 20 osd.0 12559
can_inc_scrubs_pending 0 -> 1 (max 1, active 0)
-14> 2018-08-07 14:10:55.448497 7f9fcabdd700 20 osd.0 12560
scrub_time_permit should run between 0 - 24 now 14 = yes
-13> 2018-08-07 14:10:55.448525 7f9fcabdd700 20 osd.0 12560
scrub_load_below_threshold loadavg 2.31 < daily_loadavg 2.68855 and <
15m avg 2.63 = yes
-12> 2018-08-07 14:10:55.448535 7f9fcabdd700 20 osd.0 12560
sched_scrub load_is_low=1
-11> 2018-08-07 14:10:55.448555 7f9fcabdd700 10 osd.0 12560
sched_scrub 15.112 scheduled at 2018-08-07 15:03:15.052952 > 2018-08-07
14:10:55.448494
-10> 2018-08-07 14:10:55.448563 7f9fcabdd700 20 osd.0 12560
sched_scrub done
-9> 2018-08-07 14:10:55.448565 7f9fcabdd700 10 osd.0 12559
promote_throttle_recalibrate 0 attempts, promoted 0 objects and 0 bytes;
target 25 obj/sec or 5120 k bytes/sec
-8> 2018-08-07 14:10:55.448568 7f9fcabdd700 20 osd.0 12559
promote_throttle_recalibrate new_prob 1000
-7> 2018-08-07 14:10:55.448569 7f9fcabdd700 10 osd.0 12559
promote_throttle_recalibrate actual 0, actual/prob ratio 1, adjusted
new_prob 1000, prob 1000 -> 1000
-6> 2018-08-07 14:10:55.507159 7f9faab9d700 20 osd.0 op_wq(5)
_process empty q, waiting
-5> 2018-08-07 14:10:55.812434 7f9fb5bb3700 20 osd.0 op_wq(7)
_process empty q, waiting
-4> 2018-08-07 14:10:56.236584 7f9fcd42e700 1 heartbeat_map
is_healthy 'OSD::osd_op_tp thread 0x7f9fa7396700' had timed out after 60
-3> 2018-08-07 14:10:56.236618 7f9fcd42e700 1 heartbeat_map
is_healthy 'OSD::osd_op_tp thread 0x7f9fb33ae700' had timed out after 60
-2> 2018-08-07 14:10:56.236621 7f9fcd42e700 1 heartbeat_map
is_healthy 'OSD::peering_tp thread 0x7f9fba3bc700' had timed out after 15
-1> 2018-08-07 14:10:56.236640 7f9fcd42e700 1 heartbeat_map
is_healthy 'OSD::peering_tp thread 0x7f9fba3bc700' had suicide timed out
after 150
0> 2018-08-07 14:10:56.245420 7f9fba3bc700 -1 *** Caught signal
(Aborted) **
in thread 7f9fba3bc700 thread_name:tp_peering
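For anyone wanting to cross-check the "says i am down" lines against the monitors' view, the standard commands should be enough (again a sketch, with osd.0 as the affected OSD):

    ceph osd dump | grep '^osd.0 '    # what the current osdmap says (up/down, in/out, weight)
    ceph daemon osd.0 status          # what the daemon itself thinks (state, oldest/newest map epoch)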
The OSD cyclically crashes and comes back up. I tried modifying the recovery and related timeouts, but no luck - the situation is still the same.
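To be concrete about which knobs I mean: the heartbeat_map messages above come from the OSD thread-pool timeouts, so the relevant ceph.conf overrides would look roughly like this (values are purely illustrative, and the exact option-to-threadpool mapping is my assumption):

    [osd]
    osd_op_thread_timeout = 60             # presumably the "timed out after 60" for osd_op_tp
    osd_op_thread_suicide_timeout = 300    # the "suicide timed out" threshold; raising it only delays the abort
    osd_recovery_thread_timeout = 60       # recovery thread pool

Raising them didn't change the outcome here, which suggests the thread is genuinely stuck rather than just slow.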
Regarding radosgw: across all nodes, after starting the rgw process, I only get this:
2018-08-07 14:32:17.852785 7f482dcaf700 2
RGWDataChangesLog::ChangesRenewThread: start
I found this thread on the ceph mailing list
(http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-June/018956.html),
but I'm not sure it's the same issue (albeit it's the same error), as I
don't use S3 ACLs/expiration in my cluster (if it's enabled by default,
I'm not aware of it).
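For completeness, checking whether any lifecycle/expiration configuration exists at all should be as simple as this (assuming the lc subcommand behaves the same on 12.2.5):

    radosgw-admin lc list    # lists buckets that have a lifecycle configuration; empty output = none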
On 06/08/18 16:30, Josef Zelenka wrote:
Hi,
I'm running a cluster on Luminous (12.2.5), Ubuntu 16.04 - the
configuration is 3 nodes with 6 drives each (though I have encountered
this on a different cluster with similar hardware, only with HDDs
instead of SSDs - same usage). I have recently seen a bug(?) where one
of the OSDs suddenly spikes in IOPS and constantly restarts (apparently
trying to load the journal/filemap), which renders the radosgw (the
primary use of this cluster) unable to write. The only thing that helps
is stopping the OSD, but that only works until another OSD does the
same thing. Any clue on the cause of this? Logs of the OSD when it
crashes are below. Thanks
Josef
-9920> 2018-08-06 12:12:10.588227 7f8e7afcb700 1 heartbeat_map
is_healthy 'OSD::osd_op_tp thread 0x7f8e56f9a700' had timed out after 60
-9919> 2018-08-06 12:12:10.607070 7f8e7a7ca700 1 heartbeat_map
is_healthy 'OSD::osd_op_tp thread 0x7f8e56f9a700' had timed out after 60
--
-1> 2018-08-06 14:12:52.428994 7f8e7982b700 1 heartbeat_map
is_healthy 'OSD::osd_op_tp thread 0x7f8e56f9a700' had suicide timed
out after 150
0> 2018-08-06 14:12:52.432088 7f8e56f9a700 -1 *** Caught signal
(Aborted) **
in thread 7f8e56f9a700 thread_name:tp_osd_tp
ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a)
luminous (stable)
1: (()+0xa7cab4) [0x55868269aab4]
2: (()+0x11390) [0x7f8e7e51d390]
3: (()+0x1026d) [0x7f8e7e51c26d]
4: (pthread_mutex_lock()+0x7d) [0x7f8e7e515dbd]
5: (Mutex::Lock(bool)+0x49) [0x5586826bb899]
6: (PG::lock(bool) const+0x33) [0x55868216ace3]
7: (OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0x844) [0x558682101044]
8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x884)
[0x5586826e27f4]
9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5586826e5830]
10: (()+0x76ba) [0x7f8e7e5136ba]
11: (clone()+0x6d) [0x7f8e7d58a41d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 0 lockdep
0/ 0 context
0/ 0 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 0 buffer
0/ 0 timer
0/ 0 filer
0/ 1 striper
0/ 0 objecter
0/ 0 rados
0/ 0 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 0 journaler
0/ 0 objectcacher
0/ 0 client
0/ 0 osd
0/ 0 optracker
0/ 0 objclass
0/ 0 filestore
0/ 0 journal
0/ 0 ms
0/ 0 mon
0/ 0 monc
0/ 0 paxos
0/ 0 tp
0/ 0 auth
1/ 5 crypto
0/ 0 finisher
1/ 1 reserver
1/ 5 heartbeatmap
0/ 0 perfcounter
0/ 0 rgw
1/10 civetweb
1/ 5 javaclient
0/ 0 asok
0/ 0 throttle
0/ 0 refs
1/ 5 xio
1/ 5 compressor
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
4/ 5 memdb
1/ 5 kinetic
1/ 5 fuse
1/ 5 mgr
1/ 5 mgrc
1/ 5 dpdk
1/ 5 eventtrace
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.7.log
--- end dump of recent events ---