Re: OSD had suicide timed out

Try to work out why the other OSDs are saying this one is down. Is it
because this OSD is too busy to respond, or something else?
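
A few things worth checking on the affected host (a rough sketch; I'm
assuming osd.0 is the flapping OSD and that its admin socket is in the
default location):

  # is the disk saturated while the OSD spins?
  iostat -x 1

  # are ops piling up inside the OSD?
  ceph daemon osd.0 dump_ops_in_flight
  ceph daemon osd.0 dump_historic_ops

  # what state does the OSD itself think it is in?
  ceph daemon osd.0 status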

debug_ms = 1 will show you some message debugging, which may help.
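
If it helps, that can be turned on at runtime without restarting anything
(again just a sketch, assuming osd.0 is the affected daemon):

  ceph tell osd.0 injectargs '--debug_ms 1'
  # or persistently, under [osd] in ceph.conf:
  #   debug ms = 1

Just remember to turn it back off afterwards, as it is fairly chatty.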

On Tue, Aug 7, 2018 at 10:34 PM, Josef Zelenka
<josef.zelenka@xxxxxxxxxxxxxxxx> wrote:
> To follow up, I did some further digging with debug_osd=20/20 and it appears
> as if there's no traffic to the OSD, even though it comes up for the cluster
> (the same thing started happening on another OSD in the cluster today):
>
>    -27> 2018-08-07 14:10:55.146531 7f9fce3cd700 10 osd.0 12560
> handle_osd_ping osd.17 10.12.3.17:6811/19661 says i am down in 12566
>    -26> 2018-08-07 14:10:55.146542 7f9fcebce700 10 osd.0 12560
> handle_osd_ping osd.12 10.12.125.3:6807/4624236 says i am down in 12566
>    -25> 2018-08-07 14:10:55.146551 7f9fcf3cf700 10 osd.0 12560
> handle_osd_ping osd.13 10.12.3.17:6805/186262 says i am down in 12566
>    -24> 2018-08-07 14:10:55.146564 7f9fce3cd700 20 osd.0 12559
> share_map_peer 0x56308a9d0000 already has epoch 12566
>    -23> 2018-08-07 14:10:55.146576 7f9fcebce700 20 osd.0 12559
> share_map_peer 0x56308abb9800 already has epoch 12566
>    -22> 2018-08-07 14:10:55.146590 7f9fcf3cf700 20 osd.0 12559
> share_map_peer 0x56308abb1000 already has epoch 12566
>    -21> 2018-08-07 14:10:55.146600 7f9fce3cd700 10 osd.0 12560
> handle_osd_ping osd.15 10.12.125.3:6813/49064793 says i am down in 12566
>    -20> 2018-08-07 14:10:55.146609 7f9fcebce700 10 osd.0 12560
> handle_osd_ping osd.16 10.12.3.17:6801/1018363 says i am down in 12566
>    -19> 2018-08-07 14:10:55.146619 7f9fcf3cf700 10 osd.0 12560
> handle_osd_ping osd.11 10.12.3.16:6812/19232 says i am down in 12566
>    -18> 2018-08-07 14:10:55.146643 7f9fcf3cf700 20 osd.0 12559
> share_map_peer 0x56308a9d0000 already has epoch 12566
>    -17> 2018-08-07 14:10:55.146653 7f9fcf3cf700 10 osd.0 12560
> handle_osd_ping osd.15 10.12.3.17:6812/49064793 says i am down in 12566
>    -16> 2018-08-07 14:10:55.448468 7f9fcabdd700 10 osd.0 12560
> tick_without_osd_lock
>    -15> 2018-08-07 14:10:55.448491 7f9fcabdd700 20 osd.0 12559
> can_inc_scrubs_pending 0 -> 1 (max 1, active 0)
>    -14> 2018-08-07 14:10:55.448497 7f9fcabdd700 20 osd.0 12560
> scrub_time_permit should run between 0 - 24 now 14 = yes
>    -13> 2018-08-07 14:10:55.448525 7f9fcabdd700 20 osd.0 12560
> scrub_load_below_threshold loadavg 2.31 < daily_loadavg 2.68855 and < 15m
> avg 2.63 = yes
>    -12> 2018-08-07 14:10:55.448535 7f9fcabdd700 20 osd.0 12560 sched_scrub
> load_is_low=1
>    -11> 2018-08-07 14:10:55.448555 7f9fcabdd700 10 osd.0 12560 sched_scrub
> 15.112 scheduled at 2018-08-07 15:03:15.052952 > 2018-08-07 14:10:55.448494
>    -10> 2018-08-07 14:10:55.448563 7f9fcabdd700 20 osd.0 12560 sched_scrub
> done
>     -9> 2018-08-07 14:10:55.448565 7f9fcabdd700 10 osd.0 12559
> promote_throttle_recalibrate 0 attempts, promoted 0 objects and 0 bytes;
> target 25 obj/sec or 5120 k bytes/sec
>     -8> 2018-08-07 14:10:55.448568 7f9fcabdd700 20 osd.0 12559
> promote_throttle_recalibrate  new_prob 1000
>     -7> 2018-08-07 14:10:55.448569 7f9fcabdd700 10 osd.0 12559
> promote_throttle_recalibrate  actual 0, actual/prob ratio 1, adjusted
> new_prob 1000, prob 1000 -> 1000
>     -6> 2018-08-07 14:10:55.507159 7f9faab9d700 20 osd.0 op_wq(5) _process
> empty q, waiting
>     -5> 2018-08-07 14:10:55.812434 7f9fb5bb3700 20 osd.0 op_wq(7) _process
> empty q, waiting
>     -4> 2018-08-07 14:10:56.236584 7f9fcd42e700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f9fa7396700' had timed out after 60
>     -3> 2018-08-07 14:10:56.236618 7f9fcd42e700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f9fb33ae700' had timed out after 60
>     -2> 2018-08-07 14:10:56.236621 7f9fcd42e700  1 heartbeat_map is_healthy
> 'OSD::peering_tp thread 0x7f9fba3bc700' had timed out after 15
>     -1> 2018-08-07 14:10:56.236640 7f9fcd42e700  1 heartbeat_map is_healthy
> 'OSD::peering_tp thread 0x7f9fba3bc700' had suicide timed out after 150
>      0> 2018-08-07 14:10:56.245420 7f9fba3bc700 -1 *** Caught signal
> (Aborted) **
>  in thread 7f9fba3bc700 thread_name:tp_peering
>
> The OSD cyclically crashes and comes back up. I tried modifying the recovery
> and related timeouts, but no luck - the situation is still the same. As for
> the radosgw, across all nodes, after starting the rgw process, I only get this:
>
> 2018-08-07 14:32:17.852785 7f482dcaf700  2
> RGWDataChangesLog::ChangesRenewThread: start
>
> I found this thread on the ceph mailing list
> (http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-June/018956.html)
> but I'm not sure it's the same issue (albeit it's the same error), as
> I don't use S3 ACLs/expiration in my cluster (if it's set to a default, I'm
> not aware of it).
>
>
>
>
> On 06/08/18 16:30, Josef Zelenka wrote:
>>
>> Hi,
>>
>> I'm running a cluster on Luminous (12.2.5), Ubuntu 16.04 - the configuration
>> is 3 nodes with 6 drives each (though I have encountered this on a different
>> cluster with similar hardware, only with HDDs instead of SSDs - same usage).
>> I have recently seen a bug(?) where one of the OSDs suddenly spikes in IOPS
>> and constantly restarts (apparently trying to load the journal/filemap),
>> which renders the radosgw (the primary use of this cluster) unable to write.
>> The only thing that helps is stopping the OSD, but only until another one
>> does the same thing. Any clue as to the cause? Logs of the OSD when it
>> crashes are below. Thanks
>>
>> Josef
>>
>>  -9920> 2018-08-06 12:12:10.588227 7f8e7afcb700  1 heartbeat_map
>> is_healthy 'OSD::osd_op_tp thread 0x7f8e56f9a700' had timed out after 60
>>  -9919> 2018-08-06 12:12:10.607070 7f8e7a7ca700  1 heartbeat_map
>> is_healthy 'OSD::osd_op_tp thread 0x7f8e56f9a700' had timed out after 60
>> --
>>     -1> 2018-08-06 14:12:52.428994 7f8e7982b700  1 heartbeat_map
>> is_healthy 'OSD::osd_op_tp thread 0x7f8e56f9a700' had suicide timed out
>> after 150
>>      0> 2018-08-06 14:12:52.432088 7f8e56f9a700 -1 *** Caught signal
>> (Aborted) **
>>  in thread 7f8e56f9a700 thread_name:tp_osd_tp
>>
>>  ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous
>> (stable)
>>  1: (()+0xa7cab4) [0x55868269aab4]
>>  2: (()+0x11390) [0x7f8e7e51d390]
>>  3: (()+0x1026d) [0x7f8e7e51c26d]
>>  4: (pthread_mutex_lock()+0x7d) [0x7f8e7e515dbd]
>>  5: (Mutex::Lock(bool)+0x49) [0x5586826bb899]
>>  6: (PG::lock(bool) const+0x33) [0x55868216ace3]
>>  7: (OSD::ShardedOpWQ::_process(unsigned int,
>> ceph::heartbeat_handle_d*)+0x844) [0x558682101044]
>>  8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x884)
>> [0x5586826e27f4]
>>  9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5586826e5830]
>>  10: (()+0x76ba) [0x7f8e7e5136ba]
>>  11: (clone()+0x6d) [0x7f8e7d58a41d]
>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
>> to interpret this.
>>
>> --- logging levels ---
>>    0/ 5 none
>>    0/ 0 lockdep
>>    0/ 0 context
>>    0/ 0 crush
>>    1/ 5 mds
>>    1/ 5 mds_balancer
>>    1/ 5 mds_locker
>>    1/ 5 mds_log
>>    1/ 5 mds_log_expire
>>    1/ 5 mds_migrator
>>    0/ 0 buffer
>>    0/ 0 timer
>>    0/ 0 filer
>>    0/ 1 striper
>>    0/ 0 objecter
>>    0/ 0 rados
>>    0/ 0 rbd
>>    0/ 5 rbd_mirror
>>    0/ 5 rbd_replay
>>    0/ 0 journaler
>>    0/ 0 objectcacher
>>    0/ 0 client
>>    0/ 0 osd
>>    0/ 0 optracker
>>    0/ 0 objclass
>>    0/ 0 filestore
>>    0/ 0 journal
>>    0/ 0 ms
>>    0/ 0 mon
>>    0/ 0 monc
>>    0/ 0 paxos
>>    0/ 0 tp
>>    0/ 0 auth
>>    1/ 5 crypto
>>    0/ 0 finisher
>>    1/ 1 reserver
>>    1/ 5 heartbeatmap
>>    0/ 0 perfcounter
>>    0/ 0 rgw
>>    1/10 civetweb
>>    1/ 5 javaclient
>>    0/ 0 asok
>>    0/ 0 throttle
>>    0/ 0 refs
>>    1/ 5 xio
>>    1/ 5 compressor
>>    1/ 5 bluestore
>>    1/ 5 bluefs
>>    1/ 3 bdev
>>    1/ 5 kstore
>>    4/ 5 rocksdb
>>    4/ 5 leveldb
>>    4/ 5 memdb
>>    1/ 5 kinetic
>>    1/ 5 fuse
>>    1/ 5 mgr
>>    1/ 5 mgrc
>>    1/ 5 dpdk
>>    1/ 5 eventtrace
>>   -2/-2 (syslog threshold)
>>   -1/-1 (stderr threshold)
>>   max_recent     10000
>>   max_new         1000
>>   log_file /var/log/ceph/ceph-osd.7.log
>> --- end dump of recent events ---
>>
>
>



-- 
Cheers,
Brad
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


