OSD had suicide timed out

Hi,

I'm running a cluster on Luminous (12.2.5) on Ubuntu 16.04. The configuration is 3 nodes with 6 drives each (I have also encountered this on a different cluster with similar hardware and the same usage, only with HDDs instead of SSDs). I have recently seen a bug(?) where one of the OSDs suddenly spikes in IOPS and constantly restarts (apparently trying to load the journal/filemap), which leaves radosgw (the primary use of this cluster) unable to write. The only thing that helps is stopping the affected OSD, but only until another OSD does the same thing. Any clue as to the cause? Logs of the OSD when it crashes are below. Thanks
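
In case it helps, this is roughly what I do when it happens: stop the affected OSD (osd.7 in the log below), and as a guess I have been tempted to raise the op-thread suicide timeout so the daemon stays up long enough to inspect. I'm not sure the timeout is the right knob here, so treat this as a sketch rather than a fix:

  # stop the flapping OSD so radosgw can make progress again
  systemctl stop ceph-osd@7

  # untested guess: raise osd_op_thread_suicide_timeout (default 150 s, which
  # matches the "suicide timed out after 150" message below) to buy time for
  # investigation; this may only take effect after an OSD restart
  ceph tell osd.7 injectargs '--osd_op_thread_suicide_timeout 300'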

Josef

 -9920> 2018-08-06 12:12:10.588227 7f8e7afcb700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f8e56f9a700' had timed out after 60
 -9919> 2018-08-06 12:12:10.607070 7f8e7a7ca700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f8e56f9a700' had timed out after 60
--
    -1> 2018-08-06 14:12:52.428994 7f8e7982b700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f8e56f9a700' had suicide timed out after 150
     0> 2018-08-06 14:12:52.432088 7f8e56f9a700 -1 *** Caught signal (Aborted) **
 in thread 7f8e56f9a700 thread_name:tp_osd_tp

 ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable)
 1: (()+0xa7cab4) [0x55868269aab4]
 2: (()+0x11390) [0x7f8e7e51d390]
 3: (()+0x1026d) [0x7f8e7e51c26d]
 4: (pthread_mutex_lock()+0x7d) [0x7f8e7e515dbd]
 5: (Mutex::Lock(bool)+0x49) [0x5586826bb899]
 6: (PG::lock(bool) const+0x33) [0x55868216ace3]
 7: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x844) [0x558682101044]
 8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x884) [0x5586826e27f4]
 9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5586826e5830]
 10: (()+0x76ba) [0x7f8e7e5136ba]
 11: (clone()+0x6d) [0x7f8e7d58a41d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 0 lockdep
   0/ 0 context
   0/ 0 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 0 buffer
   0/ 0 timer
   0/ 0 filer
   0/ 1 striper
   0/ 0 objecter
   0/ 0 rados
   0/ 0 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 0 journaler
   0/ 0 objectcacher
   0/ 0 client
   0/ 0 osd
   0/ 0 optracker
   0/ 0 objclass
   0/ 0 filestore
   0/ 0 journal
   0/ 0 ms
   0/ 0 mon
   0/ 0 monc
   0/ 0 paxos
   0/ 0 tp
   0/ 0 auth
   1/ 5 crypto
   0/ 0 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   0/ 0 perfcounter
   0/ 0 rgw
   1/10 civetweb
   1/ 5 javaclient
   0/ 0 asok
   0/ 0 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 kinetic
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.7.log
--- end dump of recent events ---

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



