One OSD fails (slow requests, high cpu, termination)

Hi,

I just noticed strange behavior on one OSD (and only one; the other OSDs on the same server did not show it) in a Ceph cluster (everything runs 0.94.2 on Debian 7 with a self-built 4.1 kernel).
The OSD started to accumulate slow requests, and a restart did not help.

After a few seconds the log is filled with lines like these:
   -91> 2015-07-20 21:55:03.537385 7f9e20ec3700  0 -- [<OwnIPv6>]:6814/1376041 >> [<OwnIPv6>]:0/2078381 pipe(0x5396f000 sd=16371 :6814 s=0 pgs=0 cs=0 l=1 c=0x538e7340).accept replacing existing (lossy) channel (new one lossy=1)
(Full example after startup https://paste.ee/p/HfTlp )

At the same time, the OSD process runs at nearly 100% CPU usage.
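
To put a number on how often these lines show up, a small script along the following lines could count the "accept replacing existing" messages per peer and report the highest socket descriptor (the `sd=` field) seen. This is only a sketch: the regular expression is derived from the single log line quoted above, and the function name `summarize_accepts` is made up for illustration.

```python
import re
from collections import Counter

# Pattern based on the messenger line quoted above, e.g.
#   ... >> [<peer>]:0/2078381 pipe(0x5396f000 sd=16371 :6814 ...).accept replacing existing ...
# Assumption: other 0.94.x messenger lines of this kind follow the same layout.
PIPE_LINE = re.compile(
    r">> (?P<peer>\S+) pipe\(\S+ sd=(?P<sd>\d+) .*accept replacing existing"
)


def summarize_accepts(lines):
    """Return (Counter of peer -> number of matching lines, highest sd value seen)."""
    peers = Counter()
    max_sd = 0
    for line in lines:
        match = PIPE_LINE.search(line)
        if match is None:
            continue  # not an "accept replacing existing" line
        peers[match.group("peer")] += 1
        max_sd = max(max_sd, int(match.group("sd")))
    return peers, max_sd
```

It can be fed an open log file directly, for example `summarize_accepts(open("/var/log/ceph/ceph-osd.9.log"))`. A single peer dominating the count, together with a steadily growing `sd` value, would suggest that one client keeps reconnecting.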

After some time the slow requests accumulate, so I restart the OSD. If I wait longer instead, the OSD eventually terminates (longer version: https://paste.ee/p/XvD0o ):

    -6> 2015-07-20 21:55:03.729709 7f9e1681d700  0 -- [<OwnIPv6>]:6814/1376041 >> [<OwnIPv6>]:0/2078381 pipe(0x53d5a000 sd=16454 :6814 s=0 pgs=0 cs=0 l=1 c=0x53cf7600).accept replacing existing (lossy) channel (new one lossy=1)
    -5> 2015-07-20 21:55:03.737393 7fa637a5c700 -1 osd.9 31469 heartbeat_check: no reply from osd.32 since back 2015-07-20 21:53:08.918692 front 2015-07-20 21:53:56.149747 (cutoff 2015-07-20 21:54:43.737387)
    -4> 2015-07-20 21:55:03.737433 7fa637a5c700 -1 osd.9 31469 heartbeat_check: no reply from osd.33 since back 2015-07-20 21:54:34.759924 front 2015-07-20 21:53:46.235158 (cutoff 2015-07-20 21:54:43.737387)
    -3> 2015-07-20 21:55:03.737443 7fa637a5c700 -1 osd.9 31469 heartbeat_check: no reply from osd.35 since back 2015-07-20 21:54:20.657821 front 2015-07-20 21:54:20.657821 (cutoff 2015-07-20 21:54:43.737387)
    -2> 2015-07-20 21:55:03.737689 7fa637a5c700  0 log_channel(cluster) log [WRN] : 80 slow requests, 1 included below; oldest blocked for > 79.872208 secs
    -1> 2015-07-20 21:55:03.737700 7fa637a5c700  0 log_channel(cluster) log [WRN] : slow request 36.802253 seconds old, received at 2015-07-20 21:54:26.935372: osd_op(client.1024363.0:6627934 rbd_header.79e8074b0dc51 [watch reconnect cookie 94636449644720 gen 97842] 1.e7fceb98 ondisk+write+known_if_redirected e31467) currently no flag points reached
     0> 2015-07-20 21:55:03.744057 7fa628898700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7fa628898700 time 2015-07-20 21:55:03.730601
common/Thread.cc: 129: FAILED assert(ret == 0)

 ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x72) [0xcdb572]
 2: /usr/bin/ceph-osd() [0xcc236f]
 3: (SimpleMessenger::add_accept_pipe(int)+0x6f) [0xcb903f]
 4: (Accepter::entry()+0x342) [0xd71b22]
 5: (()+0x6b50) [0x7fa63fa88b50]
 6: (clone()+0x6d) [0x7fa63e4a495d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 0 lockdep
   0/ 0 context
   0/ 0 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 0 buffer
   0/ 0 timer
   0/ 0 filer
   0/ 1 striper
   0/ 0 objecter
   0/ 0 rados
   0/ 0 rbd
   0/ 5 rbd_replay
   0/ 0 journaler
   0/ 5 objectcacher
   0/ 0 client
   0/ 0 osd
   0/ 0 optracker
   0/ 0 objclass
   0/ 0 filestore
   1/ 3 keyvaluestore
   0/ 0 journal
   0/ 0 ms
   0/ 0 mon
   0/ 0 monc
   0/ 0 paxos
   0/ 0 tp
   0/ 0 auth
   1/ 5 crypto
   0/ 0 finisher
   0/ 0 heartbeatmap
   0/ 0 perfcounter
   0/ 0 rgw
   1/10 civetweb
   1/ 5 javaclient
   0/ 0 asok
   0/ 0 throttle
   0/ 0 refs
   1/ 5 xio
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.9.log
--- end dump of recent events ---
2015-07-20 21:55:03.940097 7fa628898700 -1 *** Caught signal (Aborted) **
 in thread 7fa628898700

 ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
 1: /usr/bin/ceph-osd() [0xbef08c]
 2: (()+0xf0a0) [0x7fa63fa910a0]
 3: (gsignal()+0x35) [0x7fa63e3fb165]
 4: (abort()+0x180) [0x7fa63e3fe3e0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fa63ec5189d]
 6: (()+0x63996) [0x7fa63ec4f996]
 7: (()+0x639c3) [0x7fa63ec4f9c3]
 8: (()+0x63bee) [0x7fa63ec4fbee]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x220) [0xcdb720]
 10: /usr/bin/ceph-osd() [0xcc236f]
 11: (SimpleMessenger::add_accept_pipe(int)+0x6f) [0xcb903f]
 12: (Accepter::entry()+0x342) [0xd71b22]
 13: (()+0x6b50) [0x7fa63fa88b50]
 14: (clone()+0x6d) [0x7fa63e4a495d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
     0> 2015-07-20 21:55:03.940097 7fa628898700 -1 *** Caught signal (Aborted) **
 in thread 7fa628898700

 ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
 1: /usr/bin/ceph-osd() [0xbef08c]
 2: (()+0xf0a0) [0x7fa63fa910a0]
 3: (gsignal()+0x35) [0x7fa63e3fb165]
 4: (abort()+0x180) [0x7fa63e3fe3e0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fa63ec5189d]
 6: (()+0x63996) [0x7fa63ec4f996]
 7: (()+0x639c3) [0x7fa63ec4f9c3]
 8: (()+0x63bee) [0x7fa63ec4fbee]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x220) [0xcdb720]
 10: /usr/bin/ceph-osd() [0xcc236f]
 11: (SimpleMessenger::add_accept_pipe(int)+0x6f) [0xcb903f]
 12: (Accepter::entry()+0x342) [0xd71b22]
 13: (()+0x6b50) [0x7fa63fa88b50]
 14: (clone()+0x6d) [0x7fa63e4a495d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 0 lockdep
   0/ 0 context
   0/ 0 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 0 buffer
   0/ 0 timer
   0/ 0 filer
   0/ 1 striper
   0/ 0 objecter
   0/ 0 rados
   0/ 0 rbd
   0/ 5 rbd_replay
   0/ 0 journaler
   0/ 5 objectcacher
   0/ 0 client
   0/ 0 osd
   0/ 0 optracker
   0/ 0 objclass
   0/ 0 filestore
   1/ 3 keyvaluestore
   0/ 0 journal
   0/ 0 ms
   0/ 0 mon
   0/ 0 monc
   0/ 0 paxos
   0/ 0 tp
   0/ 0 auth
   1/ 5 crypto
   0/ 0 finisher
   0/ 0 heartbeatmap
   0/ 0 perfcounter
   0/ 0 rgw
   1/10 civetweb
   1/ 5 javaclient
   0/ 0 asok
   0/ 0 throttle
   0/ 0 refs
   1/ 5 xio
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.9.log
--- end dump of recent events ---
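
One thing that could be checked while the OSD is in this state: the failed assert sits in Thread::create, reached from SimpleMessenger::add_accept_pipe, and the socket descriptors in the log are above 16000. If the assert means that the OSD could not start another thread (this is an assumption, not something the log states), it would be useful to compare the thread and file descriptor counts of the process with the kernel limits. A rough sketch that reads these values from /proc; the helper names and the `proc` parameter are made up for illustration:

```python
import os


def read_int(path):
    """Read the first whitespace-separated integer from a /proc-style file."""
    with open(path) as handle:
        return int(handle.read().split()[0])


def resource_snapshot(pid, proc="/proc"):
    """Collect thread and fd counts for one process plus the kernel-wide limits.

    `proc` can point at a different directory tree for testing.
    """
    base = os.path.join(proc, str(pid))
    return {
        # one directory entry per thread of the process
        "threads": len(os.listdir(os.path.join(base, "task"))),
        # one directory entry per open file descriptor
        "open_fds": len(os.listdir(os.path.join(base, "fd"))),
        # system-wide limits that apply to thread creation
        "threads_max": read_int(os.path.join(proc, "sys/kernel/threads-max")),
        "pid_max": read_int(os.path.join(proc, "sys/kernel/pid_max")),
    }
```

Calling `resource_snapshot(<pid of the ceph-osd process>)` as root a few times while the log fills up would show whether the thread count is approaching one of the limits. Per-user limits (`ulimit -u`, `ulimit -n`) are not covered by this sketch.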

Any ideas how to fix this? Or should I stop the OSD, format the drive, and create a new OSD?


Greetings,

Johannes

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



