Hello,
this seems to happen since:
85574a3
Stefan
On 05.12.2012 23:25, Stefan Priebe wrote:
Hello,
I have now had 8 OSDs fail again with the same error.
0> 2012-12-05 23:10:41.213149 7f7fad109700 -1
os/JournalingObjectStore.cc: In function 'uint64_t
JournalingObjectStore::ApplyManager::op_apply_start(uint64_t)' thread
7f7fad109700 time 2012-12-05 23:10:41.212454
os/JournalingObjectStore.cc: 134: FAILED assert(op > committed_seq)
ceph version 0.55-142-g22f794d (22f794da074dd1b3221c484a5ae05b2ff1bd0fa4)
1: (JournalingObjectStore::ApplyManager::op_apply_start(unsigned
long)+0x816) [0x747626]
2: (FileStore::_do_op(FileStore::OpSequencer*)+0x52) [0x703c22]
3: (ThreadPool::worker(ThreadPool::WorkThread*)+0x82b) [0x82f81b]
4: (ThreadPool::WorkThread::entry()+0x10) [0x832000]
5: (()+0x68ca) [0x7f7fc17a78ca]
6: (clone()+0x6d) [0x7f7fbfc16bfd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 0 lockdep
0/ 0 context
0/ 0 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 0 buffer
0/ 0 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 0 journaler
0/ 5 objectcacher
0/ 5 client
0/ 0 osd
0/ 0 optracker
0/ 0 objclass
0/ 0 filestore
0/ 0 journal
0/ 0 ms
1/ 5 mon
0/ 0 monc
0/ 5 paxos
0/ 0 tp
0/ 0 auth
1/ 5 crypto
0/ 0 finisher
0/ 0 heartbeatmap
0/ 0 perfcounter
1/ 5 rgw
1/ 5 hadoop
1/ 5 javaclient
0/ 0 asok
0/ 0 throttle
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 100000
max_new 1000
log_file /var/log/ceph/ceph-osd.13.log
--- end dump of recent events ---
2012-12-05 23:10:41.216011 7f7fad109700 -1 *** Caught signal (Aborted) **
in thread 7f7fad109700
ceph version 0.55-142-g22f794d (22f794da074dd1b3221c484a5ae05b2ff1bd0fa4)
1: /usr/bin/ceph-osd() [0x797bd9]
2: (()+0xeff0) [0x7f7fc17afff0]
3: (gsignal()+0x35) [0x7f7fbfb79215]
4: (abort()+0x180) [0x7f7fbfb7c020]
5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f7fc040ddc5]
6: (()+0xcb166) [0x7f7fc040c166]
7: (()+0xcb193) [0x7f7fc040c193]
8: (()+0xcb28e) [0x7f7fc040c28e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x7c9) [0x7fb939]
10: (JournalingObjectStore::ApplyManager::op_apply_start(unsigned
long)+0x816) [0x747626]
11: (FileStore::_do_op(FileStore::OpSequencer*)+0x52) [0x703c22]
12: (ThreadPool::worker(ThreadPool::WorkThread*)+0x82b) [0x82f81b]
13: (ThreadPool::WorkThread::entry()+0x10) [0x832000]
14: (()+0x68ca) [0x7f7fc17a78ca]
15: (clone()+0x6d) [0x7f7fbfc16bfd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
--- begin dump of recent events ---
0> 2012-12-05 23:10:41.216011 7f7fad109700 -1 *** Caught signal
(Aborted) **
in thread 7f7fad109700
ceph version 0.55-142-g22f794d (22f794da074dd1b3221c484a5ae05b2ff1bd0fa4)
1: /usr/bin/ceph-osd() [0x797bd9]
2: (()+0xeff0) [0x7f7fc17afff0]
3: (gsignal()+0x35) [0x7f7fbfb79215]
4: (abort()+0x180) [0x7f7fbfb7c020]
5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f7fc040ddc5]
6: (()+0xcb166) [0x7f7fc040c166]
7: (()+0xcb193) [0x7f7fc040c193]
8: (()+0xcb28e) [0x7f7fc040c28e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x7c9) [0x7fb939]
10: (JournalingObjectStore::ApplyManager::op_apply_start(unsigned
long)+0x816) [0x747626]
11: (FileStore::_do_op(FileStore::OpSequencer*)+0x52) [0x703c22]
12: (ThreadPool::worker(ThreadPool::WorkThread*)+0x82b) [0x82f81b]
13: (ThreadPool::WorkThread::entry()+0x10) [0x832000]
14: (()+0x68ca) [0x7f7fc17a78ca]
15: (clone()+0x6d) [0x7f7fbfc16bfd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 0 lockdep
0/ 0 context
0/ 0 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 0 buffer
0/ 0 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 0 journaler
0/ 5 objectcacher
0/ 5 client
0/ 0 osd
0/ 0 optracker
0/ 0 objclass
0/ 0 filestore
0/ 0 journal
0/ 0 ms
1/ 5 mon
0/ 0 monc
0/ 5 paxos
0/ 0 tp
0/ 0 auth
1/ 5 crypto
0/ 0 finisher
0/ 0 heartbeatmap
0/ 0 perfcounter
1/ 5 rgw
1/ 5 hadoop
1/ 5 javaclient
0/ 0 asok
0/ 0 throttle
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 100000
max_new 1000
log_file /var/log/ceph/ceph-osd.13.log
--- end dump of recent events ---
Stefan
On 05.12.2012 17:05, Stefan Priebe - Profihost AG wrote:
There was a dump in the attached log.
Stefan
On 05.12.2012 at 15:41, Sage Weil <sage@xxxxxxxxxxx> wrote:
On Wed, 5 Dec 2012, Stefan Priebe - Profihost AG wrote:
Hello list,
I updated to the latest next branch today, and after 20 minutes an OSD
crashed in os/JournalingObjectStore.cc.
The log is attached.
Hmm, this is perplexing. It might just be a bad assert, but I can't see
how it could happen. Any chance you can reproduce with
debug journal = 0/10
in the [osd] section? That will give us a dump if it fails the assert.
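For reference, that setting would go into ceph.conf roughly like this (a sketch; adjust to your own config layout):

```ini
[osd]
    ; memory log at level 10, output log at level 0;
    ; the in-memory buffer is dumped when the assert fires
    debug journal = 0/10
```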
Thanks!
s
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html