I don't believe you can re-add an OSD after `ceph osd rm`, but it's worth a
shot. Let me see what I can do on my dev cluster.

What do `ceph osd dump` and `ceph osd tree` say? I want to make sure I'm
starting from the same point you are.

On Wed, Jul 16, 2014 at 7:39 PM, hjcho616 <hjcho616 at yahoo.com> wrote:

> I did a "ceph osd rm" for all three but I didn't do anything else to them
> afterwards. Can they be added back?
>
> Regards,
> Hong
>
> On Wednesday, July 16, 2014 6:54 PM, Craig Lewis
> <clewis at centraldesktop.com> wrote:
>
> For some reason you ended up in my spam folder. That might be why you
> didn't get any responses.
>
> Have you destroyed osd.0, osd.1, and osd.2? If not, try bringing them up
> one at a time. You might have just one bad disk, which is much better than
> 50% of your disks.
>
> How is the ceph-osd process behaving when it hits the suicide timeout? I
> had some problems a while back where the ceph-osd process would start up,
> start consuming ~200% CPU for a while, then get stuck using almost exactly
> 100% CPU. It would get kicked out of the cluster for being unresponsive,
> then suicide. Repeat. If that's happening here, I can suggest some things
> to try.
>
> On Fri, Jul 11, 2014 at 9:12 PM, hjcho616 <hjcho616 at yahoo.com> wrote:
>
> I have 2 OSD machines with 3 OSDs running on each. One MDS server with 3
> daemons running. Ran cephfs mostly on 0.78. One night we lost power for a
> split second. MDS1 and OSD2 went down; OSD1 seemed OK, but it turns out
> OSD1 suffered most. Those two machines rebooted and seemed OK except for
> some inconsistencies. I waited for a while, but it didn't fix itself. So I
> issued 'ceph pg repair pgnum'. It would try some and some OSD would crash.
> Tried this for multiple days. Got some PGs fixed... but mostly it would
> crash an OSD and stop recovering. dmesg shows something like below.
>
> [ 740.059498] traps: ceph-osd[5279] general protection ip:7f84e75ec75e sp:7fff00045bc0 error:0 in libtcmalloc.so.4.1.0[7f84e75b3000+4a000]
>
> and ceph osd log shows something like this.
>
> -2> 2014-07-09 20:51:01.163571 7fe0f4617700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fe0e8e91700' had timed out after 60
> -1> 2014-07-09 20:51:01.163609 7fe0f4617700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fe0e8e91700' had suicide timed out after 180
> 0> 2014-07-09 20:51:01.169542 7fe0f4617700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7fe0f4617700 time 2014-07-09 20:51:01.163642
> common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
>
> ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
> 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2eb) [0xad2cbb]
> 2: (ceph::HeartbeatMap::is_healthy()+0xb6) [0xad34c6]
> 3: (ceph::HeartbeatMap::check_touch_file()+0x28) [0xad3aa8]
> 4: (CephContextServiceThread::entry()+0x13f) [0xb9911f]
> 5: (()+0x8062) [0x7fe0f797e062]
> 6: (clone()+0x6d) [0x7fe0f62bea3d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> > --- logging levels --- > 0/ 5 none > 0/ 1 lockdep > 0/ 1 context > 1/ 1 crush > 1/ 5 mds > 1/ 5 mds_balancer > 1/ 5 mds_locker > 1/ 5 mds_log > 1/ 5 mds_log_expire > 1/ 5 mds_migrator > 0/ 1 buffer > 0/ 1 timer > 0/ 1 filer > 0/ 1 striper > 0/ 1 objecter > 0/ 5 rados > 0/ 5 rbd > 0/ 5 journaler > 0/ 5 objectcacher > 0/ 5 client > 0/ 5 osd > 0/ 5 optracker > 0/ 5 objclass > 1/ 3 filestore > 1/ 3 keyvaluestore > 1/ 3 journal > 0/ 5 ms > 1/ 5 mon > 0/10 monc > 1/ 5 paxos > 0/ 5 tp > 1/ 5 auth > 1/ 5 crypto > 1/ 1 finisher > 1/ 5 heartbeatmap > 1/ 5 perfcounter > 1/ 5 rgw > 1/ 5 javaclient > 1/ 5 asok > 1/ 1 throttle > -2/-2 (syslog threshold) > -1/-1 (stderr threshold) > max_recent 10000 > max_new 1000 > log_file /var/log/ceph/ceph-osd.0.log > --- end dump of recent events --- > 2014-07-09 20:51:01.534706 7fe0f4617700 -1 *** Caught signal (Aborted) ** > in thread 7fe0f4617700 > > > ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e) > 1: /usr/bin/ceph-osd() [0xaac562] > 2: (()+0xf880) [0x7fe0f7985880] > 3: (gsignal()+0x39) [0x7fe0f620e3a9] > 4: (abort()+0x148) [0x7fe0f62114c8] > 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fe0f6afb5e5] > 6: (()+0x5e746) [0x7fe0f6af9746] > 7: (()+0x5e773) [0x7fe0f6af9773] > 8: (()+0x5e9b2) [0x7fe0f6af99b2] > 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char > const*)+0x40a) [0xb85b6a] > 10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, > long)+0x2eb) [0xad2cbb] > 11: (ceph::HeartbeatMap::is_healthy()+0xb6) [0xad34c6] > 12: (ceph::HeartbeatMap::check_touch_file()+0x28) [0xad3aa8] > 13: (CephContextServiceThread::entry()+0x13f) [0xb9911f] > 14: (()+0x8062) [0x7fe0f797e062] > 15: (clone()+0x6d) [0x7fe0f62bea3d] > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed > to interpret this. 
> > --- begin dump of recent events --- > 0> 2014-07-09 20:51:01.534706 7fe0f4617700 -1 *** Caught signal > (Aborted) ** > in thread 7fe0f4617700 > > ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e) > 1: /usr/bin/ceph-osd() [0xaac562] > 2: (()+0xf880) [0x7fe0f7985880] > 3: (gsignal()+0x39) [0x7fe0f620e3a9] > 4: (abort()+0x148) [0x7fe0f62114c8] > 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fe0f6afb5e5] > 6: (()+0x5e746) [0x7fe0f6af9746] > 7: (()+0x5e773) [0x7fe0f6af9773] > 8: (()+0x5e9b2) [0x7fe0f6af99b2] > 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char > const*)+0x40a) [0xb85b6a] > 10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, > long)+0x2eb) [0xad2cbb] > 11: (ceph::HeartbeatMap::is_healthy()+0xb6) [0xad34c6] > 12: (ceph::HeartbeatMap::check_touch_file()+0x28) [0xad3aa8] > 13: (CephContextServiceThread::entry()+0x13f) [0xb9911f] > 14: (()+0x8062) [0x7fe0f797e062] > 15: (clone()+0x6d) [0x7fe0f62bea3d] > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed > to interpret this. > > --- logging levels --- > 0/ 5 none > 0/ 1 lockdep > 0/ 1 context > 1/ 1 crush > 1/ 5 mds > 1/ 5 mds_balancer > 1/ 5 mds_locker > 1/ 5 mds_log > 1/ 5 mds_log_expire > 1/ 5 mds_migrator > 0/ 1 buffer > 0/ 1 timer > 0/ 1 filer > 0/ 1 striper > 0/ 1 objecter > 0/ 5 rados > 0/ 5 rbd > 0/ 5 journaler > 0/ 5 objectcacher > 0/ 5 client > 0/ 5 osd > 0/ 5 optracker > 0/ 5 objclass > 1/ 3 filestore > 1/ 3 keyvaluestore > 1/ 3 journal > 0/ 5 ms > 1/ 5 mon > 0/10 monc > 1/ 5 paxos > 0/ 5 tp > 1/ 5 auth > 1/ 5 crypto > 1/ 1 finisher > 1/ 5 heartbeatmap > 1/ 5 perfcounter > 1/ 5 rgw > 1/ 5 javaclient > 1/ 5 asok > 1/ 1 throttle > -2/-2 (syslog threshold) > -1/-1 (stderr threshold) > max_recent 10000 > max_new 1000 > log_file /var/log/ceph/ceph-osd.0.log > --- end dump of recent events --- > > After several attempts at it, osd.2 (which was on OSD1 which survived the > power event) never comes up. 
> Looks like the journal was corrupted.
>
> -1> 2014-07-09 20:44:14.992840 7f12256b67c0 -1 journal Unable to read past sequence 2157634 but header indicates the journal has committed up through 2157670, journal is corrupt
> 0> 2014-07-09 20:44:14.998742 7f12256b67c0 -1 os/FileJournal.cc: In function 'bool FileJournal::read_entry(ceph::bufferlist&, uint64_t&, bool*)' thread 7f12256b67c0 time 2014-07-09 20:44:14.993082
> os/FileJournal.cc: 1677: FAILED assert(0)
>
> ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
> 1: (FileJournal::read_entry(ceph::buffer::list&, unsigned long&, bool*)+0x467) [0xa8d497]
> 2: (JournalingObjectStore::journal_replay(unsigned long)+0x22e) [0x9dfebe]
> 3: (FileStore::mount()+0x32c9) [0x9b7939]
> 4: (OSD::do_convertfs(ObjectStore*)+0x1a) [0x78d8fa]
> 5: (main()+0x2237) [0x730837]
> 6: (__libc_start_main()+0xf5) [0x7f12236bdb45]
> 7: /usr/bin/ceph-osd() [0x734479]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> > --- logging levels --- > 0/ 5 none > 0/ 1 lockdep > 0/ 1 context > 1/ 1 crush > 1/ 5 mds > 1/ 5 mds_balancer > 1/ 5 mds_locker > 1/ 5 mds_log > 1/ 5 mds_log_expire > 1/ 5 mds_migrator > 0/ 1 buffer > 0/ 1 timer > 0/ 1 filer > 0/ 1 striper > 0/ 1 objecter > 0/ 5 rados > 0/ 5 rbd > 0/ 5 journaler > 0/ 5 objectcacher > 0/ 5 client > 0/ 5 osd > 0/ 5 optracker > 0/ 5 objclass > 1/ 3 filestore > 1/ 3 keyvaluestore > 1/ 3 journal > 0/ 5 ms > 1/ 5 mon > 0/10 monc > 1/ 5 paxos > 0/ 5 tp > 1/ 5 auth > 1/ 5 crypto > 1/ 1 finisher > 1/ 5 heartbeatmap > 1/ 5 perfcounter > 1/ 5 rgw > 1/ 5 javaclient > 1/ 5 asok > 1/ 1 throttle > -2/-2 (syslog threshold) > -1/-1 (stderr threshold) > max_recent 10000 > max_new 1000 > log_file /var/log/ceph/ceph-osd.2.log > --- end dump of recent events --- > 2014-07-09 20:44:15.010090 7f12256b67c0 -1 *** Caught signal (Aborted) ** > in thread 7f12256b67c0 > ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e) > 1: /usr/bin/ceph-osd() [0xaac562] > 2: (()+0xf880) [0x7f1224e48880] > 3: (gsignal()+0x39) [0x7f12236d13a9] > 4: (abort()+0x148) [0x7f12236d44c8] > 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f1223fbe5e5] > 6: (()+0x5e746) [0x7f1223fbc746] > 7: (()+0x5e773) [0x7f1223fbc773] > 8: (()+0x5e9b2) [0x7f1223fbc9b2] > 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char > const*)+0x40a) [0xb85b6a] > 10: (FileJournal::read_entry(ceph::buffer::list&, unsigned long&, > bool*)+0x467) [0xa8d497] > 11: (JournalingObjectStore::journal_replay(unsigned long)+0x22e) > [0x9dfebe] > 12: (FileStore::mount()+0x32c9) [0x9b7939] > 13: (OSD::do_convertfs(ObjectStore*)+0x1a) [0x78d8fa] > 14: (main()+0x2237) [0x730837] > 15: (__libc_start_main()+0xf5) [0x7f12236bdb45] > 16: /usr/bin/ceph-osd() [0x734479] > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed > to interpret this. 
> > --- begin dump of recent events --- > 0> 2014-07-09 20:44:15.010090 7f12256b67c0 -1 *** Caught signal > (Aborted) ** > in thread 7f12256b67c0 > > ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e) > 1: /usr/bin/ceph-osd() [0xaac562] > 2: (()+0xf880) [0x7f1224e48880] > 3: (gsignal()+0x39) [0x7f12236d13a9] > 4: (abort()+0x148) [0x7f12236d44c8] > 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f1223fbe5e5] > 6: (()+0x5e746) [0x7f1223fbc746] > 7: (()+0x5e773) [0x7f1223fbc773] > 8: (()+0x5e9b2) [0x7f1223fbc9b2] > 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char > const*)+0x40a) [0xb85b6a] > 10: (FileJournal::read_entry(ceph::buffer::list&, unsigned long&, > bool*)+0x467) [0xa8d497] > 11: (JournalingObjectStore::journal_replay(unsigned long)+0x22e) > [0x9dfebe] > 12: (FileStore::mount()+0x32c9) [0x9b7939] > 13: (OSD::do_convertfs(ObjectStore*)+0x1a) [0x78d8fa] > 14: (main()+0x2237) [0x730837] > 15: (__libc_start_main()+0xf5) [0x7f12236bdb45] > 16: /usr/bin/ceph-osd() [0x734479] > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed > to interpret this. 
>
> --- logging levels ---
> 0/ 5 none
> 0/ 1 lockdep
> 0/ 1 context
> 1/ 1 crush
> 1/ 5 mds
> 1/ 5 mds_balancer
> 1/ 5 mds_locker
> 1/ 5 mds_log
> 1/ 5 mds_log_expire
> 1/ 5 mds_migrator
> 0/ 1 buffer
> 0/ 1 timer
> 0/ 1 filer
> 0/ 1 striper
> 0/ 1 objecter
> 0/ 5 rados
> 0/ 5 rbd
> 0/ 5 journaler
> 0/ 5 objectcacher
> 0/ 5 client
> 0/ 5 osd
> 0/ 5 optracker
> 0/ 5 objclass
> 1/ 3 filestore
> 1/ 3 keyvaluestore
> 1/ 3 journal
> 0/ 5 ms
> 1/ 5 mon
> 0/10 monc
> 1/ 5 paxos
> 0/ 5 tp
> 1/ 5 auth
> 1/ 5 crypto
> 1/ 1 finisher
> 1/ 5 heartbeatmap
> 1/ 5 perfcounter
> 1/ 5 rgw
> 1/ 5 javaclient
> 1/ 5 asok
> 1/ 1 throttle
> -2/-2 (syslog threshold)
> -1/-1 (stderr threshold)
> max_recent 10000
> max_new 1000
> log_file /var/log/ceph/ceph-osd.2.log
> --- end dump of recent events ---
>
> So I thought maybe upgrading to 0.82 would give it a better chance at
> fixing things... so I did. Now not only do those OSDs fail (osd.1 is up but
> using only 14M of memory... I assume that's broken too), but MDS fails too.
>
> # /usr/bin/ceph-mds -i MDS1 --pid-file /var/run/ceph/mds.MDS1.pid -c /etc/ceph/ceph.conf --cluster ceph -f --debug-mds=20 --debug-journaler=10
> starting mds.MDS1 at :/0
> mds/MDLog.cc: In function 'void MDLog::_replay_thread()' thread 7f8e07c21700 time 2014-07-09 21:01:10.190965
> mds/MDLog.cc: 815: FAILED assert(journaler->is_readable())
> ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
> 1: (MDLog::_replay_thread()+0x197b) [0x85a3cb]
> 2: (MDLog::ReplayThread::entry()+0xd) [0x66466d]
> 3: (()+0x8062) [0x7f8e0fe3f062]
> 4: (clone()+0x6d) [0x7f8e0ebd3a3d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> 2014-07-09 21:01:10.192936 7f8e07c21700 -1 mds/MDLog.cc: In function 'void > MDLog::_replay_thread()' thread 7f8e07c21700 time 2014-07-09 21:01:10.190965 > mds/MDLog.cc: 815: FAILED assert(journaler->is_readable()) > > ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e) > 1: (MDLog::_replay_thread()+0x197b) [0x85a3cb] > 2: (MDLog::ReplayThread::entry()+0xd) [0x66466d] > 3: (()+0x8062) [0x7f8e0fe3f062] > 4: (clone()+0x6d) [0x7f8e0ebd3a3d] > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed > to interpret this. > > 0> 2014-07-09 21:01:10.192936 7f8e07c21700 -1 mds/MDLog.cc: In > function 'void MDLog::_replay_thread()' thread 7f8e07c21700 time 2014-07-09 > 21:01:10.190965 > mds/MDLog.cc: 815: FAILED assert(journaler->is_readable()) > > ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e) > 1: (MDLog::_replay_thread()+0x197b) [0x85a3cb] > 2: (MDLog::ReplayThread::entry()+0xd) [0x66466d] > 3: (()+0x8062) [0x7f8e0fe3f062] > 4: (clone()+0x6d) [0x7f8e0ebd3a3d] > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed > to interpret this. 
> > terminate called after throwing an instance of 'ceph::FailedAssertion' > *** Caught signal (Aborted) ** > in thread 7f8e07c21700 > ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e) > 1: /usr/bin/ceph-mds() [0x8d81f2] > 2: (()+0xf880) [0x7f8e0fe46880] > 3: (gsignal()+0x39) [0x7f8e0eb233a9] > 4: (abort()+0x148) [0x7f8e0eb264c8] > 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f8e0f4105e5] > 6: (()+0x5e746) [0x7f8e0f40e746] > 7: (()+0x5e773) [0x7f8e0f40e773] > 8: (()+0x5e9b2) [0x7f8e0f40e9b2] > 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char > const*)+0x40a) [0x9ab5da] > 10: (MDLog::_replay_thread()+0x197b) [0x85a3cb] > 11: (MDLog::ReplayThread::entry()+0xd) [0x66466d] > 12: (()+0x8062) [0x7f8e0fe3f062] > 13: (clone()+0x6d) [0x7f8e0ebd3a3d] > 2014-07-09 21:01:10.201968 7f8e07c21700 -1 *** Caught signal (Aborted) ** > in thread 7f8e07c21700 > > ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e) > 1: /usr/bin/ceph-mds() [0x8d81f2] > 2: (()+0xf880) [0x7f8e0fe46880] > 3: (gsignal()+0x39) [0x7f8e0eb233a9] > 4: (abort()+0x148) [0x7f8e0eb264c8] > 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f8e0f4105e5] > 6: (()+0x5e746) [0x7f8e0f40e746] > 7: (()+0x5e773) [0x7f8e0f40e773] > 8: (()+0x5e9b2) [0x7f8e0f40e9b2] > 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char > const*)+0x40a) [0x9ab5da] > 10: (MDLog::_replay_thread()+0x197b) [0x85a3cb] > 11: (MDLog::ReplayThread::entry()+0xd) [0x66466d] > 12: (()+0x8062) [0x7f8e0fe3f062] > 13: (clone()+0x6d) [0x7f8e0ebd3a3d] > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed > to interpret this. 
> > 0> 2014-07-09 21:01:10.201968 7f8e07c21700 -1 *** Caught signal > (Aborted) ** > in thread 7f8e07c21700 > > ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e) > 1: /usr/bin/ceph-mds() [0x8d81f2] > 2: (()+0xf880) [0x7f8e0fe46880] > 3: (gsignal()+0x39) [0x7f8e0eb233a9] > 4: (abort()+0x148) [0x7f8e0eb264c8] > 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f8e0f4105e5] > 6: (()+0x5e746) [0x7f8e0f40e746] > 7: (()+0x5e773) [0x7f8e0f40e773] > 8: (()+0x5e9b2) [0x7f8e0f40e9b2] > 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char > const*)+0x40a) [0x9ab5da] > 10: (MDLog::_replay_thread()+0x197b) [0x85a3cb] > 11: (MDLog::ReplayThread::entry()+0xd) [0x66466d] > 12: (()+0x8062) [0x7f8e0fe3f062] > 13: (clone()+0x6d) [0x7f8e0ebd3a3d] > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed > to interpret this. > > Aborted > root at MDS1:/var/log/ceph# /usr/bin/ceph-mds -i MDS1 --pid-file > /var/run/ceph/mds.MDS1.pid -c /etc/ceph/ceph.conf --cluster ceph -f > --debug-mds=20 --debug-journaler=10 > starting mds.MDS1 at :/0 > > > mds/MDLog.cc: In function 'void MDLog::_replay_thread()' thread > 7fb7f7b83700 time 2014-07-09 23:21:43.383304 > > mds/MDLog.cc: 815: FAILED assert(journaler->is_readable()) > > > ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e) > > > 1: (MDLog::_replay_thread()+0x197b) [0x85a3cb] > > > 2: (MDLog::ReplayThread::entry()+0xd) [0x66466d] > > > 3: (()+0x8062) [0x7fb7ffda1062] > > > 4: (clone()+0x6d) [0x7fb7feb35a3d] > > > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed > to interpret this. 
> > 2014-07-09 23:21:43.385274 7fb7f7b83700 -1 mds/MDLog.cc: In function 'void > MDLog::_replay_thread()' thread 7fb7f7b83700 time 2014-07-09 > 23:21:43.383304 > > mds/MDLog.cc: 815: FAILED assert(journaler->is_readable()) > > > > > > ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e) > > > 1: (MDLog::_replay_thread()+0x197b) [0x85a3cb] > > > 2: (MDLog::ReplayThread::entry()+0xd) [0x66466d] > > > 3: (()+0x8062) [0x7fb7ffda1062] > > > 4: (clone()+0x6d) [0x7fb7feb35a3d] > > > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed > to interpret this. > > > > > 0> 2014-07-09 23:21:43.385274 7fb7f7b83700 -1 mds/MDLog.cc: In > function 'void MDLog::_replay_thread()' thread 7fb7f7b83700 time 2014-07-09 > 23:21:43.383304 > mds/MDLog.cc: 815: FAILED assert(journaler->is_readable()) > > > > > > ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e) > > > 1: (MDLog::_replay_thread()+0x197b) [0x85a3cb] > > > 2: (MDLog::ReplayThread::entry()+0xd) [0x66466d] > > > 3: (()+0x8062) [0x7fb7ffda1062] > > > 4: (clone()+0x6d) [0x7fb7feb35a3d] > > > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed > to interpret this. 
> > > > > terminate called after throwing an instance of 'ceph::FailedAssertion' > > > *** Caught signal (Aborted) ** > in thread 7fb7f7b83700 > ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e) > 1: /usr/bin/ceph-mds() [0x8d81f2] > 2: (()+0xf880) [0x7fb7ffda8880] > 3: (gsignal()+0x39) [0x7fb7fea853a9] > 4: (abort()+0x148) [0x7fb7fea884c8] > 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fb7ff3725e5] > 6: (()+0x5e746) [0x7fb7ff370746] > 7: (()+0x5e773) [0x7fb7ff370773] > 8: (()+0x5e9b2) [0x7fb7ff3709b2] > 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char > const*)+0x40a) [0x9ab5da] > 10: (MDLog::_replay_thread()+0x197b) [0x85a3cb] > 11: (MDLog::ReplayThread::entry()+0xd) [0x66466d] > 12: (()+0x8062) [0x7fb7ffda1062] > 13: (clone()+0x6d) [0x7fb7feb35a3d] > 2014-07-09 23:21:43.394324 7fb7f7b83700 -1 *** Caught signal (Aborted) ** > in thread 7fb7f7b83700 > > ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e) > 1: /usr/bin/ceph-mds() [0x8d81f2] > 2: (()+0xf880) [0x7fb7ffda8880] > 3: (gsignal()+0x39) [0x7fb7fea853a9] > 4: (abort()+0x148) [0x7fb7fea884c8] > 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fb7ff3725e5] > 6: (()+0x5e746) [0x7fb7ff370746] > 7: (()+0x5e773) [0x7fb7ff370773] > 8: (()+0x5e9b2) [0x7fb7ff3709b2] > 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char > const*)+0x40a) [0x9ab5da] > 10: (MDLog::_replay_thread()+0x197b) [0x85a3cb] > 11: (MDLog::ReplayThread::entry()+0xd) [0x66466d] > 12: (()+0x8062) [0x7fb7ffda1062] > 13: (clone()+0x6d) [0x7fb7feb35a3d] > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed > to interpret this. 
>
> 0> 2014-07-09 23:21:43.394324 7fb7f7b83700 -1 *** Caught signal (Aborted) **
> in thread 7fb7f7b83700
>
> ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
> 1: /usr/bin/ceph-mds() [0x8d81f2]
> 2: (()+0xf880) [0x7fb7ffda8880]
> 3: (gsignal()+0x39) [0x7fb7fea853a9]
> 4: (abort()+0x148) [0x7fb7fea884c8]
> 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fb7ff3725e5]
> 6: (()+0x5e746) [0x7fb7ff370746]
> 7: (()+0x5e773) [0x7fb7ff370773]
> 8: (()+0x5e9b2) [0x7fb7ff3709b2]
> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x40a) [0x9ab5da]
> 10: (MDLog::_replay_thread()+0x197b) [0x85a3cb]
> 11: (MDLog::ReplayThread::entry()+0xd) [0x66466d]
> 12: (()+0x8062) [0x7fb7ffda1062]
> 13: (clone()+0x6d) [0x7fb7feb35a3d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> Aborted
>
> Felt like OSD1 was trashed so I removed osd.0, osd.1, and osd.2.
>
> Still seeing the status below, and I can't get MDS up.
>
> HEALTH_ERR 154 pgs degraded; 38 pgs inconsistent; 192 pgs stuck unclean; recovery 1374024/3513098 objects degraded (39.111%); 1374 scrub errors; mds cluster is degraded; mds MDS1 is laggy
>
> Is there something I can try to bring this file system up again? =P I
> would like to access some of that data again. Let me know if you need any
> additional info. I was running Debian kernel 3.13.1 for the first part,
> then 3.14.1 when I upgraded ceph to 0.82.
>
> Regards,
> Hong