That I can't help you with. I'm a pure RadosGW user. But OSD stability affects everybody. :-P

On Fri, Jul 18, 2014 at 2:34 PM, hjcho616 <hjcho616 at yahoo.com> wrote:

> Thanks Craig. I will try this soon. BTW should I upgrade to 0.80.4
> first? The MDS journal issue seems to be one of the issues I am running
> into.
>
> Regards,
> Hong
>
>
> On Friday, July 18, 2014 4:14 PM, Craig Lewis <clewis at centraldesktop.com>
> wrote:
>
>
> If osd.3, osd.4, and osd.5 are stable, your cluster should be working
> again. What does ceph status say?
>
>
> I was able to re-add a removed osd. Here's what I did on my dev cluster:
>
> stop ceph-osd id=0
> ceph osd down 0
> ceph osd out 0
> ceph osd rm 0
> ceph osd crush rm osd.0
>
> Now my osd tree and osd dump do not show osd.0. The cluster was degraded,
> but did not do any backfilling because I require 3x replication on 3
> different hosts, and Ceph can't satisfy that with 2 osds.
>
> On the same host, I ran:
>
> ceph osd create   # Returned ID 0
> start ceph-osd id=0
>
> osd.0 started up and joined the cluster. Once peering completed, all of
> the PGs recovered quickly. I didn't have any writes on the cluster while
> I was doing this.
>
> So it looks like you can just re-create and start those deleted osds.
>
> In your situation, I would do the following. Before you start, go through
> this and make sure you understand all the steps. Worst case, you can
> always undo this by removing the osds again, and you'll be back to where
> you are now.
>
> ceph osd set nobackfill
> ceph osd set norecover
> ceph osd set noin
> ceph osd create   # Should return 0. Abort if it doesn't.
> ceph osd create   # Should return 1. Abort if it doesn't.
> ceph osd create   # Should return 2. Abort if it doesn't.
> start ceph-osd id=0
>
> Watch ceph -w and top. Hopefully ceph-osd id=0 will use some CPU, then
> go UP, and drop to 0% CPU. If so:
>
> ceph osd unset noin
> restart ceph-osd id=0
>
> Now osd.0 should go UP and IN, use some CPU for a while, then drop to 0%
> CPU. If osd.0 drops out now, set noout and shut it down.
>
> Set noin again, and start osd.1. When it's stable, do it again for osd.2.
>
> Once as many as possible are up and stable:
>
> ceph osd unset nobackfill
> ceph osd unset norecover
>
> Now it should start recovering. If your osds start dropping out now, set
> noout and shut down the ones that are having problems.
>
> The goal is to get all the stable osds up, in, and recovered. Once that's
> done, we can figure out what to do with the unstable osds.
>
>
> On Thu, Jul 17, 2014 at 9:29 PM, hjcho616 <hjcho616 at yahoo.com> wrote:
>
> Sorry Craig. I thought I sent both, but the second part didn't copy right.
> For some reason the MDS and MON decided to stop overnight, so I started
> them while I was running those commands. Interestingly, the MDS didn't
> fail at the time like it used to, so I thought something was being fixed?
> I now realize the MDS probably couldn't get to the data because the OSDs
> were down. Now that I brought up the OSDs, the MDS crashed again.
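Condensed, the phased re-add that Craig outlines above looks roughly like the following. This is a sketch only: it assumes Upstart-managed osds as in his commands, that osd ids 0-2 come back from ceph osd create in order, and that the recovery flag is spelled norecover on your release; adapt the service commands to your init system before running anything.

    ceph osd set nobackfill
    ceph osd set norecover
    ceph osd set noin
    ceph osd create            # expect id 0; stop if you get anything else
    ceph osd create            # expect id 1
    ceph osd create            # expect id 2
    start ceph-osd id=0        # watch "ceph -w" and top while it peers
    ceph osd unset noin        # once osd.0 is up and its CPU settles
    restart ceph-osd id=0
    # set noin again, then repeat the start/watch/unset cycle for id=1 and id=2
    ceph osd unset nobackfill  # only after every stable osd is up and in
    ceph osd unset norecover

The point of the flags is to keep the cluster from starting recovery or marking the fresh osds in until each one has proven it can stay up.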
=P > > $ ceph osd tree > # id weight type name up/down reweight > -1 5.46 root default > -2 0 host OSD1 > -3 5.46 host OSD2 > 3 1.82 osd.3 up 1 > 4 1.82 osd.4 up 1 > 5 1.82 osd.5 up 1 > > $ ceph osd dump > epoch 3125 > fsid 9b2c9bca-112e-48b0-86fc-587ef9a52948 > created 2014-02-08 01:57:34.086532 > modified 2014-07-17 23:24:10.823596 > flags > pool 0 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash > rjenkins pg_num 64 pgp_num 64 last_change 1 crash_replay_interval 45 > stripe_width 0 > pool 1 'metadata' replicated size 2 min_size 1 crush_ruleset 1 object_hash > rjenkins pg_num 64 pgp_num 64 last_change 1 stripe_width 0 > pool 2 'rbd' replicated size 2 min_size 1 crush_ruleset 2 object_hash > rjenkins pg_num 64 pgp_num 64 last_change 1 stripe_width 0 > max_osd 6 > osd.3 up in weight 1 up_from 3120 up_thru 3122 down_at 3116 > last_clean_interval [2858,3113) 192.168.1.31:6803/13623 > 192.168.2.31:6802/13623 192.168.2.31:6803/13623 192.168.1.31:6804/13623 > exists,up 4f86a418-6c67-4cb4-83a1-6c123c890036 > osd.4 up in weight 1 up_from 3121 up_thru 3122 down_at 3116 > last_clean_interval [2859,3113) 192.168.1.31:6806/13991 > 192.168.2.31:6804/13991 192.168.2.31:6805/13991 192.168.1.31:6807/13991 > exists,up 3d5e3843-7a47-44b0-b276-61c4b1d62900 > osd.5 up in weight 1 up_from 3118 up_thru 3118 down_at 3116 > last_clean_interval [2856,3113) 192.168.1.31:6800/13249 > 192.168.2.31:6800/13249 192.168.2.31:6801/13249 192.168.1.31:6801/13249 > exists,up eec86483-2f35-48a4-a154-2eaf26be06b9 > pg_temp 0.2 [4,3] > pg_temp 0.a [4,5] > pg_temp 0.c [3,4] > pg_temp 0.10 [3,4] > pg_temp 0.15 [3,5] > pg_temp 0.17 [3,5] > pg_temp 0.2f [4,5] > pg_temp 0.3b [4,3] > pg_temp 0.3c [3,5] > pg_temp 0.3d [4,5] > pg_temp 1.1 [4,3] > pg_temp 1.9 [4,5] > pg_temp 1.b [3,4] > pg_temp 1.14 [3,5] > pg_temp 1.16 [3,5] > pg_temp 1.2e [4,5] > pg_temp 1.3a [4,3] > pg_temp 1.3b [3,5] > pg_temp 1.3c [4,5] > pg_temp 2.0 [4,3] > pg_temp 2.8 [4,5] > pg_temp 2.a [3,4] > pg_temp 2.13 [3,5] > pg_temp 2.15 [3,5] > pg_temp 2.2d [4,5] > pg_temp 2.39 [4,3] > pg_temp 2.3a [3,5] > pg_temp 2.3b [4,5] > blacklist 192.168.1.20:6802/30894 expires 2014-07-17 23:48:10.823576 > blacklist 192.168.1.20:6801/30651 expires 2014-07-17 23:47:55.562984 > > Regards, > Hong > > > On Thursday, July 17, 2014 3:30 PM, Craig Lewis < > clewis at centraldesktop.com> wrote: > > > You gave me 'ceph osd dump' twice.... can I see 'ceph osd tree' too? > > Why are osd.3, osd.4, and osd.5 down? > > > On Thu, Jul 17, 2014 at 11:45 AM, hjcho616 <hjcho616 at yahoo.com> wrote: > > Thank you for looking at this. Below are the outputs you requested. 
> > # ceph osd dump > epoch 3117 > fsid 9b2c9bca-112e-48b0-86fc-587ef9a52948 > created 2014-02-08 01:57:34.086532 > modified 2014-07-16 22:13:04.385914 > flags > pool 0 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash > rjenkins pg_num 64 pgp_num 64 last_change 1 crash_replay_interval 45 > stripe_width 0 > pool 1 'metadata' replicated size 2 min_size 1 crush_ruleset 1 object_hash > rjenkins pg_num 64 pgp_num 64 last_change 1 stripe_width 0 > pool 2 'rbd' replicated size 2 min_size 1 crush_ruleset 2 object_hash > rjenkins pg_num 64 pgp_num 64 last_change 1 stripe_width 0 > max_osd 6 > osd.3 down in weight 1 up_from 2858 up_thru 3040 down_at 3116 > last_clean_interval [2830,2851) 192.168.1.31:6803/5127 > 192.168.2.31:6805/5127 192.168.2.31:6806/5127 192.168.1.31:6805/5127 > exists 4f86a418-6c67-4cb4-83a1-6c123c890036 > osd.4 down in weight 1 up_from 2859 up_thru 3043 down_at 3116 > last_clean_interval [2835,2849) 192.168.1.31:6807/5310 > 192.168.2.31:6807/5310 192.168.2.31:6808/5310 192.168.1.31:6808/5310 > exists 3d5e3843-7a47-44b0-b276-61c4b1d62900 > osd.5 down in weight 1 up_from 2856 up_thru 3042 down_at 3116 > last_clean_interval [2837,2853) 192.168.1.31:6800/4969 > 192.168.2.31:6801/4969 192.168.2.31:6804/4969 192.168.1.31:6801/4969 > exists eec86483-2f35-48a4-a154-2eaf26be06b9 > > # ceph osd dump > epoch 3117 > fsid 9b2c9bca-112e-48b0-86fc-587ef9a52948 > created 2014-02-08 01:57:34.086532 > modified 2014-07-16 22:13:04.385914 > flags > pool 0 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash > rjenkins pg_num 64 pgp_num 64 last_change 1 crash_replay_interval 45 > stripe_width 0 > pool 1 'metadata' replicated size 2 min_size 1 crush_ruleset 1 object_hash > rjenkins pg_num 64 pgp_num 64 last_change 1 stripe_width 0 > pool 2 'rbd' replicated size 2 min_size 1 crush_ruleset 2 object_hash > rjenkins pg_num 64 pgp_num 64 last_change 1 stripe_width 0 > max_osd 6 > osd.3 down in weight 1 up_from 2858 up_thru 3040 down_at 3116 > last_clean_interval [2830,2851) 192.168.1.31:6803/5127 > 192.168.2.31:6805/5127 192.168.2.31:6806/5127 192.168.1.31:6805/5127 > exists 4f86a418-6c67-4cb4-83a1-6c123c890036 > osd.4 down in weight 1 up_from 2859 up_thru 3043 down_at 3116 > last_clean_interval [2835,2849) 192.168.1.31:6807/5310 > 192.168.2.31:6807/5310 192.168.2.31:6808/5310 192.168.1.31:6808/5310 > exists 3d5e3843-7a47-44b0-b276-61c4b1d62900 > osd.5 down in weight 1 up_from 2856 up_thru 3042 down_at 3116 > last_clean_interval [2837,2853) 192.168.1.31:6800/4969 > 192.168.2.31:6801/4969 192.168.2.31:6804/4969 192.168.1.31:6801/4969 > exists eec86483-2f35-48a4-a154-2eaf26be06b9 > > Regards, > Hong > > > > On Thursday, July 17, 2014 12:02 PM, Craig Lewis < > clewis at centraldesktop.com> wrote: > > > I don't believe you can re-add an OSD after `ceph osd rm`, but it's worth > a shot. Let me see what I can do on my dev cluster. > > What does `ceph osd dump` and `ceph osd tree` say? I want to make sure > I'm starting from the same point you are. > > > > On Wed, Jul 16, 2014 at 7:39 PM, hjcho616 <hjcho616 at yahoo.com> wrote: > > I did a "ceph osd rm" for all three but I didn't do anything else to it > afterwards. Can this be added back? > > Regards, > Hong > > > On Wednesday, July 16, 2014 6:54 PM, Craig Lewis < > clewis at centraldesktop.com> wrote: > > > For some reason you ended up in my spam folder. That might be why you > didn't get any responses. > > > Have you destroyed osd.0, osd.1, and osd.2? If not, try bringing them up > one a time. 
You might have just one bad disk, which is much better than > 50% of your disks. > > How is the ceph-osd process behaving when it hits the suicide timeout? I > had some problems a while back where the ceph-osd process would startup, > start consuming ~200% CPU for a while, then get stuck using almost exactly > 100% CPU. It would get kicked out of the cluster for being unresponsive, > then suicide. Repeat. If that's happening here, I can suggest some things > to try. > > > > > > On Fri, Jul 11, 2014 at 9:12 PM, hjcho616 <hjcho616 at yahoo.com> wrote: > > I have 2 OSD machines with 3 OSD running on each. One MDS server with 3 > daemons running. Ran cephfs mostly on 0.78. One night we lost power for > split second. MDS1 and OSD2 went down, OSD1 seemed OK, well turns out OSD1 > suffered most. Those two machines rebooted and seemed ok except it had > some inconsistencies. I waited for a while, didn't fix itself. So I > issued 'ceph pg repair pgnum'. It would try some and some OSD would crash. > Tried this for multiple days. Got some PGs fixed... but mostly it would > crash an OSD and stop recovering. dmesg shows something like below. > > > > [ 740.059498] traps: ceph-osd[5279] general protection ip:7f84e75ec75e > sp:7fff00045bc0 error:0 in libtcmalloc.so.4.1.0[7f84e75b3000+4a000] > > and ceph osd log shows something like this. > > -2> 2014-07-09 20:51:01.163571 7fe0f4617700 1 heartbeat_map > is_healthy 'FileStore::op_tp thread 0x7fe0e8e91700' had timed out after 60 > -1> 2014-07-09 20:51:01.163609 7fe0f4617700 1 heartbeat_map > is_healthy 'FileStore::op_tp thread 0x7fe0e8e91700' had suicide timed out > after 180 > 0> 2014-07-09 20:51:01.169542 7fe0f4617700 -1 common/HeartbeatMap.cc: > In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, > const char*, time_t)' thread 7fe0f4617700 time 2014-07-09 20:51:01.163642 > common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout") > > ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e) > 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, > long)+0x2eb) [0xad2cbb] > 2: (ceph::HeartbeatMap::is_healthy()+0xb6) [0xad34c6] > 3: (ceph::HeartbeatMap::check_touch_file()+0x28) [0xad3aa8] > 4: (CephContextServiceThread::entry()+0x13f) [0xb9911f] > 5: (()+0x8062) [0x7fe0f797e062] > 6: (clone()+0x6d) [0x7fe0f62bea3d] > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed > to interpret this. 
> > --- logging levels --- > 0/ 5 none > 0/ 1 lockdep > 0/ 1 context > 1/ 1 crush > 1/ 5 mds > 1/ 5 mds_balancer > 1/ 5 mds_locker > 1/ 5 mds_log > 1/ 5 mds_log_expire > 1/ 5 mds_migrator > 0/ 1 buffer > 0/ 1 timer > 0/ 1 filer > 0/ 1 striper > 0/ 1 objecter > 0/ 5 rados > 0/ 5 rbd > 0/ 5 journaler > 0/ 5 objectcacher > 0/ 5 client > 0/ 5 osd > 0/ 5 optracker > 0/ 5 objclass > 1/ 3 filestore > 1/ 3 keyvaluestore > 1/ 3 journal > 0/ 5 ms > 1/ 5 mon > 0/10 monc > 1/ 5 paxos > 0/ 5 tp > 1/ 5 auth > 1/ 5 crypto > 1/ 1 finisher > 1/ 5 heartbeatmap > 1/ 5 perfcounter > 1/ 5 rgw > 1/ 5 javaclient > 1/ 5 asok > 1/ 1 throttle > -2/-2 (syslog threshold) > -1/-1 (stderr threshold) > max_recent 10000 > max_new 1000 > log_file /var/log/ceph/ceph-osd.0.log > --- end dump of recent events --- > 2014-07-09 20:51:01.534706 7fe0f4617700 -1 *** Caught signal (Aborted) ** > in thread 7fe0f4617700 > > > ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e) > 1: /usr/bin/ceph-osd() [0xaac562] > 2: (()+0xf880) [0x7fe0f7985880] > 3: (gsignal()+0x39) [0x7fe0f620e3a9] > 4: (abort()+0x148) [0x7fe0f62114c8] > 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fe0f6afb5e5] > 6: (()+0x5e746) [0x7fe0f6af9746] > 7: (()+0x5e773) [0x7fe0f6af9773] > 8: (()+0x5e9b2) [0x7fe0f6af99b2] > 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char > const*)+0x40a) [0xb85b6a] > 10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, > long)+0x2eb) [0xad2cbb] > 11: (ceph::HeartbeatMap::is_healthy()+0xb6) [0xad34c6] > 12: (ceph::HeartbeatMap::check_touch_file()+0x28) [0xad3aa8] > 13: (CephContextServiceThread::entry()+0x13f) [0xb9911f] > 14: (()+0x8062) [0x7fe0f797e062] > 15: (clone()+0x6d) [0x7fe0f62bea3d] > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed > to interpret this. > > --- begin dump of recent events --- > 0> 2014-07-09 20:51:01.534706 7fe0f4617700 -1 *** Caught signal > (Aborted) ** > in thread 7fe0f4617700 > > ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e) > 1: /usr/bin/ceph-osd() [0xaac562] > 2: (()+0xf880) [0x7fe0f7985880] > 3: (gsignal()+0x39) [0x7fe0f620e3a9] > 4: (abort()+0x148) [0x7fe0f62114c8] > 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fe0f6afb5e5] > 6: (()+0x5e746) [0x7fe0f6af9746] > 7: (()+0x5e773) [0x7fe0f6af9773] > 8: (()+0x5e9b2) [0x7fe0f6af99b2] > 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char > const*)+0x40a) [0xb85b6a] > 10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, > long)+0x2eb) [0xad2cbb] > 11: (ceph::HeartbeatMap::is_healthy()+0xb6) [0xad34c6] > 12: (ceph::HeartbeatMap::check_touch_file()+0x28) [0xad3aa8] > 13: (CephContextServiceThread::entry()+0x13f) [0xb9911f] > 14: (()+0x8062) [0x7fe0f797e062] > 15: (clone()+0x6d) [0x7fe0f62bea3d] > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed > to interpret this. 
> > --- logging levels --- > 0/ 5 none > 0/ 1 lockdep > 0/ 1 context > 1/ 1 crush > 1/ 5 mds > 1/ 5 mds_balancer > 1/ 5 mds_locker > 1/ 5 mds_log > 1/ 5 mds_log_expire > 1/ 5 mds_migrator > 0/ 1 buffer > 0/ 1 timer > 0/ 1 filer > 0/ 1 striper > 0/ 1 objecter > 0/ 5 rados > 0/ 5 rbd > 0/ 5 journaler > 0/ 5 objectcacher > 0/ 5 client > 0/ 5 osd > 0/ 5 optracker > 0/ 5 objclass > 1/ 3 filestore > 1/ 3 keyvaluestore > 1/ 3 journal > 0/ 5 ms > 1/ 5 mon > 0/10 monc > 1/ 5 paxos > 0/ 5 tp > 1/ 5 auth > 1/ 5 crypto > 1/ 1 finisher > 1/ 5 heartbeatmap > 1/ 5 perfcounter > 1/ 5 rgw > 1/ 5 javaclient > 1/ 5 asok > 1/ 1 throttle > -2/-2 (syslog threshold) > -1/-1 (stderr threshold) > max_recent 10000 > max_new 1000 > log_file /var/log/ceph/ceph-osd.0.log > --- end dump of recent events --- > > After several attempts at it, osd.2 (which was on OSD1 which survived the > power event) never comes up. Looks like journal was corrupted > > -1> 2014-07-09 20:44:14.992840 7f12256b67c0 -1 journal Unable to read > past sequence 2157634 but header indicates the journal has committed up > through 2157670, journal is corrupt > 0> 2014-07-09 20:44:14.998742 7f12256b67c0 -1 os/FileJournal.cc: In > function 'bool FileJournal::read_entry(ceph::bufferlist&, uint64_t&, > bool*)' thread 7f12256b67c0 time 2014-07-09 20:44:14.993082 > os/FileJournal.cc: 1677: FAILED assert(0) > > ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e) > 1: (FileJournal::read_entry(ceph::buffer::list&, unsigned long&, > bool*)+0x467) [0xa8d497] > 2: (JournalingObjectStore::journal_replay(unsigned long)+0x22e) [0x9dfebe] > 3: (FileStore::mount()+0x32c9) [0x9b7939] > 4: (OSD::do_convertfs(ObjectStore*)+0x1a) [0x78d8fa] > 5: (main()+0x2237) [0x730837] > 6: (__libc_start_main()+0xf5) [0x7f12236bdb45] > 7: /usr/bin/ceph-osd() [0x734479] > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed > to interpret this. 
> > --- logging levels --- > 0/ 5 none > 0/ 1 lockdep > 0/ 1 context > 1/ 1 crush > 1/ 5 mds > 1/ 5 mds_balancer > 1/ 5 mds_locker > 1/ 5 mds_log > 1/ 5 mds_log_expire > 1/ 5 mds_migrator > 0/ 1 buffer > 0/ 1 timer > 0/ 1 filer > 0/ 1 striper > 0/ 1 objecter > 0/ 5 rados > 0/ 5 rbd > 0/ 5 journaler > 0/ 5 objectcacher > 0/ 5 client > 0/ 5 osd > 0/ 5 optracker > 0/ 5 objclass > 1/ 3 filestore > 1/ 3 keyvaluestore > 1/ 3 journal > 0/ 5 ms > 1/ 5 mon > 0/10 monc > 1/ 5 paxos > 0/ 5 tp > 1/ 5 auth > 1/ 5 crypto > 1/ 1 finisher > 1/ 5 heartbeatmap > 1/ 5 perfcounter > 1/ 5 rgw > 1/ 5 javaclient > 1/ 5 asok > 1/ 1 throttle > -2/-2 (syslog threshold) > -1/-1 (stderr threshold) > max_recent 10000 > max_new 1000 > log_file /var/log/ceph/ceph-osd.2.log > --- end dump of recent events --- > 2014-07-09 20:44:15.010090 7f12256b67c0 -1 *** Caught signal (Aborted) ** > in thread 7f12256b67c0 > ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e) > 1: /usr/bin/ceph-osd() [0xaac562] > 2: (()+0xf880) [0x7f1224e48880] > 3: (gsignal()+0x39) [0x7f12236d13a9] > 4: (abort()+0x148) [0x7f12236d44c8] > 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f1223fbe5e5] > 6: (()+0x5e746) [0x7f1223fbc746] > 7: (()+0x5e773) [0x7f1223fbc773] > 8: (()+0x5e9b2) [0x7f1223fbc9b2] > 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char > const*)+0x40a) [0xb85b6a] > 10: (FileJournal::read_entry(ceph::buffer::list&, unsigned long&, > bool*)+0x467) [0xa8d497] > 11: (JournalingObjectStore::journal_replay(unsigned long)+0x22e) > [0x9dfebe] > 12: (FileStore::mount()+0x32c9) [0x9b7939] > 13: (OSD::do_convertfs(ObjectStore*)+0x1a) [0x78d8fa] > 14: (main()+0x2237) [0x730837] > 15: (__libc_start_main()+0xf5) [0x7f12236bdb45] > 16: /usr/bin/ceph-osd() [0x734479] > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed > to interpret this. > > --- begin dump of recent events --- > 0> 2014-07-09 20:44:15.010090 7f12256b67c0 -1 *** Caught signal > (Aborted) ** > in thread 7f12256b67c0 > > ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e) > 1: /usr/bin/ceph-osd() [0xaac562] > 2: (()+0xf880) [0x7f1224e48880] > 3: (gsignal()+0x39) [0x7f12236d13a9] > 4: (abort()+0x148) [0x7f12236d44c8] > 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f1223fbe5e5] > 6: (()+0x5e746) [0x7f1223fbc746] > 7: (()+0x5e773) [0x7f1223fbc773] > 8: (()+0x5e9b2) [0x7f1223fbc9b2] > 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char > const*)+0x40a) [0xb85b6a] > 10: (FileJournal::read_entry(ceph::buffer::list&, unsigned long&, > bool*)+0x467) [0xa8d497] > 11: (JournalingObjectStore::journal_replay(unsigned long)+0x22e) > [0x9dfebe] > 12: (FileStore::mount()+0x32c9) [0x9b7939] > 13: (OSD::do_convertfs(ObjectStore*)+0x1a) [0x78d8fa] > 14: (main()+0x2237) [0x730837] > 15: (__libc_start_main()+0xf5) [0x7f12236bdb45] > 16: /usr/bin/ceph-osd() [0x734479] > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed > to interpret this. 
> > --- logging levels --- > 0/ 5 none > 0/ 1 lockdep > 0/ 1 context > 1/ 1 crush > 1/ 5 mds > 1/ 5 mds_balancer > 1/ 5 mds_locker > 1/ 5 mds_log > 1/ 5 mds_log_expire > 1/ 5 mds_migrator > 0/ 1 buffer > 0/ 1 timer > 0/ 1 filer > 0/ 1 striper > 0/ 1 objecter > 0/ 5 rados > 0/ 5 rbd > 0/ 5 journaler > 0/ 5 objectcacher > 0/ 5 client > 0/ 5 osd > 0/ 5 optracker > 0/ 5 objclass > 1/ 3 filestore > 1/ 3 keyvaluestore > 1/ 3 journal > 0/ 5 ms > 1/ 5 mon > 0/10 monc > 1/ 5 paxos > 0/ 5 tp > 1/ 5 auth > 1/ 5 crypto > 1/ 1 finisher > 1/ 5 heartbeatmap > 1/ 5 perfcounter > 1/ 5 rgw > 1/ 5 javaclient > 1/ 5 asok > 1/ 1 throttle > -2/-2 (syslog threshold) > -1/-1 (stderr threshold) > max_recent 10000 > max_new 1000 > log_file /var/log/ceph/ceph-osd.2.log > --- end dump of recent events --- > > > So I thought maybe upgrading 0.82 would give it a better option at fixing > things... so I did, now not only those OSDs fail (osd.1 is up but with 14M > of memory only... I assume that's broky too), but MDS fails too. > > # /usr/bin/ceph-mds -i MDS1 --pid-file /var/run/ceph/mds.MDS1.pid -c > /etc/ceph/ceph.conf --cluster ceph -f --debug-mds=20 --debug-journaler=10 > starting mds.MDS1 at :/0 > mds/MDLog.cc: In function 'void MDLog::_replay_thread()' thread > 7f8e07c21700 time 2014-07-09 21:01:10.190965 > mds/MDLog.cc: 815: FAILED assert(journaler->is_readable()) > ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e) > 1: (MDLog::_replay_thread()+0x197b) [0x85a3cb] > 2: (MDLog::ReplayThread::entry()+0xd) [0x66466d] > 3: (()+0x8062) [0x7f8e0fe3f062] > 4: (clone()+0x6d) [0x7f8e0ebd3a3d] > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed > to interpret this. > 2014-07-09 21:01:10.192936 7f8e07c21700 -1 mds/MDLog.cc: In function 'void > MDLog::_replay_thread()' thread 7f8e07c21700 time 2014-07-09 21:01:10.190965 > mds/MDLog.cc: 815: FAILED assert(journaler->is_readable()) > > ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e) > 1: (MDLog::_replay_thread()+0x197b) [0x85a3cb] > 2: (MDLog::ReplayThread::entry()+0xd) [0x66466d] > 3: (()+0x8062) [0x7f8e0fe3f062] > 4: (clone()+0x6d) [0x7f8e0ebd3a3d] > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed > to interpret this. > > 0> 2014-07-09 21:01:10.192936 7f8e07c21700 -1 mds/MDLog.cc: In > function 'void MDLog::_replay_thread()' thread 7f8e07c21700 time 2014-07-09 > 21:01:10.190965 > mds/MDLog.cc: 815: FAILED assert(journaler->is_readable()) > > ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e) > 1: (MDLog::_replay_thread()+0x197b) [0x85a3cb] > 2: (MDLog::ReplayThread::entry()+0xd) [0x66466d] > 3: (()+0x8062) [0x7f8e0fe3f062] > 4: (clone()+0x6d) [0x7f8e0ebd3a3d] > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed > to interpret this. 
> > terminate called after throwing an instance of 'ceph::FailedAssertion' > *** Caught signal (Aborted) ** > in thread 7f8e07c21700 > ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e) > 1: /usr/bin/ceph-mds() [0x8d81f2] > 2: (()+0xf880) [0x7f8e0fe46880] > 3: (gsignal()+0x39) [0x7f8e0eb233a9] > 4: (abort()+0x148) [0x7f8e0eb264c8] > 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f8e0f4105e5] > 6: (()+0x5e746) [0x7f8e0f40e746] > 7: (()+0x5e773) [0x7f8e0f40e773] > 8: (()+0x5e9b2) [0x7f8e0f40e9b2] > 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char > const*)+0x40a) [0x9ab5da] > 10: (MDLog::_replay_thread()+0x197b) [0x85a3cb] > 11: (MDLog::ReplayThread::entry()+0xd) [0x66466d] > 12: (()+0x8062) [0x7f8e0fe3f062] > 13: (clone()+0x6d) [0x7f8e0ebd3a3d] > 2014-07-09 21:01:10.201968 7f8e07c21700 -1 *** Caught signal (Aborted) ** > in thread 7f8e07c21700 > > ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e) > 1: /usr/bin/ceph-mds() [0x8d81f2] > 2: (()+0xf880) [0x7f8e0fe46880] > 3: (gsignal()+0x39) [0x7f8e0eb233a9] > 4: (abort()+0x148) [0x7f8e0eb264c8] > 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f8e0f4105e5] > 6: (()+0x5e746) [0x7f8e0f40e746] > 7: (()+0x5e773) [0x7f8e0f40e773] > 8: (()+0x5e9b2) [0x7f8e0f40e9b2] > 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char > const*)+0x40a) [0x9ab5da] > 10: (MDLog::_replay_thread()+0x197b) [0x85a3cb] > 11: (MDLog::ReplayThread::entry()+0xd) [0x66466d] > 12: (()+0x8062) [0x7f8e0fe3f062] > 13: (clone()+0x6d) [0x7f8e0ebd3a3d] > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed > to interpret this. > > 0> 2014-07-09 21:01:10.201968 7f8e07c21700 -1 *** Caught signal > (Aborted) ** > in thread 7f8e07c21700 > > ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e) > 1: /usr/bin/ceph-mds() [0x8d81f2] > 2: (()+0xf880) [0x7f8e0fe46880] > 3: (gsignal()+0x39) [0x7f8e0eb233a9] > 4: (abort()+0x148) [0x7f8e0eb264c8] > 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f8e0f4105e5] > 6: (()+0x5e746) [0x7f8e0f40e746] > 7: (()+0x5e773) [0x7f8e0f40e773] > 8: (()+0x5e9b2) [0x7f8e0f40e9b2] > 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char > const*)+0x40a) [0x9ab5da] > 10: (MDLog::_replay_thread()+0x197b) [0x85a3cb] > 11: (MDLog::ReplayThread::entry()+0xd) [0x66466d] > 12: (()+0x8062) [0x7f8e0fe3f062] > 13: (clone()+0x6d) [0x7f8e0ebd3a3d] > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed > to interpret this. > > Aborted > root at MDS1:/var/log/ceph# /usr/bin/ceph-mds -i MDS1 --pid-file > /var/run/ceph/mds.MDS1.pid -c /etc/ceph/ceph.conf --cluster ceph -f > --debug-mds=20 --debug-journaler=10 > starting mds.MDS1 at :/0 > > > mds/MDLog.cc: In function 'void MDLog::_replay_thread()' thread > 7fb7f7b83700 time 2014-07-09 23:21:43.383304 > > mds/MDLog.cc: 815: FAILED assert(journaler->is_readable()) > > > ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e) > > > 1: (MDLog::_replay_thread()+0x197b) [0x85a3cb] > > > 2: (MDLog::ReplayThread::entry()+0xd) [0x66466d] > > > 3: (()+0x8062) [0x7fb7ffda1062] > > > 4: (clone()+0x6d) [0x7fb7feb35a3d] > > > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed > to interpret this. 
> > 2014-07-09 23:21:43.385274 7fb7f7b83700 -1 mds/MDLog.cc: In function 'void > MDLog::_replay_thread()' thread 7fb7f7b83700 time 2014-07-09 > 23:21:43.383304 > > mds/MDLog.cc: 815: FAILED assert(journaler->is_readable()) > > > > > > ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e) > > > 1: (MDLog::_replay_thread()+0x197b) [0x85a3cb] > > > 2: (MDLog::ReplayThread::entry()+0xd) [0x66466d] > > > 3: (()+0x8062) [0x7fb7ffda1062] > > > 4: (clone()+0x6d) [0x7fb7feb35a3d] > > > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed > to interpret this. > > > > > 0> 2014-07-09 23:21:43.385274 7fb7f7b83700 -1 mds/MDLog.cc: In > function 'void MDLog::_replay_thread()' thread 7fb7f7b83700 time 2014-07-09 > 23:21:43.383304 > mds/MDLog.cc: 815: FAILED assert(journaler->is_readable()) > > > > > > ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e) > > > 1: (MDLog::_replay_thread()+0x197b) [0x85a3cb] > > > 2: (MDLog::ReplayThread::entry()+0xd) [0x66466d] > > > 3: (()+0x8062) [0x7fb7ffda1062] > > > 4: (clone()+0x6d) [0x7fb7feb35a3d] > > > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed > to interpret this. > > > > > terminate called after throwing an instance of 'ceph::FailedAssertion' > > > *** Caught signal (Aborted) ** > in thread 7fb7f7b83700 > ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e) > 1: /usr/bin/ceph-mds() [0x8d81f2] > 2: (()+0xf880) [0x7fb7ffda8880] > 3: (gsignal()+0x39) [0x7fb7fea853a9] > 4: (abort()+0x148) [0x7fb7fea884c8] > 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fb7ff3725e5] > 6: (()+0x5e746) [0x7fb7ff370746] > 7: (()+0x5e773) [0x7fb7ff370773] > 8: (()+0x5e9b2) [0x7fb7ff3709b2] > 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char > const*)+0x40a) [0x9ab5da] > 10: (MDLog::_replay_thread()+0x197b) [0x85a3cb] > 11: (MDLog::ReplayThread::entry()+0xd) [0x66466d] > 12: (()+0x8062) [0x7fb7ffda1062] > 13: (clone()+0x6d) [0x7fb7feb35a3d] > 2014-07-09 23:21:43.394324 7fb7f7b83700 -1 *** Caught signal (Aborted) ** > in thread 7fb7f7b83700 > > ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e) > 1: /usr/bin/ceph-mds() [0x8d81f2] > 2: (()+0xf880) [0x7fb7ffda8880] > 3: (gsignal()+0x39) [0x7fb7fea853a9] > 4: (abort()+0x148) [0x7fb7fea884c8] > 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fb7ff3725e5] > 6: (()+0x5e746) [0x7fb7ff370746] > 7: (()+0x5e773) [0x7fb7ff370773] > 8: (()+0x5e9b2) [0x7fb7ff3709b2] > 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char > const*)+0x40a) [0x9ab5da] > 10: (MDLog::_replay_thread()+0x197b) [0x85a3cb] > 11: (MDLog::ReplayThread::entry()+0xd) [0x66466d] > 12: (()+0x8062) [0x7fb7ffda1062] > 13: (clone()+0x6d) [0x7fb7feb35a3d] > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed > to interpret this. 
> 0> 2014-07-09 23:21:43.394324 7fb7f7b83700 -1 *** Caught signal (Aborted) **
> in thread 7fb7f7b83700
>
> ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
> 1: /usr/bin/ceph-mds() [0x8d81f2]
> 2: (()+0xf880) [0x7fb7ffda8880]
> 3: (gsignal()+0x39) [0x7fb7fea853a9]
> 4: (abort()+0x148) [0x7fb7fea884c8]
> 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fb7ff3725e5]
> 6: (()+0x5e746) [0x7fb7ff370746]
> 7: (()+0x5e773) [0x7fb7ff370773]
> 8: (()+0x5e9b2) [0x7fb7ff3709b2]
> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x40a) [0x9ab5da]
> 10: (MDLog::_replay_thread()+0x197b) [0x85a3cb]
> 11: (MDLog::ReplayThread::entry()+0xd) [0x66466d]
> 12: (()+0x8062) [0x7fb7ffda1062]
> 13: (clone()+0x6d) [0x7fb7feb35a3d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
> to interpret this.
>
> Aborted
>
> Felt like OSD1 was trashed, so I removed osd.0, osd.1, and osd.2.
>
> Still seeing the status below, and can't get the MDS up.
>
> HEALTH_ERR 154 pgs degraded; 38 pgs inconsistent; 192 pgs stuck unclean;
> recovery 1374024/3513098 objects degraded (39.111%); 1374 scrub errors; mds
> cluster is degraded; mds MDS1 is laggy
>
> Is there something I can try to bring this file system up again? =P I
> would like to access some of that data again. Let me know if you need any
> additional info. I was running Debian kernel 3.13.1 for the first part,
> then 3.14.1 when I upgraded ceph to 0.82.
>
> Regards,
> Hong
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
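For the corrupt FileStore journal that osd.2 reported earlier in the thread ("header indicates the journal has committed up through 2157670, journal is corrupt"), a last-resort sketch would be to recreate the journal and let the osd come up without it. This is hedged advice, not something tested against this cluster: recreating the journal discards any transactions that were never replayed, so the osd's data must be treated as suspect and ideally re-backfilled afterwards, and the path below assumes the default /var/lib/ceph/osd/ceph-2 layout with a journal file (adjust if the journal is a separate partition).

    stop ceph-osd id=2
    ceph-osd -i 2 --flush-journal     # expected to fail if the journal really is corrupt
    cp -a /var/lib/ceph/osd/ceph-2/journal /root/osd.2-journal.bak   # keep a copy first
    ceph-osd -i 2 --mkjournal         # write a fresh, empty journal for osd.2
    start ceph-osd id=2

If osd.2's data is too far gone to trust, the safer route is the one discussed above: leave it out, get the remaining osds stable, and let the cluster re-replicate.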