Power Outage

If osd.3, osd.4, and osd.5 are stable, your cluster should be working
again.  What does ceph status say?


I was able to re-add a removed osd.
Here's what I did on my dev cluster:
stop ceph-osd id=0
ceph osd down 0
ceph osd out 0
ceph osd rm 0
ceph osd crush rm osd.0

Now my osd tree and osd dump do not show osd.0.  The cluster was degraded,
but did not do any backfilling because I require 3x replication on 3
different hosts, and Ceph can't satisfy that with 2 osds.
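
If you want to double-check that kind of constraint on your own pools,
something like this should work with the firefly-era CLI ('data' here is
just the pool name from your dump):

ceph osd pool get data size    # replica count required by the pool
ceph osd crush rule dump       # placement rules, e.g. one replica per host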

On the same host, I ran:
ceph osd create        # Returned ID 0
start ceph-osd id=0


osd.0 started up and joined the cluster.  Once peering completed, all of
the PGs recovered quickly.  I didn't have any writes on the cluster while I
was doing this.

So it looks like you can just re-create and start those deleted osds.



In your situation, I would do the following.  Before you start, go through
this, and make sure you understand all the steps.  Worst case, you can
always undo this by removing the osds again, and you'll be back to where
you are now.

ceph osd set nobackfill
ceph osd set norecover
ceph osd set noin
ceph osd create   # Should return 0.  Abort if it doesn't.
ceph osd create   # Should return 1.  Abort if it doesn't.
ceph osd create   # Should return 2.  Abort if it doesn't.
start ceph-osd id=0
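
You can sanity-check that the flags took effect at any point; they show up
on the flags line of the osdmap (exact spelling may vary by version):

ceph osd dump | grep flags    # expect something like: flags nobackfill,norecover,noin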

Watch ceph -w and top.  Hopefully ceph-osd id=0 will use some CPU, then go
UP, and drop to 0% CPU.  If so:
ceph osd unset noin
restart ceph-osd id=0

Now osd.0 should go UP and IN, use some CPU for a while, then drop to 0%
CPU.  If osd.0 drops out now, set noout, and shut it down.
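
A minimal sketch of that, using the same upstart-style job names as above:

ceph osd set noout     # keep the cluster from marking osds out
stop ceph-osd id=0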

Set noin again, and start osd.1.  When it's stable, do it again for osd.2.
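
Spelled out for osd.1 (osd.2 is the same pattern), one pass would look
something like:

ceph osd set noin
start ceph-osd id=1
# watch ceph -w and top as before; once osd.1 is UP and idle:
ceph osd unset noin
restart ceph-osd id=1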

Once as many as possible are up and stable:
ceph osd unset nobackfill
ceph osd unset norecover

Now it should start recovering.  If your osds start dropping out now, set
noout (as above), and shut down the ones that are having problems.


The goal is to get all the stable osds up, in, and recovered.  Once that's
done, we can figure out what to do with the unstable osds.

On Thu, Jul 17, 2014 at 9:29 PM, hjcho616 <hjcho616 at yahoo.com> wrote:

> Sorry Craig.  I thought I sent both, but the second part didn't copy right.
> For some reason, overnight the MDS and MON decided to stop, so I started
> them while I was running those commands.  Interestingly, MDS didn't fail at
> the time like it used to, so I thought something was being fixed?  Now I
> realize MDS probably couldn't get to the data because the OSDs were down.
> Now that I brought the OSDs up, MDS crashed again. =P
>
> $ ceph osd tree
> # id    weight  type name       up/down reweight
> -1      5.46    root default
> -2      0               host OSD1
> -3      5.46            host OSD2
> 3       1.82                    osd.3   up      1
> 4       1.82                    osd.4   up      1
> 5       1.82                    osd.5   up      1
>
> $ ceph osd dump
> epoch 3125
> fsid 9b2c9bca-112e-48b0-86fc-587ef9a52948
> created 2014-02-08 01:57:34.086532
> modified 2014-07-17 23:24:10.823596
> flags
> pool 0 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash
> rjenkins pg_num 64 pgp_num 64 last_change 1 crash_replay_interval 45
> stripe_width 0
> pool 1 'metadata' replicated size 2 min_size 1 crush_ruleset 1 object_hash
> rjenkins pg_num 64 pgp_num 64 last_change 1 stripe_width 0
> pool 2 'rbd' replicated size 2 min_size 1 crush_ruleset 2 object_hash
> rjenkins pg_num 64 pgp_num 64 last_change 1 stripe_width 0
> max_osd 6
> osd.3 up   in  weight 1 up_from 3120 up_thru 3122 down_at 3116
> last_clean_interval [2858,3113) 192.168.1.31:6803/13623
> 192.168.2.31:6802/13623 192.168.2.31:6803/13623 192.168.1.31:6804/13623
> exists,up 4f86a418-6c67-4cb4-83a1-6c123c890036
> osd.4 up   in  weight 1 up_from 3121 up_thru 3122 down_at 3116
> last_clean_interval [2859,3113) 192.168.1.31:6806/13991
> 192.168.2.31:6804/13991 192.168.2.31:6805/13991 192.168.1.31:6807/13991
> exists,up 3d5e3843-7a47-44b0-b276-61c4b1d62900
> osd.5 up   in  weight 1 up_from 3118 up_thru 3118 down_at 3116
> last_clean_interval [2856,3113) 192.168.1.31:6800/13249
> 192.168.2.31:6800/13249 192.168.2.31:6801/13249 192.168.1.31:6801/13249
> exists,up eec86483-2f35-48a4-a154-2eaf26be06b9
> pg_temp 0.2 [4,3]
> pg_temp 0.a [4,5]
> pg_temp 0.c [3,4]
> pg_temp 0.10 [3,4]
> pg_temp 0.15 [3,5]
> pg_temp 0.17 [3,5]
> pg_temp 0.2f [4,5]
> pg_temp 0.3b [4,3]
> pg_temp 0.3c [3,5]
> pg_temp 0.3d [4,5]
> pg_temp 1.1 [4,3]
> pg_temp 1.9 [4,5]
> pg_temp 1.b [3,4]
> pg_temp 1.14 [3,5]
> pg_temp 1.16 [3,5]
> pg_temp 1.2e [4,5]
> pg_temp 1.3a [4,3]
> pg_temp 1.3b [3,5]
> pg_temp 1.3c [4,5]
> pg_temp 2.0 [4,3]
> pg_temp 2.8 [4,5]
> pg_temp 2.a [3,4]
> pg_temp 2.13 [3,5]
> pg_temp 2.15 [3,5]
> pg_temp 2.2d [4,5]
> pg_temp 2.39 [4,3]
> pg_temp 2.3a [3,5]
> pg_temp 2.3b [4,5]
> blacklist 192.168.1.20:6802/30894 expires 2014-07-17 23:48:10.823576
> blacklist 192.168.1.20:6801/30651 expires 2014-07-17 23:47:55.562984
>
> Regards,
> Hong
>
>
>   On Thursday, July 17, 2014 3:30 PM, Craig Lewis <
> clewis at centraldesktop.com> wrote:
>
>
> You gave me 'ceph osd dump' twice.... can I see 'ceph osd tree' too?
>
> Why are osd.3, osd.4, and osd.5 down?
>
>
> On Thu, Jul 17, 2014 at 11:45 AM, hjcho616 <hjcho616 at yahoo.com> wrote:
>
> Thank you for looking at this.  Below are the outputs you requested.
>
> # ceph osd dump
> epoch 3117
> fsid 9b2c9bca-112e-48b0-86fc-587ef9a52948
> created 2014-02-08 01:57:34.086532
> modified 2014-07-16 22:13:04.385914
> flags
> pool 0 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash
> rjenkins pg_num 64 pgp_num 64 last_change 1 crash_replay_interval 45
> stripe_width 0
> pool 1 'metadata' replicated size 2 min_size 1 crush_ruleset 1 object_hash
> rjenkins pg_num 64 pgp_num 64 last_change 1 stripe_width 0
> pool 2 'rbd' replicated size 2 min_size 1 crush_ruleset 2 object_hash
> rjenkins pg_num 64 pgp_num 64 last_change 1 stripe_width 0
> max_osd 6
> osd.3 down in  weight 1 up_from 2858 up_thru 3040 down_at 3116
> last_clean_interval [2830,2851) 192.168.1.31:6803/5127
> 192.168.2.31:6805/5127 192.168.2.31:6806/5127 192.168.1.31:6805/5127
> exists 4f86a418-6c67-4cb4-83a1-6c123c890036
> osd.4 down in  weight 1 up_from 2859 up_thru 3043 down_at 3116
> last_clean_interval [2835,2849) 192.168.1.31:6807/5310
> 192.168.2.31:6807/5310 192.168.2.31:6808/5310 192.168.1.31:6808/5310
> exists 3d5e3843-7a47-44b0-b276-61c4b1d62900
> osd.5 down in  weight 1 up_from 2856 up_thru 3042 down_at 3116
> last_clean_interval [2837,2853) 192.168.1.31:6800/4969
> 192.168.2.31:6801/4969 192.168.2.31:6804/4969 192.168.1.31:6801/4969
> exists eec86483-2f35-48a4-a154-2eaf26be06b9
>
> # ceph osd dump
> epoch 3117
> fsid 9b2c9bca-112e-48b0-86fc-587ef9a52948
> created 2014-02-08 01:57:34.086532
> modified 2014-07-16 22:13:04.385914
> flags
> pool 0 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash
> rjenkins pg_num 64 pgp_num 64 last_change 1 crash_replay_interval 45
> stripe_width 0
> pool 1 'metadata' replicated size 2 min_size 1 crush_ruleset 1 object_hash
> rjenkins pg_num 64 pgp_num 64 last_change 1 stripe_width 0
> pool 2 'rbd' replicated size 2 min_size 1 crush_ruleset 2 object_hash
> rjenkins pg_num 64 pgp_num 64 last_change 1 stripe_width 0
> max_osd 6
> osd.3 down in  weight 1 up_from 2858 up_thru 3040 down_at 3116
> last_clean_interval [2830,2851) 192.168.1.31:6803/5127
> 192.168.2.31:6805/5127 192.168.2.31:6806/5127 192.168.1.31:6805/5127
> exists 4f86a418-6c67-4cb4-83a1-6c123c890036
> osd.4 down in  weight 1 up_from 2859 up_thru 3043 down_at 3116
> last_clean_interval [2835,2849) 192.168.1.31:6807/5310
> 192.168.2.31:6807/5310 192.168.2.31:6808/5310 192.168.1.31:6808/5310
> exists 3d5e3843-7a47-44b0-b276-61c4b1d62900
> osd.5 down in  weight 1 up_from 2856 up_thru 3042 down_at 3116
> last_clean_interval [2837,2853) 192.168.1.31:6800/4969
> 192.168.2.31:6801/4969 192.168.2.31:6804/4969 192.168.1.31:6801/4969
> exists eec86483-2f35-48a4-a154-2eaf26be06b9
>
> Regards,
> Hong
>
>
>
>   On Thursday, July 17, 2014 12:02 PM, Craig Lewis <
> clewis at centraldesktop.com> wrote:
>
>
> I don't believe you can re-add an OSD after `ceph osd rm`, but it's worth
> a shot.  Let me see what I can do on my dev cluster.
>
> What does `ceph osd dump` and `ceph osd tree` say?  I want to make sure
> I'm starting from the same point you are.
>
>
>
> On Wed, Jul 16, 2014 at 7:39 PM, hjcho616 <hjcho616 at yahoo.com> wrote:
>
> I did a "ceph osd rm" for all three, but I didn't do anything else to them
> afterwards.  Can they be added back?
>
> Regards,
> Hong
>
>
>   On Wednesday, July 16, 2014 6:54 PM, Craig Lewis <
> clewis at centraldesktop.com> wrote:
>
>
>  For some reason you ended up in my spam folder.  That might be why you
> didn't get any responses.
>
>
> Have you destroyed osd.0, osd.1, and osd.2?  If not, try bringing them up
> one at a time.  You might have just one bad disk, which is much better than
> 50% of your disks.
>
> How is the ceph-osd process behaving when it hits the suicide timeout?  I
> had some problems a while back where the ceph-osd process would start up,
> consume ~200% CPU for a while, then get stuck using almost exactly
> 100% CPU.  It would get kicked out of the cluster for being unresponsive,
> then suicide.  Repeat.  If that's happening here, I can suggest some things
> to try.
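>
> A minimal way to watch for that pattern, assuming the upstart-style job
> names used elsewhere in this thread, is something like:
>
> start ceph-osd id=0
> top -p $(pgrep -d, ceph-osd)   # healthy: a burst of CPU, then near 0%
> ceph -w                        # the osd should go UP and stay up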
>
>
>
>
>
> On Fri, Jul 11, 2014 at 9:12 PM, hjcho616 <hjcho616 at yahoo.com> wrote:
>
> I have 2 OSD machines with 3 OSDs running on each, and one MDS server with
> 3 daemons running.  Ran cephfs mostly on 0.78.  One night we lost power for
> a split second.  MDS1 and OSD2 went down; OSD1 seemed OK, but it turns out
> OSD1 suffered the most.  Those two machines rebooted and seemed OK, except
> there were some inconsistencies.  I waited for a while, but it didn't fix
> itself, so I issued 'ceph pg repair <pgnum>'.  It would try some PGs, and
> some OSD would crash.  Tried this for multiple days.  Got some PGs fixed...
> but mostly it would crash an OSD and stop recovering.  dmesg shows something
> like below.
>
>
>
> [  740.059498] traps: ceph-osd[5279] general protection ip:7f84e75ec75e
> sp:7fff00045bc0 error:0 in libtcmalloc.so.4.1.0[7f84e75b3000+4a000]
>
> and ceph osd log shows something like this.
>
>      -2> 2014-07-09 20:51:01.163571 7fe0f4617700  1 heartbeat_map
> is_healthy 'FileStore::op_tp thread 0x7fe0e8e91700' had timed out after 60
>     -1> 2014-07-09 20:51:01.163609 7fe0f4617700  1 heartbeat_map
> is_healthy 'FileStore::op_tp thread 0x7fe0e8e91700' had suicide timed out
> after 180
>      0> 2014-07-09 20:51:01.169542 7fe0f4617700 -1 common/HeartbeatMap.cc:
> In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*,
> const char*, time_t)' thread 7fe0f4617700 time 2014-07-09 20:51:01.163642
> common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
>
>  ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
>  1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*,
> long)+0x2eb) [0xad2cbb]
>  2: (ceph::HeartbeatMap::is_healthy()+0xb6) [0xad34c6]
>  3: (ceph::HeartbeatMap::check_touch_file()+0x28) [0xad3aa8]
>  4: (CephContextServiceThread::entry()+0x13f) [0xb9911f]
>  5: (()+0x8062) [0x7fe0f797e062]
>  6: (clone()+0x6d) [0x7fe0f62bea3d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
> to interpret this.
>
> --- logging levels ---
>    0/ 5 none
>    0/ 1 lockdep
>    0/ 1 context
>    1/ 1 crush
>    1/ 5 mds
>    1/ 5 mds_balancer
>    1/ 5 mds_locker
>    1/ 5 mds_log
>    1/ 5 mds_log_expire
>    1/ 5 mds_migrator
>    0/ 1 buffer
>    0/ 1 timer
>    0/ 1 filer
>    0/ 1 striper
>    0/ 1 objecter
>    0/ 5 rados
>    0/ 5 rbd
>    0/ 5 journaler
>    0/ 5 objectcacher
>    0/ 5 client
>    0/ 5 osd
>    0/ 5 optracker
>    0/ 5 objclass
>    1/ 3 filestore
>    1/ 3 keyvaluestore
>    1/ 3 journal
>    0/ 5 ms
>    1/ 5 mon
>    0/10 monc
>    1/ 5 paxos
>    0/ 5 tp
>    1/ 5 auth
>    1/ 5 crypto
>    1/ 1 finisher
>    1/ 5 heartbeatmap
>    1/ 5 perfcounter
>    1/ 5 rgw
>    1/ 5 javaclient
>    1/ 5 asok
>    1/ 1 throttle
>   -2/-2 (syslog threshold)
>   -1/-1 (stderr threshold)
>   max_recent     10000
>   max_new         1000
>   log_file /var/log/ceph/ceph-osd.0.log
> --- end dump of recent events ---
> 2014-07-09 20:51:01.534706 7fe0f4617700 -1 *** Caught signal (Aborted) **
>  in thread 7fe0f4617700
>
>
>  ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
>  1: /usr/bin/ceph-osd() [0xaac562]
>  2: (()+0xf880) [0x7fe0f7985880]
>  3: (gsignal()+0x39) [0x7fe0f620e3a9]
>  4: (abort()+0x148) [0x7fe0f62114c8]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fe0f6afb5e5]
>  6: (()+0x5e746) [0x7fe0f6af9746]
>  7: (()+0x5e773) [0x7fe0f6af9773]
>  8: (()+0x5e9b2) [0x7fe0f6af99b2]
>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x40a) [0xb85b6a]
>  10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*,
> long)+0x2eb) [0xad2cbb]
>  11: (ceph::HeartbeatMap::is_healthy()+0xb6) [0xad34c6]
>  12: (ceph::HeartbeatMap::check_touch_file()+0x28) [0xad3aa8]
>  13: (CephContextServiceThread::entry()+0x13f) [0xb9911f]
>  14: (()+0x8062) [0x7fe0f797e062]
>  15: (clone()+0x6d) [0x7fe0f62bea3d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
> to interpret this.
>
> --- begin dump of recent events ---
>      0> 2014-07-09 20:51:01.534706 7fe0f4617700 -1 *** Caught signal
> (Aborted) **
>  in thread 7fe0f4617700
>
>  ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
>  1: /usr/bin/ceph-osd() [0xaac562]
>  2: (()+0xf880) [0x7fe0f7985880]
>  3: (gsignal()+0x39) [0x7fe0f620e3a9]
>  4: (abort()+0x148) [0x7fe0f62114c8]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fe0f6afb5e5]
>  6: (()+0x5e746) [0x7fe0f6af9746]
>  7: (()+0x5e773) [0x7fe0f6af9773]
>  8: (()+0x5e9b2) [0x7fe0f6af99b2]
>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x40a) [0xb85b6a]
>  10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*,
> long)+0x2eb) [0xad2cbb]
>  11: (ceph::HeartbeatMap::is_healthy()+0xb6) [0xad34c6]
>  12: (ceph::HeartbeatMap::check_touch_file()+0x28) [0xad3aa8]
>  13: (CephContextServiceThread::entry()+0x13f) [0xb9911f]
>  14: (()+0x8062) [0x7fe0f797e062]
>  15: (clone()+0x6d) [0x7fe0f62bea3d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
> to interpret this.
>
> --- logging levels ---
>    0/ 5 none
>    0/ 1 lockdep
>    0/ 1 context
>    1/ 1 crush
>    1/ 5 mds
>    1/ 5 mds_balancer
>    1/ 5 mds_locker
>    1/ 5 mds_log
>    1/ 5 mds_log_expire
>    1/ 5 mds_migrator
>    0/ 1 buffer
>    0/ 1 timer
>    0/ 1 filer
>    0/ 1 striper
>    0/ 1 objecter
>    0/ 5 rados
>    0/ 5 rbd
>    0/ 5 journaler
>    0/ 5 objectcacher
>    0/ 5 client
>    0/ 5 osd
>    0/ 5 optracker
>    0/ 5 objclass
>    1/ 3 filestore
>    1/ 3 keyvaluestore
>    1/ 3 journal
>    0/ 5 ms
>    1/ 5 mon
>    0/10 monc
>    1/ 5 paxos
>    0/ 5 tp
>    1/ 5 auth
>    1/ 5 crypto
>    1/ 1 finisher
>    1/ 5 heartbeatmap
>    1/ 5 perfcounter
>    1/ 5 rgw
>    1/ 5 javaclient
>    1/ 5 asok
>    1/ 1 throttle
>   -2/-2 (syslog threshold)
>   -1/-1 (stderr threshold)
>   max_recent     10000
>   max_new         1000
>   log_file /var/log/ceph/ceph-osd.0.log
> --- end dump of recent events ---
>
> After several attempts at it, osd.2 (which was on OSD1, the machine that
> survived the power event) never comes up.  Looks like the journal was
> corrupted:
>
>     -1> 2014-07-09 20:44:14.992840 7f12256b67c0 -1 journal Unable to read
> past sequence 2157634 but header indicates the journal has committed up
> through 2157670, journal is corrupt
>      0> 2014-07-09 20:44:14.998742 7f12256b67c0 -1 os/FileJournal.cc: In
> function 'bool FileJournal::read_entry(ceph::bufferlist&, uint64_t&,
> bool*)' thread 7f12256b67c0 time 2014-07-09 20:44:14.993082
> os/FileJournal.cc: 1677: FAILED assert(0)
>
>  ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
>  1: (FileJournal::read_entry(ceph::buffer::list&, unsigned long&,
> bool*)+0x467) [0xa8d497]
>  2: (JournalingObjectStore::journal_replay(unsigned long)+0x22e) [0x9dfebe]
>  3: (FileStore::mount()+0x32c9) [0x9b7939]
>  4: (OSD::do_convertfs(ObjectStore*)+0x1a) [0x78d8fa]
>  5: (main()+0x2237) [0x730837]
>  6: (__libc_start_main()+0xf5) [0x7f12236bdb45]
>  7: /usr/bin/ceph-osd() [0x734479]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
> to interpret this.
>
> --- logging levels ---
>    0/ 5 none
>    0/ 1 lockdep
>    0/ 1 context
>    1/ 1 crush
>    1/ 5 mds
>    1/ 5 mds_balancer
>    1/ 5 mds_locker
>    1/ 5 mds_log
>    1/ 5 mds_log_expire
>    1/ 5 mds_migrator
>    0/ 1 buffer
>    0/ 1 timer
>    0/ 1 filer
>    0/ 1 striper
>    0/ 1 objecter
>    0/ 5 rados
>    0/ 5 rbd
>    0/ 5 journaler
>    0/ 5 objectcacher
>    0/ 5 client
>    0/ 5 osd
>    0/ 5 optracker
>    0/ 5 objclass
>    1/ 3 filestore
>    1/ 3 keyvaluestore
>    1/ 3 journal
>    0/ 5 ms
>    1/ 5 mon
>    0/10 monc
>    1/ 5 paxos
>    0/ 5 tp
>    1/ 5 auth
>    1/ 5 crypto
>    1/ 1 finisher
>    1/ 5 heartbeatmap
>    1/ 5 perfcounter
>    1/ 5 rgw
>    1/ 5 javaclient
>    1/ 5 asok
>    1/ 1 throttle
>   -2/-2 (syslog threshold)
>   -1/-1 (stderr threshold)
>   max_recent     10000
>   max_new         1000
>   log_file /var/log/ceph/ceph-osd.2.log
> --- end dump of recent events ---
> 2014-07-09 20:44:15.010090 7f12256b67c0 -1 *** Caught signal (Aborted) **
>  in thread 7f12256b67c0
> ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
>  1: /usr/bin/ceph-osd() [0xaac562]
>  2: (()+0xf880) [0x7f1224e48880]
>  3: (gsignal()+0x39) [0x7f12236d13a9]
>  4: (abort()+0x148) [0x7f12236d44c8]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f1223fbe5e5]
>  6: (()+0x5e746) [0x7f1223fbc746]
>  7: (()+0x5e773) [0x7f1223fbc773]
>  8: (()+0x5e9b2) [0x7f1223fbc9b2]
>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x40a) [0xb85b6a]
>  10: (FileJournal::read_entry(ceph::buffer::list&, unsigned long&,
> bool*)+0x467) [0xa8d497]
>  11: (JournalingObjectStore::journal_replay(unsigned long)+0x22e)
> [0x9dfebe]
>  12: (FileStore::mount()+0x32c9) [0x9b7939]
>  13: (OSD::do_convertfs(ObjectStore*)+0x1a) [0x78d8fa]
>  14: (main()+0x2237) [0x730837]
>  15: (__libc_start_main()+0xf5) [0x7f12236bdb45]
>   16: /usr/bin/ceph-osd() [0x734479]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
> to interpret this.
>
> --- begin dump of recent events ---
>      0> 2014-07-09 20:44:15.010090 7f12256b67c0 -1 *** Caught signal
> (Aborted) **
>  in thread 7f12256b67c0
>
>  ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
>  1: /usr/bin/ceph-osd() [0xaac562]
>  2: (()+0xf880) [0x7f1224e48880]
>  3: (gsignal()+0x39) [0x7f12236d13a9]
>  4: (abort()+0x148) [0x7f12236d44c8]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f1223fbe5e5]
>  6: (()+0x5e746) [0x7f1223fbc746]
>  7: (()+0x5e773) [0x7f1223fbc773]
>  8: (()+0x5e9b2) [0x7f1223fbc9b2]
>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x40a) [0xb85b6a]
>  10: (FileJournal::read_entry(ceph::buffer::list&, unsigned long&,
> bool*)+0x467) [0xa8d497]
>  11: (JournalingObjectStore::journal_replay(unsigned long)+0x22e)
> [0x9dfebe]
>  12: (FileStore::mount()+0x32c9) [0x9b7939]
>  13: (OSD::do_convertfs(ObjectStore*)+0x1a) [0x78d8fa]
>  14: (main()+0x2237) [0x730837]
>  15: (__libc_start_main()+0xf5) [0x7f12236bdb45]
>  16: /usr/bin/ceph-osd() [0x734479]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
> to interpret this.
>
>  --- logging levels ---
>    0/ 5 none
>    0/ 1 lockdep
>    0/ 1 context
>    1/ 1 crush
>    1/ 5 mds
>    1/ 5 mds_balancer
>    1/ 5 mds_locker
>    1/ 5 mds_log
>    1/ 5 mds_log_expire
>    1/ 5 mds_migrator
>    0/ 1 buffer
>    0/ 1 timer
>    0/ 1 filer
>    0/ 1 striper
>    0/ 1 objecter
>    0/ 5 rados
>    0/ 5 rbd
>    0/ 5 journaler
>    0/ 5 objectcacher
>    0/ 5 client
>    0/ 5 osd
>    0/ 5 optracker
>    0/ 5 objclass
>    1/ 3 filestore
>    1/ 3 keyvaluestore
>    1/ 3 journal
>    0/ 5 ms
>    1/ 5 mon
>    0/10 monc
>    1/ 5 paxos
>    0/ 5 tp
>    1/ 5 auth
>    1/ 5 crypto
>    1/ 1 finisher
>    1/ 5 heartbeatmap
>    1/ 5 perfcounter
>    1/ 5 rgw
>    1/ 5 javaclient
>    1/ 5 asok
>    1/ 1 throttle
>   -2/-2 (syslog threshold)
>   -1/-1 (stderr threshold)
>   max_recent     10000
>   max_new         1000
>   log_file /var/log/ceph/ceph-osd.2.log
> --- end dump of recent events ---
>
>
> So I thought maybe upgrading to 0.82 would give it a better shot at fixing
> things... so I did.  Now not only do those OSDs fail (osd.1 is up, but using
> only 14M of memory... I assume that's broken too), but MDS fails too.
>
> # /usr/bin/ceph-mds -i MDS1 --pid-file /var/run/ceph/mds.MDS1.pid -c
> /etc/ceph/ceph.conf --cluster ceph -f --debug-mds=20 --debug-journaler=10
> starting mds.MDS1 at :/0
> mds/MDLog.cc: In function 'void MDLog::_replay_thread()' thread
> 7f8e07c21700 time 2014-07-09 21:01:10.190965
> mds/MDLog.cc: 815: FAILED assert(journaler->is_readable())
>  ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
>  1: (MDLog::_replay_thread()+0x197b) [0x85a3cb]
>  2: (MDLog::ReplayThread::entry()+0xd) [0x66466d]
>  3: (()+0x8062) [0x7f8e0fe3f062]
>  4: (clone()+0x6d) [0x7f8e0ebd3a3d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
> to interpret this.
> 2014-07-09 21:01:10.192936 7f8e07c21700 -1 mds/MDLog.cc: In function 'void
> MDLog::_replay_thread()' thread 7f8e07c21700 time 2014-07-09 21:01:10.190965
> mds/MDLog.cc: 815: FAILED assert(journaler->is_readable())
>
>  ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
>  1: (MDLog::_replay_thread()+0x197b) [0x85a3cb]
>  2: (MDLog::ReplayThread::entry()+0xd) [0x66466d]
>  3: (()+0x8062) [0x7f8e0fe3f062]
>  4: (clone()+0x6d) [0x7f8e0ebd3a3d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
> to interpret this.
>
>      0> 2014-07-09 21:01:10.192936 7f8e07c21700 -1 mds/MDLog.cc: In
> function 'void MDLog::_replay_thread()' thread 7f8e07c21700 time 2014-07-09
> 21:01:10.190965
> mds/MDLog.cc: 815: FAILED assert(journaler->is_readable())
>
>  ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
>  1: (MDLog::_replay_thread()+0x197b) [0x85a3cb]
>  2: (MDLog::ReplayThread::entry()+0xd) [0x66466d]
>  3: (()+0x8062) [0x7f8e0fe3f062]
>  4: (clone()+0x6d) [0x7f8e0ebd3a3d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
> to interpret this.
>
> terminate called after throwing an instance of 'ceph::FailedAssertion'
> *** Caught signal (Aborted) **
>  in thread 7f8e07c21700
>  ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
>  1: /usr/bin/ceph-mds() [0x8d81f2]
>  2: (()+0xf880) [0x7f8e0fe46880]
>  3: (gsignal()+0x39) [0x7f8e0eb233a9]
>  4: (abort()+0x148) [0x7f8e0eb264c8]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f8e0f4105e5]
>  6: (()+0x5e746) [0x7f8e0f40e746]
>  7: (()+0x5e773) [0x7f8e0f40e773]
>  8: (()+0x5e9b2) [0x7f8e0f40e9b2]
>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x40a) [0x9ab5da]
>  10: (MDLog::_replay_thread()+0x197b) [0x85a3cb]
>  11: (MDLog::ReplayThread::entry()+0xd) [0x66466d]
>  12: (()+0x8062) [0x7f8e0fe3f062]
>  13: (clone()+0x6d) [0x7f8e0ebd3a3d]
> 2014-07-09 21:01:10.201968 7f8e07c21700 -1 *** Caught signal (Aborted) **
>  in thread 7f8e07c21700
>
>   ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
>  1: /usr/bin/ceph-mds() [0x8d81f2]
>  2: (()+0xf880) [0x7f8e0fe46880]
>  3: (gsignal()+0x39) [0x7f8e0eb233a9]
>  4: (abort()+0x148) [0x7f8e0eb264c8]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f8e0f4105e5]
>  6: (()+0x5e746) [0x7f8e0f40e746]
>  7: (()+0x5e773) [0x7f8e0f40e773]
>  8: (()+0x5e9b2) [0x7f8e0f40e9b2]
>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x40a) [0x9ab5da]
>  10: (MDLog::_replay_thread()+0x197b) [0x85a3cb]
>  11: (MDLog::ReplayThread::entry()+0xd) [0x66466d]
>  12: (()+0x8062) [0x7f8e0fe3f062]
>  13: (clone()+0x6d) [0x7f8e0ebd3a3d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
> to interpret this.
>
>      0> 2014-07-09 21:01:10.201968 7f8e07c21700 -1 *** Caught signal
> (Aborted) **
>  in thread 7f8e07c21700
>
>  ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
>  1: /usr/bin/ceph-mds() [0x8d81f2]
>  2: (()+0xf880) [0x7f8e0fe46880]
>  3: (gsignal()+0x39) [0x7f8e0eb233a9]
>  4: (abort()+0x148) [0x7f8e0eb264c8]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f8e0f4105e5]
>  6: (()+0x5e746) [0x7f8e0f40e746]
>  7: (()+0x5e773) [0x7f8e0f40e773]
>  8: (()+0x5e9b2) [0x7f8e0f40e9b2]
>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x40a) [0x9ab5da]
>  10: (MDLog::_replay_thread()+0x197b) [0x85a3cb]
>  11: (MDLog::ReplayThread::entry()+0xd) [0x66466d]
>  12: (()+0x8062) [0x7f8e0fe3f062]
>  13: (clone()+0x6d) [0x7f8e0ebd3a3d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
> to interpret this.
>
> Aborted
> root at MDS1:/var/log/ceph# /usr/bin/ceph-mds -i MDS1 --pid-file
> /var/run/ceph/mds.MDS1.pid -c /etc/ceph/ceph.conf --cluster ceph -f
> --debug-mds=20 --debug-journaler=10
> starting mds.MDS1 at :/0
> mds/MDLog.cc: In function 'void MDLog::_replay_thread()' thread
> 7fb7f7b83700 time 2014-07-09 23:21:43.383304
> mds/MDLog.cc: 815: FAILED assert(journaler->is_readable())
>  ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
>  1: (MDLog::_replay_thread()+0x197b) [0x85a3cb]
>  2: (MDLog::ReplayThread::entry()+0xd) [0x66466d]
>  3: (()+0x8062) [0x7fb7ffda1062]
>  4: (clone()+0x6d) [0x7fb7feb35a3d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
> to interpret this.
> 2014-07-09 23:21:43.385274 7fb7f7b83700 -1 mds/MDLog.cc: In function 'void
> MDLog::_replay_thread()' thread 7fb7f7b83700 time 2014-07-09
> 23:21:43.383304
> mds/MDLog.cc: 815: FAILED assert(journaler->is_readable())
>
>  ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
>  1: (MDLog::_replay_thread()+0x197b) [0x85a3cb]
>  2: (MDLog::ReplayThread::entry()+0xd) [0x66466d]
>  3: (()+0x8062) [0x7fb7ffda1062]
>  4: (clone()+0x6d) [0x7fb7feb35a3d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
> to interpret this.
>
>      0> 2014-07-09 23:21:43.385274 7fb7f7b83700 -1 mds/MDLog.cc: In
> function 'void MDLog::_replay_thread()' thread 7fb7f7b83700 time 2014-07-09
> 23:21:43.383304
> mds/MDLog.cc: 815: FAILED assert(journaler->is_readable())
>
>  ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
>  1: (MDLog::_replay_thread()+0x197b) [0x85a3cb]
>  2: (MDLog::ReplayThread::entry()+0xd) [0x66466d]
>  3: (()+0x8062) [0x7fb7ffda1062]
>  4: (clone()+0x6d) [0x7fb7feb35a3d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
> to interpret this.
>
> terminate called after throwing an instance of 'ceph::FailedAssertion'
> *** Caught signal (Aborted) **
>  in thread 7fb7f7b83700
>  ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
>  1: /usr/bin/ceph-mds() [0x8d81f2]
>  2: (()+0xf880) [0x7fb7ffda8880]
>  3: (gsignal()+0x39) [0x7fb7fea853a9]
>  4: (abort()+0x148) [0x7fb7fea884c8]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fb7ff3725e5]
>  6: (()+0x5e746) [0x7fb7ff370746]
>  7: (()+0x5e773) [0x7fb7ff370773]
>  8: (()+0x5e9b2) [0x7fb7ff3709b2]
>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x40a) [0x9ab5da]
>  10: (MDLog::_replay_thread()+0x197b) [0x85a3cb]
>  11: (MDLog::ReplayThread::entry()+0xd) [0x66466d]
>  12: (()+0x8062) [0x7fb7ffda1062]
>  13: (clone()+0x6d) [0x7fb7feb35a3d]
> 2014-07-09 23:21:43.394324 7fb7f7b83700 -1 *** Caught signal (Aborted) **
>  in thread 7fb7f7b83700
>
>  ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
>  1: /usr/bin/ceph-mds() [0x8d81f2]
>  2: (()+0xf880) [0x7fb7ffda8880]
>  3: (gsignal()+0x39) [0x7fb7fea853a9]
>  4: (abort()+0x148) [0x7fb7fea884c8]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fb7ff3725e5]
>  6: (()+0x5e746) [0x7fb7ff370746]
>  7: (()+0x5e773) [0x7fb7ff370773]
>  8: (()+0x5e9b2) [0x7fb7ff3709b2]
>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x40a) [0x9ab5da]
>  10: (MDLog::_replay_thread()+0x197b) [0x85a3cb]
>  11: (MDLog::ReplayThread::entry()+0xd) [0x66466d]
>  12: (()+0x8062) [0x7fb7ffda1062]
>  13: (clone()+0x6d) [0x7fb7feb35a3d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
> to interpret this.
>
>      0> 2014-07-09 23:21:43.394324 7fb7f7b83700 -1 *** Caught signal
> (Aborted) **
>  in thread 7fb7f7b83700
>
>  ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
>  1: /usr/bin/ceph-mds() [0x8d81f2]
>  2: (()+0xf880) [0x7fb7ffda8880]
>  3: (gsignal()+0x39) [0x7fb7fea853a9]
>  4: (abort()+0x148) [0x7fb7fea884c8]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fb7ff3725e5]
>  6: (()+0x5e746) [0x7fb7ff370746]
>  7: (()+0x5e773) [0x7fb7ff370773]
>  8: (()+0x5e9b2) [0x7fb7ff3709b2]
>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x40a) [0x9ab5da]
>  10: (MDLog::_replay_thread()+0x197b) [0x85a3cb]
>  11: (MDLog::ReplayThread::entry()+0xd) [0x66466d]
>  12: (()+0x8062) [0x7fb7ffda1062]
>  13: (clone()+0x6d) [0x7fb7feb35a3d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
> to interpret this.
>
> Aborted
>
> Felt like OSD1 was trashed, so I removed osd.0, osd.1, and osd.2.
>
> Still seeing below, and can't get MDS up.
>
> HEALTH_ERR 154 pgs degraded; 38 pgs inconsistent; 192 pgs stuck unclean;
> recovery 1374024/3513098 objects degraded (39.111%); 1374 scrub errors; mds
> cluster is degraded; mds MDS1 is laggy
>
> Is there something I can try to bring this file system up again? =P  I
> would like to access some of that data again.  Let me know if you need any
> additional info.  I was running Debian kernel 3.13.1 at first, then
> 3.14.1 when I upgraded ceph to 0.82.
>
> Regards,
> Hong
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>