That I can't help you with. I'm a pure RadosGW user. But OSD stability affects everybody. :-P

On Fri, Jul 18, 2014 at 2:34 PM, hjcho616 <hjcho616 at yahoo.com> wrote:

> Thanks Craig. I will try this soon. BTW should I upgrade to 0.80.4
> first? The MDS journal issue seems to be one of the issues I am running
> into.
>
> Regards,
> Hong
>
>
> On Friday, July 18, 2014 4:14 PM, Craig Lewis <clewis at centraldesktop.com>
> wrote:
>
>
> If osd.3, osd.4, and osd.5 are stable, your cluster should be working
> again. What does ceph status say?
>
>
> I was able to re-add a removed osd. Here's what I did on my dev cluster:
>
> stop ceph-osd id=0
> ceph osd down 0
> ceph osd out 0
> ceph osd rm 0
> ceph osd crush rm osd.0
>
> Now my osd tree and osd dump do not show osd.0. The cluster was degraded,
> but did not do any backfilling because I require 3x replication on 3
> different hosts, and Ceph can't satisfy that with 2 osds.
>
> On the same host, I ran:
>
> ceph osd create   # Returned ID 0
> start ceph-osd id=0
>
> osd.0 started up and joined the cluster. Once peering completed, all of
> the PGs recovered quickly. I didn't have any writes on the cluster while
> I was doing this.
>
> So it looks like you can just re-create and start those deleted osds.
>
> In your situation, I would do the following. Before you start, go through
> this and make sure you understand all the steps. Worst case, you can
> always undo this by removing the osds again, and you'll be back to where
> you are now.
>
> ceph osd set nobackfill
> ceph osd set norecover
> ceph osd set noin
> ceph osd create   # Should return 0. Abort if it doesn't.
> ceph osd create   # Should return 1. Abort if it doesn't.
> ceph osd create   # Should return 2. Abort if it doesn't.
> start ceph-osd id=0
>
> Watch ceph -w and top. Hopefully ceph-osd id=0 will use some CPU, then
> go UP, and drop to 0% CPU. If so:
>
> ceph osd unset noin
> restart ceph-osd id=0
>
> Now osd.0 should go UP and IN, use some CPU for a while, then drop to 0%
> CPU. If osd.0 drops out now, set noout and shut it down.
>
> Set noin again, and start osd.1. When it's stable, do it again for osd.2.
>
> Once as many as possible are up and stable:
>
> ceph osd unset nobackfill
> ceph osd unset norecover
>
> Now it should start recovering. If your osds start dropping out now, set
> noout and shut down the ones that are having problems.
>
> The goal is to get all the stable osds up, in, and recovered. Once that's
> done, we can figure out what to do with the unstable osds.
>
>
> On Thu, Jul 17, 2014 at 9:29 PM, hjcho616 <hjcho616 at yahoo.com> wrote:
>
> Sorry Craig. I thought I sent both, but the second part didn't copy right.
> For some reason the MDS and MON decided to stop overnight, so I started
> them while I was running those commands. Interestingly, the MDS didn't
> fail at the time like it used to, so I thought something was being fixed?
> I now realize the MDS probably couldn't get to the data because the OSDs
> were down. Now that I brought up the OSDs, the MDS crashed again.
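Condensed, the phased re-add that Craig outlines above looks roughly like the following. This is a sketch only: it assumes Upstart-managed osds as in his commands, that osd ids 0-2 come back from ceph osd create in order, and that the recovery flag is spelled norecover on your release; adapt the service commands to your init system before running anything.

    ceph osd set nobackfill
    ceph osd set norecover
    ceph osd set noin
    ceph osd create            # expect id 0; stop if you get anything else
    ceph osd create            # expect id 1
    ceph osd create            # expect id 2
    start ceph-osd id=0        # watch "ceph -w" and top while it peers
    ceph osd unset noin        # once osd.0 is up and its CPU settles
    restart ceph-osd id=0
    # set noin again, then repeat the start/watch/unset cycle for id=1 and id=2
    ceph osd unset nobackfill  # only after every stable osd is up and in
    ceph osd unset norecover

The point of the flags is to keep the cluster from starting recovery or marking the fresh osds in until each one has proven it can stay up.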
=P > > $ ceph osd tree > # id weight type name up/down reweight > -1 5.46 root default > -2 0 host OSD1 > -3 5.46 host OSD2 > 3 1.82 osd.3 up 1 > 4 1.82 osd.4 up 1 > 5 1.82 osd.5 up 1 > > $ ceph osd dump > epoch 3125 > fsid 9b2c9bca-112e-48b0-86fc-587ef9a52948 > created 2014-02-08 01:57:34.086532 > modified 2014-07-17 23:24:10.823596 > flags > pool 0 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash > rjenkins pg_num 64 pgp_num 64 last_change 1 crash_replay_interval 45 > stripe_width 0 > pool 1 'metadata' replicated size 2 min_size 1 crush_ruleset 1 object_hash > rjenkins pg_num 64 pgp_num 64 last_change 1 stripe_width 0 > pool 2 'rbd' replicated size 2 min_size 1 crush_ruleset 2 object_hash > rjenkins pg_num 64 pgp_num 64 last_change 1 stripe_width 0 > max_osd 6 > osd.3 up in weight 1 up_from 3120 up_thru 3122 down_at 3116 > last_clean_interval [2858,3113) 192.168.1.31:6803/13623 > 192.168.2.31:6802/13623 192.168.2.31:6803/13623 192.168.1.31:6804/13623 > exists,up 4f86a418-6c67-4cb4-83a1-6c123c890036 > osd.4 up in weight 1 up_from 3121 up_thru 3122 down_at 3116 > last_clean_interval [2859,3113) 192.168.1.31:6806/13991 > 192.168.2.31:6804/13991 192.168.2.31:6805/13991 192.168.1.31:6807/13991 > exists,up 3d5e3843-7a47-44b0-b276-61c4b1d62900 > osd.5 up in weight 1 up_from 3118 up_thru 3118 down_at 3116 > last_clean_interval [2856,3113) 192.168.1.31:6800/13249 > 192.168.2.31:6800/13249 192.168.2.31:6801/13249 192.168.1.31:6801/13249 > exists,up eec86483-2f35-48a4-a154-2eaf26be06b9 > pg_temp 0.2 [4,3] > pg_temp 0.a [4,5] > pg_temp 0.c [3,4] > pg_temp 0.10 [3,4] > pg_temp 0.15 [3,5] > pg_temp 0.17 [3,5] > pg_temp 0.2f [4,5] > pg_temp 0.3b [4,3] > pg_temp 0.3c [3,5] > pg_temp 0.3d [4,5] > pg_temp 1.1 [4,3] > pg_temp 1.9 [4,5] > pg_temp 1.b [3,4] > pg_temp 1.14 [3,5] > pg_temp 1.16 [3,5] > pg_temp 1.2e [4,5] > pg_temp 1.3a [4,3] > pg_temp 1.3b [3,5] > pg_temp 1.3c [4,5] > pg_temp 2.0 [4,3] > pg_temp 2.8 [4,5] > pg_temp 2.a [3,4] > pg_temp 2.13 [3,5] > pg_temp 2.15 [3,5] > pg_temp 2.2d [4,5] > pg_temp 2.39 [4,3] > pg_temp 2.3a [3,5] > pg_temp 2.3b [4,5] > blacklist 192.168.1.20:6802/30894 expires 2014-07-17 23:48:10.823576 > blacklist 192.168.1.20:6801/30651 expires 2014-07-17 23:47:55.562984 > > Regards, > Hong > > > On Thursday, July 17, 2014 3:30 PM, Craig Lewis < > clewis at centraldesktop.com> wrote: > > > You gave me 'ceph osd dump' twice.... can I see 'ceph osd tree' too? > > Why are osd.3, osd.4, and osd.5 down? > > > On Thu, Jul 17, 2014 at 11:45 AM, hjcho616 <hjcho616 at yahoo.com> wrote: > > Thank you for looking at this. Below are the outputs you requested. 
> > # ceph osd dump > epoch 3117 > fsid 9b2c9bca-112e-48b0-86fc-587ef9a52948 > created 2014-02-08 01:57:34.086532 > modified 2014-07-16 22:13:04.385914 > flags > pool 0 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash > rjenkins pg_num 64 pgp_num 64 last_change 1 crash_replay_interval 45 > stripe_width 0 > pool 1 'metadata' replicated size 2 min_size 1 crush_ruleset 1 object_hash > rjenkins pg_num 64 pgp_num 64 last_change 1 stripe_width 0 > pool 2 'rbd' replicated size 2 min_size 1 crush_ruleset 2 object_hash > rjenkins pg_num 64 pgp_num 64 last_change 1 stripe_width 0 > max_osd 6 > osd.3 down in weight 1 up_from 2858 up_thru 3040 down_at 3116 > last_clean_interval [2830,2851) 192.168.1.31:6803/5127 > 192.168.2.31:6805/5127 192.168.2.31:6806/5127 192.168.1.31:6805/5127 > exists 4f86a418-6c67-4cb4-83a1-6c123c890036 > osd.4 down in weight 1 up_from 2859 up_thru 3043 down_at 3116 > last_clean_interval [2835,2849) 192.168.1.31:6807/5310 > 192.168.2.31:6807/5310 192.168.2.31:6808/5310 192.168.1.31:6808/5310 > exists 3d5e3843-7a47-44b0-b276-61c4b1d62900 > osd.5 down in weight 1 up_from 2856 up_thru 3042 down_at 3116 > last_clean_interval [2837,2853) 192.168.1.31:6800/4969 > 192.168.2.31:6801/4969 192.168.2.31:6804/4969 192.168.1.31:6801/4969 > exists eec86483-2f35-48a4-a154-2eaf26be06b9 > > # ceph osd dump > epoch 3117 > fsid 9b2c9bca-112e-48b0-86fc-587ef9a52948 > created 2014-02-08 01:57:34.086532 > modified 2014-07-16 22:13:04.385914 > flags > pool 0 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash > rjenkins pg_num 64 pgp_num 64 last_change 1 crash_replay_interval 45 > stripe_width 0 > pool 1 'metadata' replicated size 2 min_size 1 crush_ruleset 1 object_hash > rjenkins pg_num 64 pgp_num 64 last_change 1 stripe_width 0 > pool 2 'rbd' replicated size 2 min_size 1 crush_ruleset 2 object_hash > rjenkins pg_num 64 pgp_num 64 last_change 1 stripe_width 0 > max_osd 6 > osd.3 down in weight 1 up_from 2858 up_thru 3040 down_at 3116 > last_clean_interval [2830,2851) 192.168.1.31:6803/5127 > 192.168.2.31:6805/5127 192.168.2.31:6806/5127 192.168.1.31:6805/5127 > exists 4f86a418-6c67-4cb4-83a1-6c123c890036 > osd.4 down in weight 1 up_from 2859 up_thru 3043 down_at 3116 > last_clean_interval [2835,2849) 192.168.1.31:6807/5310 > 192.168.2.31:6807/5310 192.168.2.31:6808/5310 192.168.1.31:6808/5310 > exists 3d5e3843-7a47-44b0-b276-61c4b1d62900 > osd.5 down in weight 1 up_from 2856 up_thru 3042 down_at 3116 > last_clean_interval [2837,2853) 192.168.1.31:6800/4969 > 192.168.2.31:6801/4969 192.168.2.31:6804/4969 192.168.1.31:6801/4969 > exists eec86483-2f35-48a4-a154-2eaf26be06b9 > > Regards, > Hong > > > > On Thursday, July 17, 2014 12:02 PM, Craig Lewis < > clewis at centraldesktop.com> wrote: > > > I don't believe you can re-add an OSD after `ceph osd rm`, but it's worth > a shot. Let me see what I can do on my dev cluster. > > What does `ceph osd dump` and `ceph osd tree` say? I want to make sure > I'm starting from the same point you are. > > > > On Wed, Jul 16, 2014 at 7:39 PM, hjcho616 <hjcho616 at yahoo.com> wrote: > > I did a "ceph osd rm" for all three but I didn't do anything else to it > afterwards. Can this be added back? > > Regards, > Hong > > > On Wednesday, July 16, 2014 6:54 PM, Craig Lewis < > clewis at centraldesktop.com> wrote: > > > For some reason you ended up in my spam folder. That might be why you > didn't get any responses. > > > Have you destroyed osd.0, osd.1, and osd.2? If not, try bringing them up > one a time. 
You might have just one bad disk, which is much better than > 50% of your disks. > > How is the ceph-osd process behaving when it hits the suicide timeout? I > had some problems a while back where the ceph-osd process would startup, > start consuming ~200% CPU for a while, then get stuck using almost exactly > 100% CPU. It would get kicked out of the cluster for being unresponsive, > then suicide. Repeat. If that's happening here, I can suggest some things > to try. > > > > > > On Fri, Jul 11, 2014 at 9:12 PM, hjcho616 <hjcho616 at yahoo.com> wrote: > > I have 2 OSD machines with 3 OSD running on each. One MDS server with 3 > daemons running. Ran cephfs mostly on 0.78. One night we lost power for > split second. MDS1 and OSD2 went down, OSD1 seemed OK, well turns out OSD1 > suffered most. Those two machines rebooted and seemed ok except it had > some inconsistencies. I waited for a while, didn't fix itself. So I > issued 'ceph pg repair pgnum'. It would try some and some OSD would crash. > Tried this for multiple days. Got some PGs fixed... but mostly it would > crash an OSD and stop recovering. dmesg shows something like below. > > > > [ 740.059498] traps: ceph-osd[5279] general protection ip:7f84e75ec75e > sp:7fff00045bc0 error:0 in libtcmalloc.so.4.1.0[7f84e75b3000+4a000] > > and ceph osd log shows something like this. > > -2> 2014-07-09 20:51:01.163571 7fe0f4617700 1 heartbeat_map > is_healthy 'FileStore::op_tp thread 0x7fe0e8e91700' had timed out after 60 > -1> 2014-07-09 20:51:01.163609 7fe0f4617700 1 heartbeat_map > is_healthy 'FileStore::op_tp thread 0x7fe0e8e91700' had suicide timed out > after 180 > 0> 2014-07-09 20:51:01.169542 7fe0f4617700 -1 common/HeartbeatMap.cc: > In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, > const char*, time_t)' thread 7fe0f4617700 time 2014-07-09 20:51:01.163642 > common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout") > > ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e) > 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, > long)+0x2eb) [0xad2cbb] > 2: (ceph::HeartbeatMap::is_healthy()+0xb6) [0xad34c6] > 3: (ceph::HeartbeatMap::check_touch_file()+0x28) [0xad3aa8] > 4: (CephContextServiceThread::entry()+0x13f) [0xb9911f] > 5: (()+0x8062) [0x7fe0f797e062] > 6: (clone()+0x6d) [0x7fe0f62bea3d] > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed > to interpret this. 
> > --- logging levels --- > 0/ 5 none > 0/ 1 lockdep > 0/ 1 context > 1/ 1 crush > 1/ 5 mds > 1/ 5 mds_balancer > 1/ 5 mds_locker > 1/ 5 mds_log > 1/ 5 mds_log_expire > 1/ 5 mds_migrator > 0/ 1 buffer > 0/ 1 timer > 0/ 1 filer > 0/ 1 striper > 0/ 1 objecter > 0/ 5 rados > 0/ 5 rbd > 0/ 5 journaler > 0/ 5 objectcacher > 0/ 5 client > 0/ 5 osd > 0/ 5 optracker > 0/ 5 objclass > 1/ 3 filestore > 1/ 3 keyvaluestore > 1/ 3 journal > 0/ 5 ms > 1/ 5 mon > 0/10 monc > 1/ 5 paxos > 0/ 5 tp > 1/ 5 auth > 1/ 5 crypto > 1/ 1 finisher > 1/ 5 heartbeatmap > 1/ 5 perfcounter > 1/ 5 rgw > 1/ 5 javaclient > 1/ 5 asok > 1/ 1 throttle > -2/-2 (syslog threshold) > -1/-1 (stderr threshold) > max_recent 10000 > max_new 1000 > log_file /var/log/ceph/ceph-osd.0.log > --- end dump of recent events --- > 2014-07-09 20:51:01.534706 7fe0f4617700 -1 *** Caught signal (Aborted) ** > in thread 7fe0f4617700 > > > ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e) > 1: /usr/bin/ceph-osd() [0xaac562] > 2: (()+0xf880) [0x7fe0f7985880] > 3: (gsignal()+0x39) [0x7fe0f620e3a9] > 4: (abort()+0x148) [0x7fe0f62114c8] > 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fe0f6afb5e5] > 6: (()+0x5e746) [0x7fe0f6af9746] > 7: (()+0x5e773) [0x7fe0f6af9773] > 8: (()+0x5e9b2) [0x7fe0f6af99b2] > 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char > const*)+0x40a) [0xb85b6a] > 10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, > long)+0x2eb) [0xad2cbb] > 11: (ceph::HeartbeatMap::is_healthy()+0xb6) [0xad34c6] > 12: (ceph::HeartbeatMap::check_touch_file()+0x28) [0xad3aa8] > 13: (CephContextServiceThread::entry()+0x13f) [0xb9911f] > 14: (()+0x8062) [0x7fe0f797e062] > 15: (clone()+0x6d) [0x7fe0f62bea3d] > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed > to interpret this. > > --- begin dump of recent events --- > 0> 2014-07-09 20:51:01.534706 7fe0f4617700 -1 *** Caught signal > (Aborted) ** > in thread 7fe0f4617700 > > ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e) > 1: /usr/bin/ceph-osd() [0xaac562] > 2: (()+0xf880) [0x7fe0f7985880] > 3: (gsignal()+0x39) [0x7fe0f620e3a9] > 4: (abort()+0x148) [0x7fe0f62114c8] > 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fe0f6afb5e5] > 6: (()+0x5e746) [0x7fe0f6af9746] > 7: (()+0x5e773) [0x7fe0f6af9773] > 8: (()+0x5e9b2) [0x7fe0f6af99b2] > 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char > const*)+0x40a) [0xb85b6a] > 10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, > long)+0x2eb) [0xad2cbb] > 11: (ceph::HeartbeatMap::is_healthy()+0xb6) [0xad34c6] > 12: (ceph::HeartbeatMap::check_touch_file()+0x28) [0xad3aa8] > 13: (CephContextServiceThread::entry()+0x13f) [0xb9911f] > 14: (()+0x8062) [0x7fe0f797e062] > 15: (clone()+0x6d) [0x7fe0f62bea3d] > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed > to interpret this. 
> > --- logging levels --- > 0/ 5 none > 0/ 1 lockdep > 0/ 1 context > 1/ 1 crush > 1/ 5 mds > 1/ 5 mds_balancer > 1/ 5 mds_locker > 1/ 5 mds_log > 1/ 5 mds_log_expire > 1/ 5 mds_migrator > 0/ 1 buffer > 0/ 1 timer > 0/ 1 filer > 0/ 1 striper > 0/ 1 objecter > 0/ 5 rados > 0/ 5 rbd > 0/ 5 journaler > 0/ 5 objectcacher > 0/ 5 client > 0/ 5 osd > 0/ 5 optracker > 0/ 5 objclass > 1/ 3 filestore > 1/ 3 keyvaluestore > 1/ 3 journal > 0/ 5 ms > 1/ 5 mon > 0/10 monc > 1/ 5 paxos > 0/ 5 tp > 1/ 5 auth > 1/ 5 crypto > 1/ 1 finisher > 1/ 5 heartbeatmap > 1/ 5 perfcounter > 1/ 5 rgw > 1/ 5 javaclient > 1/ 5 asok > 1/ 1 throttle > -2/-2 (syslog threshold) > -1/-1 (stderr threshold) > max_recent 10000 > max_new 1000 > log_file /var/log/ceph/ceph-osd.0.log > --- end dump of recent events --- > > After several attempts at it, osd.2 (which was on OSD1 which survived the > power event) never comes up. Looks like journal was corrupted > > -1> 2014-07-09 20:44:14.992840 7f12256b67c0 -1 journal Unable to read > past sequence 2157634 but header indicates the journal has committed up > through 2157670, journal is corrupt > 0> 2014-07-09 20:44:14.998742 7f12256b67c0 -1 os/FileJournal.cc: In > function 'bool FileJournal::read_entry(ceph::bufferlist&, uint64_t&, > bool*)' thread 7f12256b67c0 time 2014-07-09 20:44:14.993082 > os/FileJournal.cc: 1677: FAILED assert(0) > > ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e) > 1: (FileJournal::read_entry(ceph::buffer::list&, unsigned long&, > bool*)+0x467) [0xa8d497] > 2: (JournalingObjectStore::journal_replay(unsigned long)+0x22e) [0x9dfebe] > 3: (FileStore::mount()+0x32c9) [0x9b7939] > 4: (OSD::do_convertfs(ObjectStore*)+0x1a) [0x78d8fa] > 5: (main()+0x2237) [0x730837] > 6: (__libc_start_main()+0xf5) [0x7f12236bdb45] > 7: /usr/bin/ceph-osd() [0x734479] > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed > to interpret this. 
> > --- logging levels --- > 0/ 5 none > 0/ 1 lockdep > 0/ 1 context > 1/ 1 crush > 1/ 5 mds > 1/ 5 mds_balancer > 1/ 5 mds_locker > 1/ 5 mds_log > 1/ 5 mds_log_expire > 1/ 5 mds_migrator > 0/ 1 buffer > 0/ 1 timer > 0/ 1 filer > 0/ 1 striper > 0/ 1 objecter > 0/ 5 rados > 0/ 5 rbd > 0/ 5 journaler > 0/ 5 objectcacher > 0/ 5 client > 0/ 5 osd > 0/ 5 optracker > 0/ 5 objclass > 1/ 3 filestore > 1/ 3 keyvaluestore > 1/ 3 journal > 0/ 5 ms > 1/ 5 mon > 0/10 monc > 1/ 5 paxos > 0/ 5 tp > 1/ 5 auth > 1/ 5 crypto > 1/ 1 finisher > 1/ 5 heartbeatmap > 1/ 5 perfcounter > 1/ 5 rgw > 1/ 5 javaclient > 1/ 5 asok > 1/ 1 throttle > -2/-2 (syslog threshold) > -1/-1 (stderr threshold) > max_recent 10000 > max_new 1000 > log_file /var/log/ceph/ceph-osd.2.log > --- end dump of recent events --- > 2014-07-09 20:44:15.010090 7f12256b67c0 -1 *** Caught signal (Aborted) ** > in thread 7f12256b67c0 > ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e) > 1: /usr/bin/ceph-osd() [0xaac562] > 2: (()+0xf880) [0x7f1224e48880] > 3: (gsignal()+0x39) [0x7f12236d13a9] > 4: (abort()+0x148) [0x7f12236d44c8] > 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f1223fbe5e5] > 6: (()+0x5e746) [0x7f1223fbc746] > 7: (()+0x5e773) [0x7f1223fbc773] > 8: (()+0x5e9b2) [0x7f1223fbc9b2] > 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char > const*)+0x40a) [0xb85b6a] > 10: (FileJournal::read_entry(ceph::buffer::list&, unsigned long&, > bool*)+0x467) [0xa8d497] > 11: (JournalingObjectStore::journal_replay(unsigned long)+0x22e) > [0x9dfebe] > 12: (FileStore::mount()+0x32c9) [0x9b7939] > 13: (OSD::do_convertfs(ObjectStore*)+0x1a) [0x78d8fa] > 14: (main()+0x2237) [0x730837] > 15: (__libc_start_main()+0xf5) [0x7f12236bdb45] > 16: /usr/bin/ceph-osd() [0x734479] > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed > to interpret this. > > --- begin dump of recent events --- > 0> 2014-07-09 20:44:15.010090 7f12256b67c0 -1 *** Caught signal > (Aborted) ** > in thread 7f12256b67c0 > > ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e) > 1: /usr/bin/ceph-osd() [0xaac562] > 2: (()+0xf880) [0x7f1224e48880] > 3: (gsignal()+0x39) [0x7f12236d13a9] > 4: (abort()+0x148) [0x7f12236d44c8] > 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f1223fbe5e5] > 6: (()+0x5e746) [0x7f1223fbc746] > 7: (()+0x5e773) [0x7f1223fbc773] > 8: (()+0x5e9b2) [0x7f1223fbc9b2] > 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char > const*)+0x40a) [0xb85b6a] > 10: (FileJournal::read_entry(ceph::buffer::list&, unsigned long&, > bool*)+0x467) [0xa8d497] > 11: (JournalingObjectStore::journal_replay(unsigned long)+0x22e) > [0x9dfebe] > 12: (FileStore::mount()+0x32c9) [0x9b7939] > 13: (OSD::do_convertfs(ObjectStore*)+0x1a) [0x78d8fa] > 14: (main()+0x2237) [0x730837] > 15: (__libc_start_main()+0xf5) [0x7f12236bdb45] > 16: /usr/bin/ceph-osd() [0x734479] > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed > to interpret this. 
> > --- logging levels --- > 0/ 5 none > 0/ 1 lockdep > 0/ 1 context > 1/ 1 crush > 1/ 5 mds > 1/ 5 mds_balancer > 1/ 5 mds_locker > 1/ 5 mds_log > 1/ 5 mds_log_expire > 1/ 5 mds_migrator > 0/ 1 buffer > 0/ 1 timer > 0/ 1 filer > 0/ 1 striper > 0/ 1 objecter > 0/ 5 rados > 0/ 5 rbd > 0/ 5 journaler > 0/ 5 objectcacher > 0/ 5 client > 0/ 5 osd > 0/ 5 optracker > 0/ 5 objclass > 1/ 3 filestore > 1/ 3 keyvaluestore > 1/ 3 journal > 0/ 5 ms > 1/ 5 mon > 0/10 monc > 1/ 5 paxos > 0/ 5 tp > 1/ 5 auth > 1/ 5 crypto > 1/ 1 finisher > 1/ 5 heartbeatmap > 1/ 5 perfcounter > 1/ 5 rgw > 1/ 5 javaclient > 1/ 5 asok > 1/ 1 throttle > -2/-2 (syslog threshold) > -1/-1 (stderr threshold) > max_recent 10000 > max_new 1000 > log_file /var/log/ceph/ceph-osd.2.log > --- end dump of recent events --- > > > So I thought maybe upgrading 0.82 would give it a better option at fixing > things... so I did, now not only those OSDs fail (osd.1 is up but with 14M > of memory only... I assume that's broky too), but MDS fails too. > > # /usr/bin/ceph-mds -i MDS1 --pid-file /var/run/ceph/mds.MDS1.pid -c > /etc/ceph/ceph.conf --cluster ceph -f --debug-mds=20 --debug-journaler=10 > starting mds.MDS1 at :/0 > mds/MDLog.cc: In function 'void MDLog::_replay_thread()' thread > 7f8e07c21700 time 2014-07-09 21:01:10.190965 > mds/MDLog.cc: 815: FAILED assert(journaler->is_readable()) > ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e) > 1: (MDLog::_replay_thread()+0x197b) [0x85a3cb] > 2: (MDLog::ReplayThread::entry()+0xd) [0x66466d] > 3: (()+0x8062) [0x7f8e0fe3f062] > 4: (clone()+0x6d) [0x7f8e0ebd3a3d] > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed > to interpret this. > 2014-07-09 21:01:10.192936 7f8e07c21700 -1 mds/MDLog.cc: In function 'void > MDLog::_replay_thread()' thread 7f8e07c21700 time 2014-07-09 21:01:10.190965 > mds/MDLog.cc: 815: FAILED assert(journaler->is_readable()) > > ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e) > 1: (MDLog::_replay_thread()+0x197b) [0x85a3cb] > 2: (MDLog::ReplayThread::entry()+0xd) [0x66466d] > 3: (()+0x8062) [0x7f8e0fe3f062] > 4: (clone()+0x6d) [0x7f8e0ebd3a3d] > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed > to interpret this. > > 0> 2014-07-09 21:01:10.192936 7f8e07c21700 -1 mds/MDLog.cc: In > function 'void MDLog::_replay_thread()' thread 7f8e07c21700 time 2014-07-09 > 21:01:10.190965 > mds/MDLog.cc: 815: FAILED assert(journaler->is_readable()) > > ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e) > 1: (MDLog::_replay_thread()+0x197b) [0x85a3cb] > 2: (MDLog::ReplayThread::entry()+0xd) [0x66466d] > 3: (()+0x8062) [0x7f8e0fe3f062] > 4: (clone()+0x6d) [0x7f8e0ebd3a3d] > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed > to interpret this. 
> > terminate called after throwing an instance of 'ceph::FailedAssertion' > *** Caught signal (Aborted) ** > in thread 7f8e07c21700 > ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e) > 1: /usr/bin/ceph-mds() [0x8d81f2] > 2: (()+0xf880) [0x7f8e0fe46880] > 3: (gsignal()+0x39) [0x7f8e0eb233a9] > 4: (abort()+0x148) [0x7f8e0eb264c8] > 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f8e0f4105e5] > 6: (()+0x5e746) [0x7f8e0f40e746] > 7: (()+0x5e773) [0x7f8e0f40e773] > 8: (()+0x5e9b2) [0x7f8e0f40e9b2] > 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char > const*)+0x40a) [0x9ab5da] > 10: (MDLog::_replay_thread()+0x197b) [0x85a3cb] > 11: (MDLog::ReplayThread::entry()+0xd) [0x66466d] > 12: (()+0x8062) [0x7f8e0fe3f062] > 13: (clone()+0x6d) [0x7f8e0ebd3a3d] > 2014-07-09 21:01:10.201968 7f8e07c21700 -1 *** Caught signal (Aborted) ** > in thread 7f8e07c21700 > > ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e) > 1: /usr/bin/ceph-mds() [0x8d81f2] > 2: (()+0xf880) [0x7f8e0fe46880] > 3: (gsignal()+0x39) [0x7f8e0eb233a9] > 4: (abort()+0x148) [0x7f8e0eb264c8] > 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f8e0f4105e5] > 6: (()+0x5e746) [0x7f8e0f40e746] > 7: (()+0x5e773) [0x7f8e0f40e773] > 8: (()+0x5e9b2) [0x7f8e0f40e9b2] > 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char > const*)+0x40a) [0x9ab5da] > 10: (MDLog::_replay_thread()+0x197b) [0x85a3cb] > 11: (MDLog::ReplayThread::entry()+0xd) [0x66466d] > 12: (()+0x8062) [0x7f8e0fe3f062] > 13: (clone()+0x6d) [0x7f8e0ebd3a3d] > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed > to interpret this. > > 0> 2014-07-09 21:01:10.201968 7f8e07c21700 -1 *** Caught signal > (Aborted) ** > in thread 7f8e07c21700 > > ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e) > 1: /usr/bin/ceph-mds() [0x8d81f2] > 2: (()+0xf880) [0x7f8e0fe46880] > 3: (gsignal()+0x39) [0x7f8e0eb233a9] > 4: (abort()+0x148) [0x7f8e0eb264c8] > 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f8e0f4105e5] > 6: (()+0x5e746) [0x7f8e0f40e746] > 7: (()+0x5e773) [0x7f8e0f40e773] > 8: (()+0x5e9b2) [0x7f8e0f40e9b2] > 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char > const*)+0x40a) [0x9ab5da] > 10: (MDLog::_replay_thread()+0x197b) [0x85a3cb] > 11: (MDLog::ReplayThread::entry()+0xd) [0x66466d] > 12: (()+0x8062) [0x7f8e0fe3f062] > 13: (clone()+0x6d) [0x7f8e0ebd3a3d] > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed > to interpret this. > > Aborted > root at MDS1:/var/log/ceph# /usr/bin/ceph-mds -i MDS1 --pid-file > /var/run/ceph/mds.MDS1.pid -c /etc/ceph/ceph.conf --cluster ceph -f > --debug-mds=20 --debug-journaler=10 > starting mds.MDS1 at :/0 > > > mds/MDLog.cc: In function 'void MDLog::_replay_thread()' thread > 7fb7f7b83700 time 2014-07-09 23:21:43.383304 > > mds/MDLog.cc: 815: FAILED assert(journaler->is_readable()) > > > ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e) > > > 1: (MDLog::_replay_thread()+0x197b) [0x85a3cb] > > > 2: (MDLog::ReplayThread::entry()+0xd) [0x66466d] > > > 3: (()+0x8062) [0x7fb7ffda1062] > > > 4: (clone()+0x6d) [0x7fb7feb35a3d] > > > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed > to interpret this. 
> > 2014-07-09 23:21:43.385274 7fb7f7b83700 -1 mds/MDLog.cc: In function 'void > MDLog::_replay_thread()' thread 7fb7f7b83700 time 2014-07-09 > 23:21:43.383304 > > mds/MDLog.cc: 815: FAILED assert(journaler->is_readable()) > > > > > > ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e) > > > 1: (MDLog::_replay_thread()+0x197b) [0x85a3cb] > > > 2: (MDLog::ReplayThread::entry()+0xd) [0x66466d] > > > 3: (()+0x8062) [0x7fb7ffda1062] > > > 4: (clone()+0x6d) [0x7fb7feb35a3d] > > > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed > to interpret this. > > > > > 0> 2014-07-09 23:21:43.385274 7fb7f7b83700 -1 mds/MDLog.cc: In > function 'void MDLog::_replay_thread()' thread 7fb7f7b83700 time 2014-07-09 > 23:21:43.383304 > mds/MDLog.cc: 815: FAILED assert(journaler->is_readable()) > > > > > > ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e) > > > 1: (MDLog::_replay_thread()+0x197b) [0x85a3cb] > > > 2: (MDLog::ReplayThread::entry()+0xd) [0x66466d] > > > 3: (()+0x8062) [0x7fb7ffda1062] > > > 4: (clone()+0x6d) [0x7fb7feb35a3d] > > > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed > to interpret this. > > > > > terminate called after throwing an instance of 'ceph::FailedAssertion' > > > *** Caught signal (Aborted) ** > in thread 7fb7f7b83700 > ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e) > 1: /usr/bin/ceph-mds() [0x8d81f2] > 2: (()+0xf880) [0x7fb7ffda8880] > 3: (gsignal()+0x39) [0x7fb7fea853a9] > 4: (abort()+0x148) [0x7fb7fea884c8] > 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fb7ff3725e5] > 6: (()+0x5e746) [0x7fb7ff370746] > 7: (()+0x5e773) [0x7fb7ff370773] > 8: (()+0x5e9b2) [0x7fb7ff3709b2] > 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char > const*)+0x40a) [0x9ab5da] > 10: (MDLog::_replay_thread()+0x197b) [0x85a3cb] > 11: (MDLog::ReplayThread::entry()+0xd) [0x66466d] > 12: (()+0x8062) [0x7fb7ffda1062] > 13: (clone()+0x6d) [0x7fb7feb35a3d] > 2014-07-09 23:21:43.394324 7fb7f7b83700 -1 *** Caught signal (Aborted) ** > in thread 7fb7f7b83700 > > ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e) > 1: /usr/bin/ceph-mds() [0x8d81f2] > 2: (()+0xf880) [0x7fb7ffda8880] > 3: (gsignal()+0x39) [0x7fb7fea853a9] > 4: (abort()+0x148) [0x7fb7fea884c8] > 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fb7ff3725e5] > 6: (()+0x5e746) [0x7fb7ff370746] > 7: (()+0x5e773) [0x7fb7ff370773] > 8: (()+0x5e9b2) [0x7fb7ff3709b2] > 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char > const*)+0x40a) [0x9ab5da] > 10: (MDLog::_replay_thread()+0x197b) [0x85a3cb] > 11: (MDLog::ReplayThread::entry()+0xd) [0x66466d] > 12: (()+0x8062) [0x7fb7ffda1062] > 13: (clone()+0x6d) [0x7fb7feb35a3d] > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed > to interpret this. 
> 0> 2014-07-09 23:21:43.394324 7fb7f7b83700 -1 *** Caught signal (Aborted) **
> in thread 7fb7f7b83700
>
> ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
> 1: /usr/bin/ceph-mds() [0x8d81f2]
> 2: (()+0xf880) [0x7fb7ffda8880]
> 3: (gsignal()+0x39) [0x7fb7fea853a9]
> 4: (abort()+0x148) [0x7fb7fea884c8]
> 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fb7ff3725e5]
> 6: (()+0x5e746) [0x7fb7ff370746]
> 7: (()+0x5e773) [0x7fb7ff370773]
> 8: (()+0x5e9b2) [0x7fb7ff3709b2]
> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x40a) [0x9ab5da]
> 10: (MDLog::_replay_thread()+0x197b) [0x85a3cb]
> 11: (MDLog::ReplayThread::entry()+0xd) [0x66466d]
> 12: (()+0x8062) [0x7fb7ffda1062]
> 13: (clone()+0x6d) [0x7fb7feb35a3d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
> to interpret this.
>
> Aborted
>
> Felt like OSD1 was trashed, so I removed osd.0, osd.1, and osd.2.
>
> Still seeing the status below, and can't get the MDS up.
>
> HEALTH_ERR 154 pgs degraded; 38 pgs inconsistent; 192 pgs stuck unclean;
> recovery 1374024/3513098 objects degraded (39.111%); 1374 scrub errors; mds
> cluster is degraded; mds MDS1 is laggy
>
> Is there something I can try to bring this file system up again? =P I
> would like to access some of that data again. Let me know if you need any
> additional info. I was running Debian kernel 3.13.1 for the first part,
> then 3.14.1 when I upgraded ceph to 0.82.
>
> Regards,
> Hong
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
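For the corrupt FileStore journal that osd.2 reported earlier in the thread ("header indicates the journal has committed up through 2157670, journal is corrupt"), a last-resort sketch would be to recreate the journal and let the osd come up without it. This is hedged advice, not something tested against this cluster: recreating the journal discards any transactions that were never replayed, so the osd's data must be treated as suspect and ideally re-backfilled afterwards, and the path below assumes the default /var/lib/ceph/osd/ceph-2 layout with a journal file (adjust if the journal is a separate partition).

    stop ceph-osd id=2
    ceph-osd -i 2 --flush-journal     # expected to fail if the journal really is corrupt
    cp -a /var/lib/ceph/osd/ceph-2/journal /root/osd.2-journal.bak   # keep a copy first
    ceph-osd -i 2 --mkjournal         # write a fresh, empty journal for osd.2
    start ceph-osd id=2

If osd.2's data is too far gone to trust, the safer route is the one discussed above: leave it out, get the remaining osds stable, and let the cluster re-replicate.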