Power Outage

hjcho616@xxxxxxxxx (hjcho616) · Fri, 18 Jul 2014 14:34:13 -0700

Thanks Craig. I will try this soon. BTW, should I upgrade to 0.80.4 first? The MDS journal issue seems to be one of the issues I am running into.

Regards,
Hong

On Friday, July 18, 2014 4:14 PM, Craig Lewis <clewis at centraldesktop.com> wrote:

If osd.3, osd.4, and osd.5 are stable, your cluster should be working again. What does ceph status say?

I was able to re-add a removed osd.
Here's what I did on my dev cluster:
stop ceph-osd id=0
ceph osd down 0

ceph osd out 0

ceph osd rm 0
ceph osd crush rm osd.0

Now my osd tree and osd dump do not show osd.0. The cluster was degraded, but did not do any backfilling because I require 3x replication on 3 different hosts, and Ceph can't satisfy that with 2 osds.

On the same host, I ran:
ceph osd create        # Returned ID 0
start ceph-osd id=0

osd.0 started up and joined the cluster. Once peering completed, all of the PGs recovered quickly. I didn't have any writes on the cluster while I was doing this.

So it looks like you can just re-create and start those deleted osds.

In your situation, I would do the following. Before you start, go through this, and make sure you understand all the steps. Worst case, you can always undo this by removing the osds again, and you'll be back to where you are now.

ceph osd set nobackfill
ceph osd set norecover
ceph osd set noin
ceph osd create   # Should return 0. Abort if it doesn't.
ceph osd create   # Should return 1. Abort if it doesn't.
ceph osd create   # Should return 2. Abort if it doesn't.
start ceph-osd id=0
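
The "abort if it doesn't" check on the create steps can be scripted. What follows is only a minimal sketch: `create_expect_id` and the `CREATE_CMD` override are names made up for illustration, and the one thing assumed about Ceph is that `ceph osd create` prints the newly allocated ID on stdout.

```shell
#!/bin/sh
# Sketch: re-create an OSD ID and stop if the monitor hands back a different one.
# CREATE_CMD is a hypothetical override so the check can be tried without a cluster.
CREATE_CMD="${CREATE_CMD:-ceph osd create}"

create_expect_id() {
    expected="$1"
    # Assumption: the create command prints the new OSD ID on stdout.
    got="$($CREATE_CMD)" || return 1
    if [ "$got" != "$expected" ]; then
        # The unexpected ID has already been allocated at this point;
        # this only stops the script so the mismatch can be looked at by hand.
        echo "expected osd id $expected but got '$got'; aborting" >&2
        return 1
    fi
    echo "created osd.$got"
}

# Usage (commented out so that loading this file changes nothing):
# for id in 0 1 2; do create_expect_id "$id" || exit 1; done
```

Stopping on the first mismatch matters because the later steps assume the old data directories line up with IDs 0, 1 and 2.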

Watch ceph -w and top. Hopefully ceph-osd id=0 will use some CPU, then go UP, and drop to 0% CPU. If so,
ceph osd unset noin
restart ceph-osd id=0

Now osd.0 should go UP and IN, use some CPU for a while, then drop to 0% CPU. If osd.0 drops out now, set noout, and shut it down.
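
Besides watching ceph -w, the UP and IN state can be read out of `ceph osd dump`. This is a minimal sketch that assumes the dump keeps the line layout shown elsewhere in this thread (`osd.N up|down in|out weight ...`); `osd_state` is a made-up helper name.

```shell
#!/bin/sh
# Sketch: report up/down and in/out for one OSD from `ceph osd dump` text on stdin.
osd_state() {
    # $1 is the OSD name, e.g. osd.3. Fields 2 and 3 of its line hold the two states.
    awk -v name="$1" '$1 == name { print $2, $3; found = 1 }
                      END { if (!found) print "missing" }'
}

# Usage against a live cluster (commented out; needs the ceph CLI):
# ceph osd dump | osd_state osd.0
```

An OSD that was removed with `ceph osd rm` has no line in the dump at all, which is why the helper prints "missing" instead of a state.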

Set noin again, and start osd.1. When it's stable, do it again for osd.2.

Once as many as possible are up and stable:
ceph osd unset nobackfill
ceph osd unset norecover

Now it should start recovering. If your osds start dropping out now, set noout, and shut down the ones that are having problems.

The goal is to get all the stable osds up, in, and recovered. Once that's done, we can figure out what to do with the unstable osds.
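
To review the order of operations before touching the cluster, the staged procedure can be printed as a checklist. This is a sketch only: `recovery_plan` is a made-up helper that echoes commands and runs nothing, the waiting between steps is still done by hand, and the recovery flag is spelled `norecover`, which is the name the Ceph CLI accepts as far as I know.

```shell
#!/bin/sh
# Sketch: print the staged recovery commands in order, without running any of them.
recovery_plan() {
    echo "ceph osd set nobackfill"
    echo "ceph osd set norecover"
    for id in "$@"; do
        echo "ceph osd create           # must return $id, abort otherwise"
    done
    for id in "$@"; do
        echo "ceph osd set noin"
        echo "start ceph-osd id=$id     # wait until it is up and idle"
        echo "ceph osd unset noin"
        echo "restart ceph-osd id=$id   # wait until it is up, in and idle"
    done
    echo "ceph osd unset nobackfill"
    echo "ceph osd unset norecover"
}

# Print the checklist for the three removed OSDs.
recovery_plan 0 1 2
```

Keeping the two unset commands at the very end mirrors the intent above: no data should move until as many osds as possible are up and stable.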

On Thu, Jul 17, 2014 at 9:29 PM, hjcho616 <hjcho616 at yahoo.com> wrote:

Sorry Craig. I thought I sent both, but the second part didn't copy right. For some reason MDS and MON decided to stop overnight, so I started them while I was running those commands. Interestingly, MDS didn't fail at the time like it used to, so I thought something was being fixed. Now I realize MDS probably couldn't get to the data because the OSDs were down. Now that I've brought up the OSDs, MDS crashed again. =P
>
>
>$ ceph osd tree
># id    weight  type name       up/down reweight
>-1      5.46    root default
>-2      0               host OSD1
>-3      5.46            host OSD2
>3       1.82                    osd.3   up      1
>4       1.82                    osd.4   up      1
>5       1.82                    osd.5   up      1
>
>
>$ ceph osd dump
>epoch 3125
>fsid 9b2c9bca-112e-48b0-86fc-587ef9a52948
>created 2014-02-08 01:57:34.086532
>modified 2014-07-17 23:24:10.823596
>flags
>pool 0 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 crash_replay_interval 45 stripe_width 0
>pool 1 'metadata' replicated size 2 min_size 1 crush_ruleset 1 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 stripe_width 0
>pool 2 'rbd' replicated size 2 min_size 1 crush_ruleset 2 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 stripe_width 0
>max_osd 6
>osd.3 up   in  weight 1 up_from 3120 up_thru 3122 down_at 3116 last_clean_interval [2858,3113) 192.168.1.31:6803/13623 192.168.2.31:6802/13623 192.168.2.31:6803/13623 192.168.1.31:6804/13623 exists,up 4f86a418-6c67-4cb4-83a1-6c123c890036
>osd.4 up   in  weight 1 up_from 3121 up_thru 3122 down_at 3116 last_clean_interval [2859,3113) 192.168.1.31:6806/13991 192.168.2.31:6804/13991 192.168.2.31:6805/13991 192.168.1.31:6807/13991 exists,up 3d5e3843-7a47-44b0-b276-61c4b1d62900
>osd.5 up   in  weight 1 up_from 3118 up_thru 3118 down_at 3116 last_clean_interval [2856,3113) 192.168.1.31:6800/13249 192.168.2.31:6800/13249 192.168.2.31:6801/13249 192.168.1.31:6801/13249 exists,up eec86483-2f35-48a4-a154-2eaf26be06b9
>pg_temp 0.2 [4,3]
>pg_temp 0.a [4,5]
>pg_temp 0.c [3,4]
>pg_temp 0.10 [3,4]
>pg_temp 0.15 [3,5]
>pg_temp 0.17 [3,5]
>pg_temp 0.2f [4,5]
>pg_temp 0.3b [4,3]
>pg_temp 0.3c [3,5]
>pg_temp 0.3d [4,5]
>pg_temp 1.1 [4,3]
>pg_temp 1.9 [4,5]
>pg_temp 1.b [3,4]
>pg_temp 1.14 [3,5]
>pg_temp 1.16 [3,5]
>pg_temp 1.2e [4,5]
>pg_temp 1.3a [4,3]
>pg_temp 1.3b [3,5]
>pg_temp 1.3c [4,5]
>pg_temp 2.0 [4,3]
>pg_temp 2.8 [4,5]
>pg_temp 2.a [3,4]
>pg_temp 2.13 [3,5]
>pg_temp 2.15 [3,5]
>pg_temp 2.2d [4,5]
>pg_temp 2.39 [4,3]
>pg_temp 2.3a [3,5]
>pg_temp 2.3b [4,5]
>blacklist 192.168.1.20:6802/30894 expires 2014-07-17 23:48:10.823576
>blacklist 192.168.1.20:6801/30651 expires 2014-07-17 23:47:55.562984
>
>
>Regards,
>Hong
>
>
>
>On Thursday, July 17, 2014 3:30 PM, Craig Lewis <clewis at centraldesktop.com> wrote:
> 
>
>
>You gave me 'ceph osd dump' twice.... can I see 'ceph osd tree' too?
>
>
>Why are osd.3, osd.4, and osd.5 down?
>
>
>
>On Thu, Jul 17, 2014 at 11:45 AM, hjcho616 <hjcho616 at yahoo.com> wrote:
>
>Thank you for looking at this. Below are the outputs you requested.
>>
>>
>># ceph osd dump
>>epoch 3117
>>fsid 9b2c9bca-112e-48b0-86fc-587ef9a52948
>>created 2014-02-08 01:57:34.086532
>>modified 2014-07-16 22:13:04.385914
>>flags
>>pool 0 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 crash_replay_interval 45 stripe_width 0
>>pool 1 'metadata' replicated size 2 min_size 1 crush_ruleset 1 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 stripe_width 0
>>pool 2 'rbd' replicated size 2 min_size 1 crush_ruleset 2 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 stripe_width 0
>>max_osd 6
>>osd.3 down in  weight 1 up_from 2858 up_thru 3040 down_at 3116 last_clean_interval [2830,2851) 192.168.1.31:6803/5127 192.168.2.31:6805/5127 192.168.2.31:6806/5127 192.168.1.31:6805/5127 exists 4f86a418-6c67-4cb4-83a1-6c123c890036
>>osd.4 down in  weight 1 up_from 2859 up_thru 3043 down_at 3116 last_clean_interval [2835,2849) 192.168.1.31:6807/5310 192.168.2.31:6807/5310 192.168.2.31:6808/5310 192.168.1.31:6808/5310 exists 3d5e3843-7a47-44b0-b276-61c4b1d62900
>>osd.5 down in  weight 1 up_from 2856 up_thru 3042 down_at 3116 last_clean_interval [2837,2853) 192.168.1.31:6800/4969 192.168.2.31:6801/4969 192.168.2.31:6804/4969 192.168.1.31:6801/4969 exists eec86483-2f35-48a4-a154-2eaf26be06b9
>>
>>
>># ceph osd dump
>>epoch 3117
>>fsid 9b2c9bca-112e-48b0-86fc-587ef9a52948
>>created 2014-02-08 01:57:34.086532
>>modified 2014-07-16 22:13:04.385914
>>flags
>>pool 0 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 crash_replay_interval 45 stripe_width 0
>>pool 1 'metadata' replicated size 2 min_size 1 crush_ruleset 1 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 stripe_width 0
>>pool 2 'rbd' replicated size 2 min_size 1 crush_ruleset 2 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 stripe_width 0
>>max_osd 6
>>osd.3 down in  weight 1 up_from 2858 up_thru 3040 down_at 3116 last_clean_interval [2830,2851) 192.168.1.31:6803/5127 192.168.2.31:6805/5127 192.168.2.31:6806/5127 192.168.1.31:6805/5127 exists 4f86a418-6c67-4cb4-83a1-6c123c890036
>>osd.4 down in  weight 1 up_from 2859 up_thru 3043 down_at 3116 last_clean_interval [2835,2849) 192.168.1.31:6807/5310 192.168.2.31:6807/5310 192.168.2.31:6808/5310 192.168.1.31:6808/5310 exists 3d5e3843-7a47-44b0-b276-61c4b1d62900
>>osd.5 down in  weight 1 up_from 2856 up_thru 3042 down_at 3116 last_clean_interval [2837,2853) 192.168.1.31:6800/4969 192.168.2.31:6801/4969 192.168.2.31:6804/4969 192.168.1.31:6801/4969 exists eec86483-2f35-48a4-a154-2eaf26be06b9
>>
>>
>>Regards,
>>Hong
>>
>>
>>
>>
>>
>>On Thursday, July 17, 2014 12:02 PM, Craig Lewis <clewis at centraldesktop.com> wrote:
>> 
>>
>>
>>I don't believe you can re-add an OSD after `ceph osd rm`, but it's worth a shot. Let me see what I can do on my dev cluster.
>>
>>
>>What do `ceph osd dump` and `ceph osd tree` say? I want to make sure I'm starting from the same point you are.
>>
>>
>>
>>
>>
>>On Wed, Jul 16, 2014 at 7:39 PM, hjcho616 <hjcho616 at yahoo.com> wrote:
>>
>>I did a "ceph osd rm" for all three, but I didn't do anything else to them afterwards. Can they be added back?
>>>
>>>
>>>Regards,
>>>Hong
>>>
>>>
>>>
>>>On Wednesday, July 16, 2014 6:54 PM, Craig Lewis <clewis at centraldesktop.com> wrote:
>>> 
>>>
>>>
>>>For some reason you ended up in my spam folder. That might be why you didn't get any responses.
>>>
>>>
>>>
>>>
>>>Have you destroyed osd.0, osd.1, and osd.2? If not, try bringing them up one at a time. You might have just one bad disk, which is much better than 50% of your disks.
>>>
>>>
>>>
>>>How is the ceph-osd process behaving when it hits the suicide timeout? I had some problems a while back where the ceph-osd process would start up, consume ~200% CPU for a while, then get stuck using almost exactly 100% CPU. It would get kicked out of the cluster for being unresponsive, then suicide. Repeat. If that's happening here, I can suggest some things to try.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>On Fri, Jul 11, 2014 at 9:12 PM, hjcho616 <hjcho616 at yahoo.com> wrote:
>>>
>>>I have 2 OSD machines with 3 OSDs running on each. One MDS server with 3 daemons running. I ran cephfs mostly on 0.78. One night we lost power for a split second. MDS1 and OSD2 went down. OSD1 seemed OK, but it turns out OSD1 suffered the most. Those two machines rebooted and seemed OK, except that the cluster had some inconsistencies. I waited for a while, but it didn't fix itself, so I issued 'ceph pg repair pgnum'. It would try some, and then some OSD would crash. I tried this for multiple days. I got some PGs fixed... but mostly it would crash an OSD and stop recovering. dmesg shows something like below.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>[  740.059498] traps: ceph-osd[5279] general protection ip:7f84e75ec75e sp:7fff00045bc0 error:0 in libtcmalloc.so.4.1.0[7f84e75b3000+4a000]
>>>>
>>>>
>>>>
>>>>and ceph osd log shows something like this.
>>>>
>>>>
>>>>    -2> 2014-07-09 20:51:01.163571 7fe0f4617700  1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fe0e8e91700' had timed out after 60
>>>>    -1> 2014-07-09 20:51:01.163609 7fe0f4617700  1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fe0e8e91700' had suicide timed out after 180
>>>>     0> 2014-07-09 20:51:01.169542 7fe0f4617700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7fe0f4617700 time 2014-07-09 20:51:01.163642
>>>>common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
>>>>
>>>>
>>>> ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
>>>> 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2eb) [0xad2cbb]
>>>> 2: (ceph::HeartbeatMap::is_healthy()+0xb6) [0xad34c6]
>>>> 3: (ceph::HeartbeatMap::check_touch_file()+0x28) [0xad3aa8]
>>>> 4: (CephContextServiceThread::entry()+0x13f) [0xb9911f]
>>>> 5: (()+0x8062) [0x7fe0f797e062]
>>>> 6: (clone()+0x6d) [0x7fe0f62bea3d]
>>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>>
>>>>
>>>>--- logging levels ---
>>>>   0/ 5 none
>>>>   0/ 1 lockdep
>>>>   0/ 1 context
>>>>   1/ 1 crush
>>>>   1/ 5 mds
>>>>   1/ 5 mds_balancer
>>>>   1/ 5 mds_locker
>>>>   1/ 5 mds_log
>>>>   1/ 5 mds_log_expire
>>>>   1/ 5 mds_migrator
>>>>   0/ 1 buffer
>>>>   0/ 1 timer
>>>>   0/ 1 filer
>>>>   0/ 1 striper
>>>>   0/ 1 objecter
>>>>   0/ 5 rados
>>>>   0/ 5 rbd
>>>>   0/ 5 journaler
>>>>   0/ 5 objectcacher
>>>>   0/ 5 client
>>>>   0/ 5 osd
>>>>   0/ 5 optracker
>>>>   0/ 5 objclass
>>>>   1/ 3 filestore
>>>>   1/ 3 keyvaluestore
>>>>   1/ 3 journal
>>>>   0/ 5 ms
>>>>   1/ 5 mon
>>>>   0/10 monc
>>>>   1/ 5 paxos
>>>>   0/ 5 tp
>>>>   1/ 5 auth
>>>>   1/ 5 crypto
>>>>   1/ 1 finisher
>>>>   1/ 5 heartbeatmap
>>>>   1/ 5 perfcounter
>>>>   1/ 5 rgw
>>>>   1/ 5 javaclient
>>>>   1/ 5 asok
>>>>   1/ 1 throttle
>>>>  -2/-2 (syslog threshold)
>>>>  -1/-1 (stderr threshold)
>>>>  max_recent     10000
>>>>  max_new         1000
>>>>  log_file /var/log/ceph/ceph-osd.0.log
>>>>--- end dump of recent events ---
>>>>2014-07-09 20:51:01.534706 7fe0f4617700 -1 *** Caught signal (Aborted) **
>>>> in thread 7fe0f4617700
>>>>
>>>>
>>>>
>>>>
>>>> ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
>>>> 1: /usr/bin/ceph-osd() [0xaac562]
>>>> 2: (()+0xf880) [0x7fe0f7985880]
>>>> 3: (gsignal()+0x39) [0x7fe0f620e3a9]
>>>> 4: (abort()+0x148) [0x7fe0f62114c8]
>>>> 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fe0f6afb5e5]
>>>> 6: (()+0x5e746) [0x7fe0f6af9746]
>>>> 7: (()+0x5e773) [0x7fe0f6af9773]
>>>> 8: (()+0x5e9b2) [0x7fe0f6af99b2]
>>>> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x40a) [0xb85b6a]
>>>> 10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2eb) [0xad2cbb]
>>>> 11: (ceph::HeartbeatMap::is_healthy()+0xb6) [0xad34c6]
>>>> 12: (ceph::HeartbeatMap::check_touch_file()+0x28) [0xad3aa8]
>>>> 13: (CephContextServiceThread::entry()+0x13f) [0xb9911f]
>>>> 14: (()+0x8062) [0x7fe0f797e062]
>>>> 15: (clone()+0x6d) [0x7fe0f62bea3d]
>>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>>
>>>>
>>>>--- begin dump of recent events ---
>>>>     0> 2014-07-09 20:51:01.534706 7fe0f4617700 -1 *** Caught signal (Aborted) **
>>>> in thread 7fe0f4617700
>>>>
>>>>
>>>> ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
>>>> 1: /usr/bin/ceph-osd() [0xaac562]
>>>> 2: (()+0xf880) [0x7fe0f7985880]
>>>> 3: (gsignal()+0x39) [0x7fe0f620e3a9]
>>>> 4: (abort()+0x148) [0x7fe0f62114c8]
>>>> 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fe0f6afb5e5]
>>>> 6: (()+0x5e746) [0x7fe0f6af9746]
>>>> 7: (()+0x5e773) [0x7fe0f6af9773]
>>>> 8: (()+0x5e9b2) [0x7fe0f6af99b2]
>>>> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x40a) [0xb85b6a]
>>>> 10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2eb) [0xad2cbb]
>>>> 11: (ceph::HeartbeatMap::is_healthy()+0xb6) [0xad34c6]
>>>> 12: (ceph::HeartbeatMap::check_touch_file()+0x28) [0xad3aa8]
>>>> 13: (CephContextServiceThread::entry()+0x13f) [0xb9911f]
>>>> 14: (()+0x8062) [0x7fe0f797e062]
>>>> 15: (clone()+0x6d) [0x7fe0f62bea3d]
>>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>>
>>>>
>>>>--- logging levels ---
>>>>   0/ 5 none
>>>>   0/ 1 lockdep
>>>>   0/ 1 context
>>>>   1/ 1 crush
>>>>   1/ 5 mds
>>>>   1/ 5 mds_balancer
>>>>   1/ 5 mds_locker
>>>>   1/ 5 mds_log
>>>>   1/ 5 mds_log_expire
>>>>   1/ 5 mds_migrator
>>>>   0/ 1 buffer
>>>>   0/ 1 timer
>>>>   0/ 1 filer
>>>>   0/ 1 striper
>>>>   0/ 1 objecter
>>>>   0/ 5 rados
>>>>   0/ 5 rbd
>>>>   0/ 5 journaler
>>>>   0/ 5 objectcacher
>>>>   0/ 5 client
>>>>   0/ 5 osd
>>>>   0/ 5 optracker
>>>>   0/ 5 objclass
>>>>   1/ 3 filestore
>>>>   1/ 3 keyvaluestore
>>>>   1/ 3 journal
>>>>   0/ 5 ms
>>>>   1/ 5 mon
>>>>   0/10 monc
>>>>   1/ 5 paxos
>>>>   0/ 5 tp
>>>>   1/ 5 auth
>>>>   1/ 5 crypto
>>>>   1/ 1 finisher
>>>>   1/ 5 heartbeatmap
>>>>   1/ 5 perfcounter
>>>>   1/ 5 rgw
>>>>   1/ 5 javaclient
>>>>   1/ 5 asok
>>>>   1/ 1 throttle
>>>>  -2/-2 (syslog threshold)
>>>>  -1/-1 (stderr threshold)
>>>>  max_recent     10000
>>>>  max_new         1000
>>>>  log_file /var/log/ceph/ceph-osd.0.log
>>>>--- end dump of recent events ---
>>>>
>>>>
>>>>After several attempts at it, osd.2 (which was on OSD1, which survived the power event) never comes up. Looks like the journal was corrupted.
>>>>
>>>>
>>>>    -1> 2014-07-09 20:44:14.992840 7f12256b67c0 -1 journal Unable to read past sequence 2157634 but header indicates the journal has committed up through 2157670, journal is corrupt
>>>>     0> 2014-07-09 20:44:14.998742 7f12256b67c0 -1 os/FileJournal.cc: In function 'bool FileJournal::read_entry(ceph::bufferlist&, uint64_t&, bool*)' thread 7f12256b67c0 time 2014-07-09 20:44:14.993082
>>>>os/FileJournal.cc: 1677: FAILED assert(0)
>>>>
>>>>
>>>> ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
>>>> 1: (FileJournal::read_entry(ceph::buffer::list&, unsigned long&, bool*)+0x467) [0xa8d497]
>>>> 2: (JournalingObjectStore::journal_replay(unsigned long)+0x22e) [0x9dfebe]
>>>> 3: (FileStore::mount()+0x32c9) [0x9b7939]
>>>> 4: (OSD::do_convertfs(ObjectStore*)+0x1a) [0x78d8fa]
>>>> 5: (main()+0x2237) [0x730837]
>>>> 6: (__libc_start_main()+0xf5) [0x7f12236bdb45]
>>>> 7: /usr/bin/ceph-osd() [0x734479]
>>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>>
>>>>
>>>>--- logging levels ---
>>>>   0/ 5 none
>>>>   0/ 1 lockdep
>>>>   0/ 1 context
>>>>   1/ 1 crush
>>>>   1/ 5 mds
>>>>   1/ 5 mds_balancer
>>>>   1/ 5 mds_locker
>>>>   1/ 5 mds_log
>>>>   1/ 5 mds_log_expire
>>>>   1/ 5 mds_migrator
>>>>   0/ 1 buffer
>>>>   0/ 1 timer
>>>>   0/ 1 filer
>>>>   0/ 1 striper
>>>>   0/ 1 objecter
>>>>   0/ 5 rados
>>>>   0/ 5 rbd
>>>>   0/ 5 journaler
>>>>   0/ 5 objectcacher
>>>>   0/ 5 client
>>>>   0/ 5 osd
>>>>   0/ 5 optracker
>>>>   0/ 5 objclass
>>>>   1/ 3 filestore
>>>>   1/ 3 keyvaluestore
>>>>   1/ 3 journal
>>>>   0/ 5 ms
>>>>   1/ 5 mon
>>>>   0/10 monc
>>>>   1/ 5 paxos
>>>>   0/ 5 tp
>>>>   1/ 5 auth
>>>>   1/ 5 crypto
>>>>   1/ 1 finisher
>>>>   1/ 5 heartbeatmap
>>>>   1/ 5 perfcounter
>>>>   1/ 5 rgw
>>>>   1/ 5 javaclient
>>>>   1/ 5 asok
>>>>   1/ 1 throttle
>>>>  -2/-2 (syslog threshold)
>>>>  -1/-1 (stderr threshold)
>>>>  max_recent     10000
>>>>  max_new         1000
>>>>  log_file /var/log/ceph/ceph-osd.2.log
>>>>--- end dump of recent events ---
>>>>2014-07-09 20:44:15.010090 7f12256b67c0 -1 *** Caught signal (Aborted) **
>>>> in thread 7f12256b67c0
>>>> ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
>>>> 1: /usr/bin/ceph-osd() [0xaac562]
>>>> 2: (()+0xf880) [0x7f1224e48880]
>>>> 3: (gsignal()+0x39) [0x7f12236d13a9]
>>>> 4: (abort()+0x148) [0x7f12236d44c8]
>>>> 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f1223fbe5e5]
>>>> 6: (()+0x5e746) [0x7f1223fbc746]
>>>> 7: (()+0x5e773) [0x7f1223fbc773]
>>>> 8: (()+0x5e9b2) [0x7f1223fbc9b2]
>>>> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x40a) [0xb85b6a]
>>>> 10: (FileJournal::read_entry(ceph::buffer::list&, unsigned long&, bool*)+0x467) [0xa8d497]
>>>> 11: (JournalingObjectStore::journal_replay(unsigned long)+0x22e) [0x9dfebe]
>>>> 12: (FileStore::mount()+0x32c9) [0x9b7939]
>>>> 13: (OSD::do_convertfs(ObjectStore*)+0x1a) [0x78d8fa]
>>>> 14: (main()+0x2237) [0x730837]
>>>> 15: (__libc_start_main()+0xf5) [0x7f12236bdb45]
>>>> 16: /usr/bin/ceph-osd() [0x734479]
>>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>>
>>>>
>>>>--- begin dump of recent events ---
>>>>     0> 2014-07-09 20:44:15.010090 7f12256b67c0 -1 *** Caught signal (Aborted) **
>>>> in thread 7f12256b67c0
>>>>
>>>>
>>>> ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
>>>> 1: /usr/bin/ceph-osd() [0xaac562]
>>>> 2: (()+0xf880) [0x7f1224e48880]
>>>> 3: (gsignal()+0x39) [0x7f12236d13a9]
>>>> 4: (abort()+0x148) [0x7f12236d44c8]
>>>> 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f1223fbe5e5]
>>>> 6: (()+0x5e746) [0x7f1223fbc746]
>>>> 7: (()+0x5e773) [0x7f1223fbc773]
>>>> 8: (()+0x5e9b2) [0x7f1223fbc9b2]
>>>> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x40a) [0xb85b6a]
>>>> 10: (FileJournal::read_entry(ceph::buffer::list&, unsigned long&, bool*)+0x467) [0xa8d497]
>>>> 11: (JournalingObjectStore::journal_replay(unsigned long)+0x22e) [0x9dfebe]
>>>> 12: (FileStore::mount()+0x32c9) [0x9b7939]
>>>> 13: (OSD::do_convertfs(ObjectStore*)+0x1a) [0x78d8fa]
>>>> 14: (main()+0x2237) [0x730837]
>>>> 15: (__libc_start_main()+0xf5) [0x7f12236bdb45]
>>>> 16: /usr/bin/ceph-osd() [0x734479]
>>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>>
>>>>
>>>>--- logging levels ---
>>>>   0/ 5 none
>>>>   0/ 1 lockdep
>>>>   0/ 1 context
>>>>   1/ 1 crush
>>>>   1/ 5 mds
>>>>   1/ 5 mds_balancer
>>>>   1/ 5 mds_locker
>>>>   1/ 5 mds_log
>>>>   1/ 5 mds_log_expire
>>>>   1/ 5 mds_migrator
>>>>   0/ 1 buffer
>>>>   0/ 1 timer
>>>>   0/ 1 filer
>>>>   0/ 1 striper
>>>>   0/ 1 objecter
>>>>   0/ 5 rados
>>>>   0/ 5 rbd
>>>>   0/ 5 journaler
>>>>   0/ 5 objectcacher
>>>>   0/ 5 client
>>>>   0/ 5 osd
>>>>   0/ 5 optracker
>>>>   0/ 5 objclass
>>>>   1/ 3 filestore
>>>>   1/ 3 keyvaluestore
>>>>   1/ 3 journal
>>>>   0/ 5 ms
>>>>   1/ 5 mon
>>>>   0/10 monc
>>>>   1/ 5 paxos
>>>>   0/ 5 tp
>>>>   1/ 5 auth
>>>>   1/ 5 crypto
>>>>   1/ 1 finisher
>>>>   1/ 5 heartbeatmap
>>>>   1/ 5 perfcounter
>>>>   1/ 5 rgw
>>>>   1/ 5 javaclient
>>>>   1/ 5 asok
>>>>   1/ 1 throttle
>>>>  -2/-2 (syslog threshold)
>>>>  -1/-1 (stderr threshold)
>>>>  max_recent     10000
>>>>  max_new         1000
>>>>  log_file /var/log/ceph/ceph-osd.2.log
>>>>--- end dump of recent events ---
>>>>
>>>>
>>>>
>>>>
>>>>So I thought maybe upgrading to 0.82 would give it a better chance at fixing things... so I did. Now not only do those OSDs fail (osd.1 is up but with only 14M of memory... I assume that's broken too), but MDS fails too.
>>>>
>>>>
>>>># /usr/bin/ceph-mds -i MDS1 --pid-file /var/run/ceph/mds.MDS1.pid -c /etc/ceph/ceph.conf --cluster ceph -f --debug-mds=20 --debug-journaler=10
>>>>starting mds.MDS1 at :/0
>>>>mds/MDLog.cc: In function 'void MDLog::_replay_thread()' thread 7f8e07c21700 time 2014-07-09 21:01:10.190965
>>>>mds/MDLog.cc: 815: FAILED assert(journaler->is_readable())
>>>> ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
>>>> 1: (MDLog::_replay_thread()+0x197b) [0x85a3cb]
>>>> 2: (MDLog::ReplayThread::entry()+0xd) [0x66466d]
>>>> 3: (()+0x8062) [0x7f8e0fe3f062]
>>>> 4: (clone()+0x6d) [0x7f8e0ebd3a3d]
>>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>>2014-07-09 21:01:10.192936 7f8e07c21700 -1 mds/MDLog.cc: In function 'void MDLog::_replay_thread()' thread 7f8e07c21700 time 2014-07-09 21:01:10.190965
>>>>mds/MDLog.cc: 815: FAILED assert(journaler->is_readable())
>>>>
>>>>
>>>> ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
>>>> 1: (MDLog::_replay_thread()+0x197b) [0x85a3cb]
>>>> 2: (MDLog::ReplayThread::entry()+0xd) [0x66466d]
>>>> 3: (()+0x8062) [0x7f8e0fe3f062]
>>>> 4: (clone()+0x6d) [0x7f8e0ebd3a3d]
>>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>>
>>>>
>>>>     0> 2014-07-09 21:01:10.192936 7f8e07c21700 -1 mds/MDLog.cc: In function 'void MDLog::_replay_thread()' thread 7f8e07c21700 time 2014-07-09 21:01:10.190965
>>>>mds/MDLog.cc: 815: FAILED assert(journaler->is_readable())
>>>>
>>>>
>>>> ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
>>>> 1: (MDLog::_replay_thread()+0x197b) [0x85a3cb]
>>>> 2: (MDLog::ReplayThread::entry()+0xd) [0x66466d]
>>>> 3: (()+0x8062) [0x7f8e0fe3f062]
>>>> 4: (clone()+0x6d) [0x7f8e0ebd3a3d]
>>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>>
>>>>
>>>>terminate called after throwing an instance of 'ceph::FailedAssertion'
>>>>*** Caught signal (Aborted) **
>>>> in thread 7f8e07c21700
>>>> ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
>>>> 1: /usr/bin/ceph-mds() [0x8d81f2]
>>>> 2: (()+0xf880) [0x7f8e0fe46880]
>>>> 3: (gsignal()+0x39) [0x7f8e0eb233a9]
>>>> 4: (abort()+0x148) [0x7f8e0eb264c8]
>>>> 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f8e0f4105e5]
>>>> 6: (()+0x5e746) [0x7f8e0f40e746]
>>>> 7: (()+0x5e773) [0x7f8e0f40e773]
>>>> 8: (()+0x5e9b2) [0x7f8e0f40e9b2]
>>>> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x40a) [0x9ab5da]
>>>> 10: (MDLog::_replay_thread()+0x197b) [0x85a3cb]
>>>> 11: (MDLog::ReplayThread::entry()+0xd) [0x66466d]
>>>> 12: (()+0x8062) [0x7f8e0fe3f062]
>>>> 13: (clone()+0x6d) [0x7f8e0ebd3a3d]
>>>>2014-07-09 21:01:10.201968 7f8e07c21700 -1 *** Caught signal (Aborted) **
>>>> in thread 7f8e07c21700
>>>>
>>>>
>>>> ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
>>>> 1: /usr/bin/ceph-mds() [0x8d81f2]
>>>> 2: (()+0xf880) [0x7f8e0fe46880]
>>>> 3: (gsignal()+0x39) [0x7f8e0eb233a9]
>>>> 4: (abort()+0x148) [0x7f8e0eb264c8]
>>>> 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f8e0f4105e5]
>>>> 6: (()+0x5e746) [0x7f8e0f40e746]
>>>> 7: (()+0x5e773) [0x7f8e0f40e773]
>>>> 8: (()+0x5e9b2) [0x7f8e0f40e9b2]
>>>> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x40a) [0x9ab5da]
>>>> 10: (MDLog::_replay_thread()+0x197b) [0x85a3cb]
>>>> 11: (MDLog::ReplayThread::entry()+0xd) [0x66466d]
>>>> 12: (()+0x8062) [0x7f8e0fe3f062]
>>>> 13: (clone()+0x6d) [0x7f8e0ebd3a3d]
>>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>>
>>>>
>>>>     0> 2014-07-09 21:01:10.201968 7f8e07c21700 -1 *** Caught signal (Aborted) **
>>>> in thread 7f8e07c21700
>>>>
>>>>
>>>> ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
>>>> 1: /usr/bin/ceph-mds() [0x8d81f2]
>>>> 2: (()+0xf880) [0x7f8e0fe46880]
>>>> 3: (gsignal()+0x39) [0x7f8e0eb233a9]
>>>> 4: (abort()+0x148) [0x7f8e0eb264c8]
>>>> 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f8e0f4105e5]
>>>> 6: (()+0x5e746) [0x7f8e0f40e746]
>>>> 7: (()+0x5e773) [0x7f8e0f40e773]
>>>> 8: (()+0x5e9b2) [0x7f8e0f40e9b2]
>>>> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x40a) [0x9ab5da]
>>>> 10: (MDLog::_replay_thread()+0x197b) [0x85a3cb]
>>>> 11: (MDLog::ReplayThread::entry()+0xd) [0x66466d]
>>>> 12: (()+0x8062) [0x7f8e0fe3f062]
>>>> 13: (clone()+0x6d) [0x7f8e0ebd3a3d]
>>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>>
>>>>
>>>>Aborted
>>>>root@MDS1:/var/log/ceph# /usr/bin/ceph-mds -i MDS1 --pid-file /var/run/ceph/mds.MDS1.pid -c /etc/ceph/ceph.conf --cluster ceph -f --debug-mds=20 --debug-journaler=10
>>>>starting mds.MDS1 at :/0
>>>>mds/MDLog.cc: In function 'void MDLog::_replay_thread()' thread 7fb7f7b83700 time 2014-07-09 23:21:43.383304
>>>>mds/MDLog.cc: 815: FAILED assert(journaler->is_readable())
>>>> ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
>>>> 1: (MDLog::_replay_thread()+0x197b) [0x85a3cb]
>>>> 2: (MDLog::ReplayThread::entry()+0xd) [0x66466d]
>>>> 3: (()+0x8062) [0x7fb7ffda1062]
>>>> 4: (clone()+0x6d) [0x7fb7feb35a3d]
>>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>>2014-07-09 23:21:43.385274 7fb7f7b83700 -1 mds/MDLog.cc: In function 'void MDLog::_replay_thread()' thread 7fb7f7b83700 time 2014-07-09 23:21:43.383304
>>>>mds/MDLog.cc: 815: FAILED assert(journaler->is_readable())
>>>>
>>>> ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
>>>> 1: (MDLog::_replay_thread()+0x197b) [0x85a3cb]
>>>> 2: (MDLog::ReplayThread::entry()+0xd) [0x66466d]
>>>> 3: (()+0x8062) [0x7fb7ffda1062]
>>>> 4: (clone()+0x6d) [0x7fb7feb35a3d]
>>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>>
>>>>     0> 2014-07-09 23:21:43.385274 7fb7f7b83700 -1 mds/MDLog.cc: In function 'void MDLog::_replay_thread()' thread 7fb7f7b83700 time 2014-07-09 23:21:43.383304
>>>>mds/MDLog.cc: 815: FAILED assert(journaler->is_readable())
>>>>
>>>> ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
>>>> 1: (MDLog::_replay_thread()+0x197b) [0x85a3cb]
>>>> 2: (MDLog::ReplayThread::entry()+0xd) [0x66466d]
>>>> 3: (()+0x8062) [0x7fb7ffda1062]
>>>> 4: (clone()+0x6d) [0x7fb7feb35a3d]
>>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>>
>>>>terminate called after throwing an instance of 'ceph::FailedAssertion'
>>>>*** Caught signal (Aborted) **
>>>> in thread 7fb7f7b83700
>>>> ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
>>>> 1: /usr/bin/ceph-mds() [0x8d81f2]
>>>> 2: (()+0xf880) [0x7fb7ffda8880]
>>>> 3: (gsignal()+0x39) [0x7fb7fea853a9]
>>>> 4: (abort()+0x148) [0x7fb7fea884c8]
>>>> 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fb7ff3725e5]
>>>> 6: (()+0x5e746) [0x7fb7ff370746]
>>>> 7: (()+0x5e773) [0x7fb7ff370773]
>>>> 8: (()+0x5e9b2) [0x7fb7ff3709b2]
>>>> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x40a) [0x9ab5da]
>>>> 10: (MDLog::_replay_thread()+0x197b) [0x85a3cb]
>>>> 11: (MDLog::ReplayThread::entry()+0xd) [0x66466d]
>>>> 12: (()+0x8062) [0x7fb7ffda1062]
>>>> 13: (clone()+0x6d) [0x7fb7feb35a3d]
>>>>2014-07-09 23:21:43.394324 7fb7f7b83700 -1 *** Caught signal (Aborted) **
>>>> in thread 7fb7f7b83700
>>>>
>>>>
>>>> ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
>>>> 1: /usr/bin/ceph-mds() [0x8d81f2]
>>>> 2: (()+0xf880) [0x7fb7ffda8880]
>>>> 3: (gsignal()+0x39) [0x7fb7fea853a9]
>>>> 4: (abort()+0x148) [0x7fb7fea884c8]
>>>> 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fb7ff3725e5]
>>>> 6: (()+0x5e746) [0x7fb7ff370746]
>>>> 7: (()+0x5e773) [0x7fb7ff370773]
>>>> 8: (()+0x5e9b2) [0x7fb7ff3709b2]
>>>> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x40a) [0x9ab5da]
>>>> 10: (MDLog::_replay_thread()+0x197b) [0x85a3cb]
>>>> 11: (MDLog::ReplayThread::entry()+0xd) [0x66466d]
>>>> 12: (()+0x8062) [0x7fb7ffda1062]
>>>> 13: (clone()+0x6d) [0x7fb7feb35a3d]
>>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>>
>>>>
>>>>     0> 2014-07-09 23:21:43.394324 7fb7f7b83700 -1 *** Caught signal (Aborted) **
>>>> in thread 7fb7f7b83700
>>>>
>>>>
>>>> ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
>>>> 1: /usr/bin/ceph-mds() [0x8d81f2]
>>>> 2: (()+0xf880) [0x7fb7ffda8880]
>>>> 3: (gsignal()+0x39) [0x7fb7fea853a9]
>>>> 4: (abort()+0x148) [0x7fb7fea884c8]
>>>> 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fb7ff3725e5]
>>>> 6: (()+0x5e746) [0x7fb7ff370746]
>>>> 7: (()+0x5e773) [0x7fb7ff370773]
>>>> 8: (()+0x5e9b2) [0x7fb7ff3709b2]
>>>> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x40a) [0x9ab5da]
>>>> 10: (MDLog::_replay_thread()+0x197b) [0x85a3cb]
>>>> 11: (MDLog::ReplayThread::entry()+0xd) [0x66466d]
>>>> 12: (()+0x8062) [0x7fb7ffda1062]
>>>> 13: (clone()+0x6d) [0x7fb7feb35a3d]
>>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>>
>>>>
>>>>Aborted
>>>>
>>>>
>>>>It felt like OSD1 was trashed, so I removed osd.0, osd.1, and osd.2.
>>>>
>>>>
>>>>I'm still seeing the status below, and can't get the MDS up.
>>>>
>>>>
>>>>HEALTH_ERR 154 pgs degraded; 38 pgs inconsistent; 192 pgs stuck unclean; recovery 1374024/3513098 objects degraded (39.111%); 1374 scrub errors; mds cluster is degraded; mds MDS1 is laggy
>>>>
>>>>
>>>>Is there something I can try to bring this file system up again? =P I would like to access some of that data again. Let me know if you need any additional info. I was running Debian kernel 3.13.1 for the first part, then 3.14.1 when I upgraded Ceph to 0.82.
>>>>
>>>>
>>>>Regards,
>>>>Hong
>>>>
>>>>
>>>>_______________________________________________
>>>>ceph-users mailing list
>>>>ceph-users at lists.ceph.com
>>>>http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>>
>
>
>
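The one-line HEALTH_ERR summary quoted above packs several separate problems into a single string. As a reading aid only, here is a minimal Python sketch that splits that exact line into its counted items and its free-form items. The function name parse_health is hypothetical, and the parsing rule (split on ";", treat a leading integer as a count) is an assumption based on this one sample, not a documented Ceph output format.

```python
import re

# Health summary copied verbatim from the quoted message.
HEALTH = (
    "HEALTH_ERR 154 pgs degraded; 38 pgs inconsistent; 192 pgs stuck unclean; "
    "recovery 1374024/3513098 objects degraded (39.111%); 1374 scrub errors; "
    "mds cluster is degraded; mds MDS1 is laggy"
)


def parse_health(line):
    """Split a one-line health summary into (status, counts, other).

    Hypothetical helper: assumes items are separated by ';' and that
    counted items look like '<integer> <description>'.
    """
    status, _, rest = line.partition(" ")
    counts, other = {}, []
    for item in (part.strip() for part in rest.split(";")):
        match = re.fullmatch(r"(\d+) (.+)", item)
        if match:
            counts[match.group(2)] = int(match.group(1))
        else:
            other.append(item)
    return status, counts, other


status, counts, other = parse_health(HEALTH)
print(status)
for name, number in counts.items():
    print(f"{number:>6}  {name}")
for item in other:
    print("     -  " + item)
```

Read this way, the PG-level items (degraded, inconsistent, stuck unclean, scrub errors) and the MDS items (cluster degraded, MDS1 laggy) show up as separate entries, which matches how the thread treats them: OSD recovery first, then the MDS journal.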