Based on your suggestion here is what I did.

# ceph osd set nobackfill
set nobackfill

# ceph osd set norecovery
Invalid command:  norecovery not in pause|noup|nodown|noout|noin|nobackfill|norecover|noscrub|nodeep-scrub|notieragent
osd set pause|noup|nodown|noout|noin|nobackfill|norecover|noscrub|nodeep-scrub|notieragent :  set <key>
Error EINVAL: invalid command

# ceph osd set norecover
set norecover

# ceph osd set noin
set noin

# ceph create osd
no valid command found; 10 closest matches:
osd tier remove <poolname> <poolname>
osd tier cache-mode <poolname> none|writeback|forward|readonly
osd thrash <int[0-]>
osd tier add <poolname> <poolname> {--force-nonempty}
osd pool stats {<name>}
osd reweight-by-utilization {<int[100-]>}
osd pool set <poolname> size|min_size|crash_replay_interval|pg_num|pgp_num|crush_ruleset|hashpspool|hit_set_type|hit_set_period|hit_set_count|hit_set_fpp|debug_fake_ec_pool|target_max_bytes|target_max_objects|cache_target_dirty_ratio|cache_target_full_ratio|cache_min_flush_age|cache_min_evict_age|auid <val> {--yes-i-really-mean-it}
osd pool set-quota <poolname> max_objects|max_bytes <val>
osd pool rename <poolname> <poolname>
osd pool get <poolname> size|min_size|crash_replay_interval|pg_num|pgp_num|crush_ruleset|hit_set_type|hit_set_period|hit_set_count|hit_set_fpp|auid
Error EINVAL: invalid command

# ceph osd create
0
# ceph osd create
1
# ceph osd create
2

# start ceph-osd id=0
bash: start: command not found

# /etc/init.d/ceph start osd.0
=== osd.0 ===
2014-07-18 21:21:37.207159 7ff2c64d7700  0 librados: osd.0 authentication error (1) Operation not permitted
Error connecting to cluster: PermissionError
failed: 'timeout 10 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.0 --keyring=/var/lib/ceph/osd/ceph-0/keyring osd crush create-or-move -- 0 1.82 host=OSD1 root=default'

# ceph status
    cluster 9b2c9bca-112e-48b0-86fc-587ef9a52948
     health HEALTH_ERR 164 pgs degraded; 38 pgs inconsistent; 192 pgs stuck unclean; recovery 1484224/3513098 objects degraded (42.248%); 1374 scrub errors; mds cluster is degraded; mds MDS1 is laggy; noin,nobackfill,norecover flag(s) set
     monmap e1: 1 mons at {MDS1=192.168.1.20:6789/0}, election epoch 1, quorum 0 MDS1
     mdsmap e7182: 1/1/1 up {0=MDS1=up:replay(laggy or crashed)}
     osdmap e3133: 6 osds: 3 up, 3 in
            flags noin,nobackfill,norecover
      pgmap v309437: 192 pgs, 3 pools, 1571 GB data, 1715 kobjects
            1958 GB used, 3627 GB / 5586 GB avail
            1484224/3513098 objects degraded (42.248%)
                 131 active+degraded
                  23 active+remapped
                  33 active+degraded+inconsistent
                   5 active+remapped+inconsistent

# ceph osd stat
     osdmap e3133: 6 osds: 3 up, 3 in
            flags noin,nobackfill,norecover

# ceph auth get-or-create osd.0 mon 'allow rwx' osd 'allow *' -o /var/lib/ceph/osd/ceph-0/keyring
# /etc/init.d/ceph start osd.0
=== osd.0 ===
create-or-move updating item name 'osd.0' weight 1.82 at location {host=OSD1,root=default} to crush map
Starting Ceph osd.0 on OSD1...
starting osd.0 at :/0 osd_data /var/lib/ceph/osd/ceph-0 /var/lib/ceph/osd/ceph-0/journal

root@OSD1:/home/genie# ceph auth get-or-create osd.1 mon 'allow rwx' osd 'allow *' -o /var/lib/ceph/osd/ceph-1/keyring
root@OSD1:/home/genie# ceph auth get-or-create osd.2 mon 'allow rwx' osd 'allow *' -o /var/lib/ceph/osd/ceph-2/keyring
root@OSD1:/home/genie# /etc/init.d/ceph start osd.1
=== osd.1 ===
failed: 'timeout 10 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.1 --keyring=/var/lib/ceph/osd/ceph-1/keyring osd crush create-or-move -- 1 1.82 host=OSD1 root=default'

# /etc/init.d/ceph start osd.2
=== osd.2 ===
create-or-move updating item name 'osd.2' weight 1.82 at location {host=OSD1,root=default} to crush map
Starting Ceph osd.2 on OSD1...
starting osd.2 at :/0 osd_data /var/lib/ceph/osd/ceph-2 /var/lib/ceph/osd/ceph-2/journal

# /etc/init.d/ceph start osd.1
=== osd.1 ===
failed: 'timeout 10 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.1 --keyring=/var/lib/ceph/osd/ceph-1/keyring osd crush create-or-move -- 1 1.82 host=OSD1 root=default'

# ceph health
Segmentation fault
# ceph health
Bus error
# ceph health
HEALTH_ERR 164 pgs degraded; 38 pgs inconsistent; 192 pgs stuck unclean; recovery 1484224/3513098 objects degraded (42.248%); 1374 scrub errors; mds cluster is degraded; mds MDS1 is laggy; noin,nobackfill,norecover flag(s) set

# /etc/init.d/ceph start osd.1
=== osd.1 ===
create-or-move updating item name 'osd.1' weight 1.82 at location {host=OSD1,root=default} to crush map
Starting Ceph osd.1 on OSD1...
starting osd.1 at :/0 osd_data /var/lib/ceph/osd/ceph-1 /var/lib/ceph/osd/ceph-1/journal

# ceph -w
    cluster 9b2c9bca-112e-48b0-86fc-587ef9a52948
     health HEALTH_ERR 164 pgs degraded; 38 pgs inconsistent; 192 pgs stuck unclean; recovery 1484224/3513098 objects degraded (42.248%); 1374 scrub errors; mds cluster is degraded; mds MDS1 is laggy; noin,nobackfill,norecover flag(s) set
     monmap e1: 1 mons at {MDS1=192.168.1.20:6789/0}, election epoch 1, quorum 0 MDS1
     mdsmap e7182: 1/1/1 up {0=MDS1=up:replay(laggy or crashed)}
     osdmap e3137: 6 osds: 4 up, 3 in
            flags noin,nobackfill,norecover
      pgmap v309463: 192 pgs, 3 pools, 1571 GB data, 1715 kobjects
            1958 GB used, 3627 GB / 5586 GB avail
            1484224/3513098 objects degraded (42.248%)
                 131 active+degraded
                  23 active+remapped
                  33 active+degraded+inconsistent
                   5 active+remapped+inconsistent

2014-07-19 21:34:59.166709 mon.0 [INF] pgmap v309463: 192 pgs: 131 active+degraded, 23 active+remapped, 33 active+degraded+inconsistent, 5 active+remapped+inconsistent; 1571 GB data, 1958 GB used, 3627 GB / 5586 GB avail; 1484224/3513098 objects degraded (42.248%)

osd.2 doesn't come up. osd.1 uses little memory compared to osd.0, but it stays alive. I killed osd.1 and osd.2 for now. At this point osd.0's CPU usage was spiking on and off for a while, but it didn't die. So I ran "ceph osd unset noin" and restarted osd.0. It seemed to be doing something for a long time, so I let it run overnight. I found it crashed today. Below is its log.

   -20> 2014-07-20 00:54:10.924602 7fb562528700  5 -- op tracker -- , seq: 4847, time: 2014-07-20 00:54:10.924244, event: header_read, op: osd_sub_op(unknown.0.0:0 1.29 0//0//-1 [scrub-reserve] v 0'0 snapset=0=[]:[] snapc=0=[])
   -19> 2014-07-20 00:54:10.924652 7fb562528700  5 -- op tracker -- , seq: 4847, time: 2014-07-20 00:54:10.924250, event: throttled, op: osd_sub_op(unknown.0.0:0 1.29 0//0//-1 [scrub-reserve] v 0'0 snapset=0=[]:[] snapc=0=[])
   -18> 2014-07-20 00:54:10.924698 7fb562528700  5 -- op tracker -- , seq: 4847, time: 2014-07-20 00:54:10.924458, event: all_read, op: osd_sub_op(unknown.0.0:0 1.29 0//0//-1 [scrub-reserve] v 0'0 snapset=0=[]:[] snapc=0=[])
   -17> 2014-07-20 00:54:10.924743 7fb562528700  5 -- op tracker -- , seq: 4847, time: 0.000000, event: dispatched, op: osd_sub_op(unknown.0.0:0 1.29 0//0//-1 [scrub-reserve] v 0'0 snapset=0=[]:[] snapc=0=[])
   -16> 2014-07-20 00:54:10.924880 7fb54d78f700  5 -- op tracker -- , seq: 4847, time: 2014-07-20 00:54:10.924861, event: reached_pg, op: osd_sub_op(unknown.0.0:0 1.29 0//0//-1 [scrub-reserve] v 0'0 snapset=0=[]:[] snapc=0=[])
   -15> 2014-07-20 00:54:10.924936 7fb54d78f700  5 -- op tracker -- , seq: 4847, time: 2014-07-20 00:54:10.924915, event: started, op: osd_sub_op(unknown.0.0:0 1.29 0//0//-1 [scrub-reserve] v 0'0 snapset=0=[]:[] snapc=0=[])
   -14> 2014-07-20 00:54:10.924974 7fb54d78f700  1 -- 192.168.2.30:6800/18511 --> 192.168.2.31:6804/13991 -- osd_sub_op_reply(unknown.0.0:0 1.29 0//0//-1 [scrub-reserve] ack, result = 0) v2 --  +1 0x10503680 con 0xfdca000
   -13> 2014-07-20 00:54:10.925053 7fb54d78f700  5 -- op tracker -- , seq: 4847, time: 2014-07-20 00:54:10.925034, event: done, op: osd_sub_op(unknown.0.0:0 1.29 0//0//-1 [scrub-reserve] v 0'0 snapset=0=[]:[] snapc=0=[])
   -12> 2014-07-20 00:54:10.926801 7fb562528700  1 -- 192.168.2.30:6800/18511 <== osd.4 192.168.2.31:6804/13991 1742 ==== osd_sub_op(unknown.0.0:0 1.29 0//0//-1 [scrub-unreserve] v 0'0 snapset=0=[]:[] snapc=0=[]) v10 ==== 1145+0+0 (2357365982 0 0) 0x1045a100 con 0xfdca000
   -11> 2014-07-20 00:54:10.926912 7fb562528700  5 -- op tracker -- , seq: 4848, time: 2014-07-20 00:54:10.926624, event: header_read, op: osd_sub_op(unknown.0.0:0 1.29 0//0//-1 [scrub-unreserve] v 0'0 snapset=0=[]:[] snapc=0=[])
   -10> 2014-07-20 00:54:10.926961 7fb562528700  5 -- op tracker -- , seq: 4848, time: 2014-07-20 00:54:10.926628, event: throttled, op: osd_sub_op(unknown.0.0:0 1.29 0//0//-1 [scrub-unreserve] v 0'0 snapset=0=[]:[] snapc=0=[])
    -9> 2014-07-20 00:54:10.927004 7fb562528700  5 -- op tracker -- , seq: 4848, time: 2014-07-20 00:54:10.926786, event: all_read, op: osd_sub_op(unknown.0.0:0 1.29 0//0//-1 [scrub-unreserve] v 0'0 snapset=0=[]:[] snapc=0=[])
    -8> 2014-07-20 00:54:10.927046 7fb562528700  5 -- op tracker -- , seq: 4848, time: 0.000000, event: dispatched, op: osd_sub_op(unknown.0.0:0 1.29 0//0//-1 [scrub-unreserve] v 0'0 snapset=0=[]:[] snapc=0=[])
    -7> 2014-07-20 00:54:10.927179 7fb54df90700  5 -- op tracker -- , seq: 4848, time: 2014-07-20 00:54:10.927160, event: reached_pg, op: osd_sub_op(unknown.0.0:0 1.29 0//0//-1 [scrub-unreserve] v 0'0 snapset=0=[]:[] snapc=0=[])
    -6> 2014-07-20 00:54:10.927237 7fb54df90700  5 -- op tracker -- , seq: 4848, time: 2014-07-20 00:54:10.927216, event: started, op: osd_sub_op(unknown.0.0:0 1.29 0//0//-1 [scrub-unreserve] v 0'0 snapset=0=[]:[] snapc=0=[])
    -5> 2014-07-20 00:54:10.927289 7fb54df90700  5 -- op tracker -- , seq: 4848, time: 2014-07-20 00:54:10.927269, event: done, op: osd_sub_op(unknown.0.0:0 1.29 0//0//-1 [scrub-unreserve] v 0'0 snapset=0=[]:[] snapc=0=[])
    -4> 2014-07-20 00:54:10.941372 7fb551f98700  1 -- 192.168.2.30:6801/18511 <== osd.3 192.168.1.31:0/13623 776 ==== osd_ping(ping e3144 stamp 2014-07-20 00:54:10.942416) v2 ==== 47+0+0 (216963345 0 0) 0x103c0e00 con 0x1001bce0
    -3> 2014-07-20 00:54:10.941451 7fb551f98700  1 -- 192.168.2.30:6801/18511 --> 192.168.1.31:0/13623 -- osd_ping(ping_reply e3144 stamp 2014-07-20 00:54:10.942416) v2 --  +0 0x100e2540 con 0x1001bce0
    -2> 2014-07-20 00:54:10.941742 7fb55379b700  1 -- 192.168.1.30:6801/18511 <== osd.3 192.168.1.31:0/13623 776 ==== osd_ping(ping e3144 stamp 2014-07-20 00:54:10.942416) v2 ==== 47+0+0 (216963345 0 0) 0x10547880 con 0xff8db80
    -1> 2014-07-20 00:54:10.941842 7fb55379b700  1 -- 192.168.1.30:6801/18511 --> 192.168.1.31:0/13623 -- osd_ping(ping_reply e3144 stamp 2014-07-20 00:54:10.942416) v2 --  +0 0x10254a80 con 0xff8db80
     0> 2014-07-20 00:54:11.646226 7fb54c78d700 -1 os/DBObjectMap.cc: In function 'virtual bool DBObjectMap::DBObjectMapIteratorImpl::valid()' thread 7fb54c78d700 time 2014-07-20 00:54:11.640719
os/DBObjectMap.cc: 399: FAILED assert(!valid || cur_iter->valid())

 ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
 1: /usr/bin/ceph-osd() [0xa72172]
 2: (ReplicatedBackend::be_deep_scrub(hobject_t const&, ScrubMap::object&, ThreadPool::TPHandle&)+0x6c3) [0xa2df03]
 3: (PGBackend::be_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> > const&, bool, ThreadPool::TPHandle&)+0x503) [0x98c523]
 4: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool, ThreadPool::TPHandle&)+0x10b) [0x891d4b]
 5: (PG::replica_scrub(MOSDRepScrub*, ThreadPool::TPHandle&)+0x456) [0x8925d6]
 6: (OSD::RepScrubWQ::_process(MOSDRepScrub*, ThreadPool::TPHandle&)+0x10a) [0x7b00fa]
 7: (ThreadPool::worker(ThreadPool::WorkThread*)+0x68a) [0xb7792a]
 8: (ThreadPool::WorkThread::entry()+0x10) [0xb78b80]
 9: (()+0x8062) [0x7fb56b184062]
 10: (clone()+0x6d) [0x7fb569ac4a3d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.0.log
--- end dump of recent events ---

2014-07-20 00:54:11.998700 7fb54c78d700 -1 *** Caught signal (Aborted) **
 in thread 7fb54c78d700

 ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
 1: /usr/bin/ceph-osd() [0xaac562]
 2: (()+0xf880) [0x7fb56b18b880]
 3: (gsignal()+0x39) [0x7fb569a143a9]
 4: (abort()+0x148) [0x7fb569a174c8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fb56a3015e5]
 6: (()+0x5e746) [0x7fb56a2ff746]
 7: (()+0x5e773) [0x7fb56a2ff773]
 8: (()+0x5e9b2) [0x7fb56a2ff9b2]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x40a) [0xb85b6a]
 10: /usr/bin/ceph-osd() [0xa72172]
 11: (ReplicatedBackend::be_deep_scrub(hobject_t const&, ScrubMap::object&, ThreadPool::TPHandle&)+0x6c3) [0xa2df03]
 12: (PGBackend::be_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> > const&, bool, ThreadPool::TPHandle&)+0x503) [0x98c523]
 13: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool, ThreadPool::TPHandle&)+0x10b) [0x891d4b]
 14: (PG::replica_scrub(MOSDRepScrub*, ThreadPool::TPHandle&)+0x456) [0x8925d6]
 15: (OSD::RepScrubWQ::_process(MOSDRepScrub*, ThreadPool::TPHandle&)+0x10a) [0x7b00fa]
 16: (ThreadPool::worker(ThreadPool::WorkThread*)+0x68a) [0xb7792a]
 17: (ThreadPool::WorkThread::entry()+0x10) [0xb78b80]
 18: (()+0x8062) [0x7fb56b184062]
 19: (clone()+0x6d) [0x7fb569ac4a3d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
    -1> 2014-07-20 00:54:11.755763 7fb565618700  5 osd.0 3144 tick
     0> 2014-07-20 00:54:11.998700 7fb54c78d700 -1 *** Caught signal (Aborted) **
 in thread 7fb54c78d700

 ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
 1: /usr/bin/ceph-osd() [0xaac562]
 2: (()+0xf880) [0x7fb56b18b880]
 3: (gsignal()+0x39) [0x7fb569a143a9]
 4: (abort()+0x148) [0x7fb569a174c8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fb56a3015e5]
 6: (()+0x5e746) [0x7fb56a2ff746]
 7: (()+0x5e773) [0x7fb56a2ff773]
 8: (()+0x5e9b2) [0x7fb56a2ff9b2]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x40a) [0xb85b6a]
 10: /usr/bin/ceph-osd() [0xa72172]
 11: (ReplicatedBackend::be_deep_scrub(hobject_t const&, ScrubMap::object&, ThreadPool::TPHandle&)+0x6c3) [0xa2df03]
 12: (PGBackend::be_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> > const&, bool, ThreadPool::TPHandle&)+0x503) [0x98c523]
 13: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool, ThreadPool::TPHandle&)+0x10b) [0x891d4b]
 14: (PG::replica_scrub(MOSDRepScrub*, ThreadPool::TPHandle&)+0x456) [0x8925d6]
 15: (OSD::RepScrubWQ::_process(MOSDRepScrub*, ThreadPool::TPHandle&)+0x10a) [0x7b00fa]
 16: (ThreadPool::worker(ThreadPool::WorkThread*)+0x68a) [0xb7792a]
 17: (ThreadPool::WorkThread::entry()+0x10) [0xb78b80]
 18: (()+0x8062) [0x7fb56b184062]
 19: (clone()+0x6d) [0x7fb569ac4a3d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.0.log
--- end dump of recent events ---

What can I do about this one?

Regards,
Hong


On Friday, July 18, 2014 5:16 PM, Craig Lewis <clewis@centraldesktop.com> wrote:

That I can't help you with. I'm a pure RadosGW user. But OSD stability affects everybody. :-P

On Fri, Jul 18, 2014 at 2:34 PM, hjcho616 <hjcho616@yahoo.com> wrote:

>Thanks Craig. I will try this soon. BTW should I upgrade to 0.80.4 first? The MDS journal issue seems to be one of the issues I am running into.
>
>Regards,
>Hong
>
>On Friday, July 18, 2014 4:14 PM, Craig Lewis <clewis@centraldesktop.com> wrote:
>
>If osd.3, osd.4, and osd.5 are stable, your cluster should be working again. What does ceph status say?
>
>I was able to re-add a removed osd. Here's what I did on my dev cluster:
>stop ceph-osd id=0
>ceph osd down 0
>ceph osd out 0
>ceph osd rm 0
>ceph osd crush rm osd.0
>
>Now my osd tree and osd dump do not show osd.0. The cluster was degraded, but did not do any backfilling because I require 3x replication on 3 different hosts, and Ceph can't satisfy that with 2 osds.
>
>On the same host, I ran:
>ceph osd create        # Returned ID 0
>start ceph-osd id=0
>
>osd.0 started up and joined the cluster. Once peering completed, all of the PGs recovered quickly. I didn't have any writes on the cluster while I was doing this.
>
>So it looks like you can just re-create and start those deleted osds.
>
>In your situation, I would do the following. Before you start, go through this, and make sure you understand all the steps. Worst case, you can always undo this by removing the osds again, and you'll be back to where you are now.
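[Editorial note: the sequence that eventually worked in the session at the top of this thread — freeze backfill/recovery, re-create the deleted osd ids, regenerate their cephx keys, start the daemons one at a time, then unset the flags — can be collected into one hedged sketch. It assumes a sysvinit-managed Ceph 0.82 cluster like this one; the osd ids, keyring paths, and capability strings are the ones used above, so adjust them for your cluster. DRY_RUN=1 (the default here) only prints the commands.]

```shell
#!/bin/sh
# Hedged sketch only: consolidates the steps from this thread, not a
# verified procedure.  Nothing runs unless you set DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"
run() { [ "$DRY_RUN" = "1" ] && echo "$@" || "$@"; }

# 1. Freeze data movement while the osds are re-created.
#    Note: the flag is 'norecover', not 'norecovery'.
run ceph osd set nobackfill
run ceph osd set norecover
run ceph osd set noin

for id in 0 1 2; do
    # 2. Re-allocate a deleted osd id ('ceph osd create', not
    #    'ceph create osd'); each call should print the next free id.
    run ceph osd create
    # 3. Regenerate the cephx key -- without it the daemon fails with
    #    "authentication error (1) Operation not permitted".
    run ceph auth get-or-create osd.$id mon 'allow rwx' osd 'allow *' \
        -o /var/lib/ceph/osd/ceph-$id/keyring
    # 4. Start the daemon; watch 'ceph -w' before doing the next one.
    run /etc/init.d/ceph start osd.$id
done

# 5. Once the osds are up and stable, let them back in and recover.
run ceph osd unset noin
run ceph osd unset nobackfill
run ceph osd unset norecover
```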
>
>ceph osd set nobackfill
>ceph osd set norecovery
>ceph osd set noin
>ceph create osd   # Should return 0.  Abort if it doesn't.
>ceph create osd   # Should return 1.  Abort if it doesn't.
>ceph create osd   # Should return 2.  Abort if it doesn't.
>start ceph-osd id=0
>
>Watch ceph -w and top. Hopefully ceph-osd id=0 will use some CPU, then go UP, and drop to 0% cpu. If so,
>ceph osd unset noin
>restart ceph-osd id=0
>
>Now osd.0 should go UP and IN, use some CPU for a while, then drop to 0% cpu. If osd.0 drops out now, set noout, and shut it down.
>
>Set noin again, and start osd.1. When it's stable, do it again for osd.2.
>
>Once as many as possible are up and stable:
>ceph osd unset nobackfill
>ceph osd unset norecovery
>
>Now it should start recovering. If your osds start dropping out now, set noout, and shut down the ones that are having problems.
>
>The goal is to get all the stable osds up, in, and recovered. Once that's done, we can figure out what to do with the unstable osds.
>
>On Thu, Jul 17, 2014 at 9:29 PM, hjcho616 <hjcho616@yahoo.com> wrote:
>
>>Sorry Craig. I thought I sent both but the second part didn't copy right. For some reason the MDS and MON decided to stop overnight, so I started them while I was running those commands. Interestingly the MDS didn't fail at the time like it used to, so I thought something was being fixed? Then I realized the MDS probably couldn't get to the data because the OSDs were down. Now that I brought up the OSDs the MDS crashed again. =P
>>
>>$ ceph osd tree
>># id    weight  type name       up/down reweight
>>-1      5.46    root default
>>-2      0               host OSD1
>>-3      5.46            host OSD2
>>3       1.82                    osd.3   up      1
>>4       1.82                    osd.4   up      1
>>5       1.82                    osd.5   up      1
>>
>>$ ceph osd dump
>>epoch 3125
>>fsid 9b2c9bca-112e-48b0-86fc-587ef9a52948
>>created 2014-02-08 01:57:34.086532
>>modified 2014-07-17 23:24:10.823596
>>flags
>>pool 0 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 crash_replay_interval 45 stripe_width 0
>>pool 1 'metadata' replicated size 2 min_size 1 crush_ruleset 1 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 stripe_width 0
>>pool 2 'rbd' replicated size 2 min_size 1 crush_ruleset 2 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 stripe_width 0
>>max_osd 6
>>osd.3 up   in  weight 1 up_from 3120 up_thru 3122 down_at 3116 last_clean_interval [2858,3113) 192.168.1.31:6803/13623 192.168.2.31:6802/13623 192.168.2.31:6803/13623 192.168.1.31:6804/13623 exists,up 4f86a418-6c67-4cb4-83a1-6c123c890036
>>osd.4 up   in  weight 1 up_from 3121 up_thru 3122 down_at 3116 last_clean_interval [2859,3113) 192.168.1.31:6806/13991 192.168.2.31:6804/13991 192.168.2.31:6805/13991 192.168.1.31:6807/13991 exists,up 3d5e3843-7a47-44b0-b276-61c4b1d62900
>>osd.5 up   in  weight 1 up_from 3118 up_thru 3118 down_at 3116 last_clean_interval [2856,3113) 192.168.1.31:6800/13249 192.168.2.31:6800/13249 192.168.2.31:6801/13249 192.168.1.31:6801/13249 exists,up eec86483-2f35-48a4-a154-2eaf26be06b9
>>pg_temp 0.2 [4,3]
>>pg_temp 0.a [4,5]
>>pg_temp 0.c [3,4]
>>pg_temp 0.10 [3,4]
>>pg_temp 0.15 [3,5]
>>pg_temp 0.17 [3,5]
>>pg_temp 0.2f [4,5]
>>pg_temp 0.3b [4,3]
>>pg_temp 0.3c [3,5]
>>pg_temp 0.3d [4,5]
>>pg_temp 1.1 [4,3]
>>pg_temp 1.9 [4,5]
>>pg_temp 1.b [3,4]
>>pg_temp 1.14 [3,5]
>>pg_temp 1.16 [3,5]
>>pg_temp 1.2e [4,5]
>>pg_temp 1.3a [4,3]
>>pg_temp 1.3b [3,5]
>>pg_temp 1.3c [4,5]
>>pg_temp 2.0 [4,3]
>>pg_temp 2.8 [4,5]
>>pg_temp 2.a [3,4]
>>pg_temp 2.13 [3,5]
>>pg_temp 2.15 [3,5]
>>pg_temp 2.2d [4,5]
>>pg_temp 2.39 [4,3]
>>pg_temp 2.3a [3,5]
>>pg_temp 2.3b [4,5]
>>blacklist 192.168.1.20:6802/30894 expires 2014-07-17 23:48:10.823576
>>blacklist 192.168.1.20:6801/30651 expires 2014-07-17 23:47:55.562984
>>
>>Regards,
>>Hong
>>
>>On Thursday, July 17, 2014 3:30 PM, Craig Lewis <clewis@centraldesktop.com> wrote:
>>
>>You gave me 'ceph osd dump' twice.... can I see 'ceph osd tree' too?
>>
>>Why are osd.3, osd.4, and osd.5 down?
>>
>>On Thu, Jul 17, 2014 at 11:45 AM, hjcho616 <hjcho616@yahoo.com> wrote:
>>
>>>Thank you for looking at this. Below are the outputs you requested.
>>>
>>># ceph osd dump
>>>epoch 3117
>>>fsid 9b2c9bca-112e-48b0-86fc-587ef9a52948
>>>created 2014-02-08 01:57:34.086532
>>>modified 2014-07-16 22:13:04.385914
>>>flags
>>>pool 0 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 crash_replay_interval 45 stripe_width 0
>>>pool 1 'metadata' replicated size 2 min_size 1 crush_ruleset 1 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 stripe_width 0
>>>pool 2 'rbd' replicated size 2 min_size 1 crush_ruleset 2 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 stripe_width 0
>>>max_osd 6
>>>osd.3 down in  weight 1 up_from 2858 up_thru 3040 down_at 3116 last_clean_interval [2830,2851) 192.168.1.31:6803/5127 192.168.2.31:6805/5127 192.168.2.31:6806/5127 192.168.1.31:6805/5127 exists 4f86a418-6c67-4cb4-83a1-6c123c890036
>>>osd.4 down in  weight 1 up_from 2859 up_thru 3043 down_at 3116 last_clean_interval [2835,2849) 192.168.1.31:6807/5310 192.168.2.31:6807/5310 192.168.2.31:6808/5310 192.168.1.31:6808/5310 exists 3d5e3843-7a47-44b0-b276-61c4b1d62900
>>>osd.5 down in  weight 1 up_from 2856 up_thru 3042 down_at 3116 last_clean_interval [2837,2853) 192.168.1.31:6800/4969 192.168.2.31:6801/4969 192.168.2.31:6804/4969 192.168.1.31:6801/4969 exists eec86483-2f35-48a4-a154-2eaf26be06b9
>>>
>>># ceph osd dump
>>>epoch 3117
>>>fsid 9b2c9bca-112e-48b0-86fc-587ef9a52948
>>>created 2014-02-08 01:57:34.086532
>>>modified 2014-07-16 22:13:04.385914
>>>flags
>>>pool 0 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 crash_replay_interval 45 stripe_width 0
>>>pool 1 'metadata' replicated size 2 min_size 1 crush_ruleset 1 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 stripe_width 0
>>>pool 2 'rbd' replicated size 2 min_size 1 crush_ruleset 2 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 stripe_width 0
>>>max_osd 6
>>>osd.3 down in  weight 1 up_from 2858 up_thru 3040 down_at 3116 last_clean_interval [2830,2851) 192.168.1.31:6803/5127 192.168.2.31:6805/5127 192.168.2.31:6806/5127 192.168.1.31:6805/5127 exists 4f86a418-6c67-4cb4-83a1-6c123c890036
>>>osd.4 down in  weight 1 up_from 2859 up_thru 3043 down_at 3116 last_clean_interval [2835,2849) 192.168.1.31:6807/5310 192.168.2.31:6807/5310 192.168.2.31:6808/5310 192.168.1.31:6808/5310 exists 3d5e3843-7a47-44b0-b276-61c4b1d62900
>>>osd.5 down in  weight 1 up_from 2856 up_thru 3042 down_at 3116 last_clean_interval [2837,2853) 192.168.1.31:6800/4969 192.168.2.31:6801/4969 192.168.2.31:6804/4969 192.168.1.31:6801/4969 exists eec86483-2f35-48a4-a154-2eaf26be06b9
>>>
>>>Regards,
>>>Hong
>>>
>>>On Thursday, July 17, 2014 12:02 PM, Craig Lewis <clewis@centraldesktop.com> wrote:
>>>
>>>I don't believe you can re-add an OSD after `ceph osd rm`, but it's worth a shot. Let me see what I can do on my dev cluster.
>>>
>>>What do `ceph osd dump` and `ceph osd tree` say? I want to make sure I'm starting from the same point you are.
>>>
>>>On Wed, Jul 16, 2014 at 7:39 PM, hjcho616 <hjcho616@yahoo.com> wrote:
>>>
>>>>I did a "ceph osd rm" for all three but I didn't do anything else to them afterwards. Can they be added back?
>>>>
>>>>Regards,
>>>>Hong
>>>>
>>>>On Wednesday, July 16, 2014 6:54 PM, Craig Lewis <clewis@centraldesktop.com> wrote:
>>>>
>>>>For some reason you ended up in my spam folder.
>>>>That might be why you didn't get any responses.
>>>>
>>>>Have you destroyed osd.0, osd.1, and osd.2? If not, try bringing them up one at a time. You might have just one bad disk, which is much better than 50% of your disks.
>>>>
>>>>How is the ceph-osd process behaving when it hits the suicide timeout? I had some problems a while back where the ceph-osd process would start up, consume ~200% CPU for a while, then get stuck using almost exactly 100% CPU. It would get kicked out of the cluster for being unresponsive, then suicide. Repeat. If that's happening here, I can suggest some things to try.
>>>>
>>>>On Fri, Jul 11, 2014 at 9:12 PM, hjcho616 <hjcho616@yahoo.com> wrote:
>>>>
>>>>>I have 2 OSD machines with 3 OSDs running on each, and one MDS server with 3 daemons running. I ran cephfs, mostly on 0.78. One night we lost power for a split second. MDS1 and OSD2 went down; OSD1 seemed OK -- well, it turns out OSD1 suffered most. Those two machines rebooted and seemed ok except for some inconsistencies. I waited for a while; they didn't fix themselves. So I issued 'ceph pg repair pgnum'. It would try some, and some OSD would crash. I tried this for multiple days. Some PGs got fixed... but mostly it would crash an OSD and stop recovering. dmesg shows something like below.
>>>>>
>>>>>[  740.059498] traps: ceph-osd[5279] general protection ip:7f84e75ec75e sp:7fff00045bc0 error:0 in libtcmalloc.so.4.1.0[7f84e75b3000+4a000]
>>>>>
>>>>>and the ceph osd log shows something like this.
>>>>>
>>>>>    -2> 2014-07-09 20:51:01.163571 7fe0f4617700  1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fe0e8e91700' had timed out after 60
>>>>>    -1> 2014-07-09 20:51:01.163609 7fe0f4617700  1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fe0e8e91700' had suicide timed out after 180
>>>>>     0> 2014-07-09 20:51:01.169542 7fe0f4617700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7fe0f4617700 time 2014-07-09 20:51:01.163642
>>>>>common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
>>>>>
>>>>> ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
>>>>> 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2eb) [0xad2cbb]
>>>>> 2: (ceph::HeartbeatMap::is_healthy()+0xb6) [0xad34c6]
>>>>> 3: (ceph::HeartbeatMap::check_touch_file()+0x28) [0xad3aa8]
>>>>> 4: (CephContextServiceThread::entry()+0x13f) [0xb9911f]
>>>>> 5: (()+0x8062) [0x7fe0f797e062]
>>>>> 6: (clone()+0x6d) [0x7fe0f62bea3d]
>>>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>>>
>>>>>--- logging levels ---
>>>>>   0/ 5 none
>>>>>   0/ 1 lockdep
>>>>>   0/ 1 context
>>>>>   1/ 1 crush
>>>>>   1/ 5 mds
>>>>>   1/ 5 mds_balancer
>>>>>   1/ 5 mds_locker
>>>>>   1/ 5 mds_log
>>>>>   1/ 5 mds_log_expire
>>>>>   1/ 5 mds_migrator
>>>>>   0/ 1 buffer
>>>>>   0/ 1 timer
>>>>>   0/ 1 filer
>>>>>   0/ 1 striper
>>>>>   0/ 1 objecter
>>>>>   0/ 5 rados
>>>>>   0/ 5 rbd
>>>>>   0/ 5 journaler
>>>>>   0/ 5 objectcacher
>>>>>   0/ 5 client
>>>>>   0/ 5 osd
>>>>>   0/ 5 optracker
>>>>>   0/ 5 objclass
>>>>>   1/ 3 filestore
>>>>>   1/ 3 keyvaluestore
>>>>>   1/ 3 journal
>>>>>   0/ 5 ms
>>>>>   1/ 5 mon
>>>>>   0/10 monc
>>>>>   1/ 5 paxos
>>>>>   0/ 5 tp
>>>>>   1/ 5 auth
>>>>>   1/ 5 crypto
>>>>>   1/ 1 finisher
>>>>>   1/ 5 heartbeatmap
>>>>>   1/ 5 perfcounter
>>>>>   1/ 5 rgw
>>>>>   1/ 5 javaclient
>>>>>   1/ 5 asok
>>>>>   1/ 1 throttle
>>>>>  -2/-2 (syslog threshold)
>>>>>  -1/-1 (stderr threshold)
>>>>>  max_recent     10000
>>>>>  max_new         1000
>>>>>  log_file /var/log/ceph/ceph-osd.0.log
>>>>>--- end dump of recent events ---
>>>>>2014-07-09 20:51:01.534706 7fe0f4617700 -1 *** Caught signal (Aborted) **
>>>>> in thread 7fe0f4617700
>>>>>
>>>>> ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
>>>>> 1: /usr/bin/ceph-osd() [0xaac562]
>>>>> 2: (()+0xf880) [0x7fe0f7985880]
>>>>> 3: (gsignal()+0x39) [0x7fe0f620e3a9]
>>>>> 4: (abort()+0x148) [0x7fe0f62114c8]
>>>>> 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fe0f6afb5e5]
>>>>> 6: (()+0x5e746) [0x7fe0f6af9746]
>>>>> 7: (()+0x5e773) [0x7fe0f6af9773]
>>>>> 8: (()+0x5e9b2) [0x7fe0f6af99b2]
>>>>> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x40a) [0xb85b6a]
>>>>> 10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2eb) [0xad2cbb]
>>>>> 11: (ceph::HeartbeatMap::is_healthy()+0xb6) [0xad34c6]
>>>>> 12: (ceph::HeartbeatMap::check_touch_file()+0x28) [0xad3aa8]
>>>>> 13: (CephContextServiceThread::entry()+0x13f) [0xb9911f]
>>>>> 14: (()+0x8062) [0x7fe0f797e062]
>>>>> 15: (clone()+0x6d) [0x7fe0f62bea3d]
>>>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>>>
>>>>>--- begin dump of recent events ---
>>>>>     0> 2014-07-09 20:51:01.534706 7fe0f4617700 -1 *** Caught signal (Aborted) **
>>>>> in thread 7fe0f4617700
>>>>>
>>>>> ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
>>>>> 1: /usr/bin/ceph-osd() [0xaac562]
>>>>> 2: (()+0xf880) [0x7fe0f7985880]
>>>>> 3: (gsignal()+0x39) [0x7fe0f620e3a9]
>>>>> 4: (abort()+0x148) [0x7fe0f62114c8]
>>>>> 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fe0f6afb5e5]
>>>>> 6: (()+0x5e746) [0x7fe0f6af9746]
>>>>> 7: (()+0x5e773) [0x7fe0f6af9773]
>>>>> 8: (()+0x5e9b2) [0x7fe0f6af99b2]
>>>>> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x40a) [0xb85b6a]
>>>>> 10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2eb) [0xad2cbb]
>>>>> 11: (ceph::HeartbeatMap::is_healthy()+0xb6) [0xad34c6]
>>>>> 12: (ceph::HeartbeatMap::check_touch_file()+0x28) [0xad3aa8]
>>>>> 13: (CephContextServiceThread::entry()+0x13f) [0xb9911f]
>>>>> 14: (()+0x8062) [0x7fe0f797e062]
>>>>> 15: (clone()+0x6d) [0x7fe0f62bea3d]
>>>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>>>
>>>>>--- logging levels ---
>>>>>   0/ 5 none
>>>>>   0/ 1 lockdep
>>>>>   0/ 1 context
>>>>>   1/ 1 crush
>>>>>   1/ 5 mds
>>>>>   1/ 5 mds_balancer
>>>>>   1/ 5 mds_locker
>>>>>   1/ 5 mds_log
>>>>>   1/ 5 mds_log_expire
>>>>>   1/ 5 mds_migrator
>>>>>   0/ 1 buffer
>>>>>   0/ 1 timer
>>>>>   0/ 1 filer
>>>>>   0/ 1 striper
>>>>>   0/ 1 objecter
>>>>>   0/ 5 rados
>>>>>   0/ 5 rbd
>>>>>   0/ 5 journaler
>>>>>   0/ 5 objectcacher
>>>>>   0/ 5 client
>>>>>   0/ 5 osd
>>>>>   0/ 5 optracker
>>>>>   0/ 5 objclass
>>>>>   1/ 3 filestore
>>>>>   1/ 3 keyvaluestore
>>>>>   1/ 3 journal
>>>>>   0/ 5 ms
>>>>>   1/ 5 mon
>>>>>   0/10 monc
>>>>>   1/ 5 paxos
>>>>>   0/ 5 tp
>>>>>   1/ 5 auth
>>>>>   1/ 5 crypto
>>>>>   1/ 1 finisher
>>>>>   1/ 5 heartbeatmap
>>>>>   1/ 5 perfcounter
>>>>>   1/ 5 rgw
>>>>>   1/ 5 javaclient
>>>>>   1/ 5 asok
>>>>>   1/ 1 throttle
>>>>>   -2/-2 (syslog threshold)
>>>>>   -1/-1 (stderr threshold)
>>>>>   max_recent     10000
>>>>>   max_new         1000
>>>>>   log_file /var/log/ceph/ceph-osd.0.log
>>>>> --- end dump of recent events ---
>>>>>
>>>>> After several attempts, osd.2 (which was on OSD1, the node that survived the power event) never comes up. It looks like its journal was corrupted:
>>>>>
>>>>>     -1> 2014-07-09 20:44:14.992840 7f12256b67c0 -1 journal Unable to read past sequence 2157634 but header indicates the journal has committed up through 2157670, journal is corrupt
>>>>>      0> 2014-07-09 20:44:14.998742 7f12256b67c0 -1 os/FileJournal.cc: In function 'bool FileJournal::read_entry(ceph::bufferlist&, uint64_t&, bool*)' thread 7f12256b67c0 time 2014-07-09 20:44:14.993082
>>>>> os/FileJournal.cc: 1677: FAILED assert(0)
>>>>>
>>>>>  ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
>>>>>  1: (FileJournal::read_entry(ceph::buffer::list&, unsigned long&, bool*)+0x467) [0xa8d497]
>>>>>  2: (JournalingObjectStore::journal_replay(unsigned long)+0x22e) [0x9dfebe]
>>>>>  3: (FileStore::mount()+0x32c9) [0x9b7939]
>>>>>  4: (OSD::do_convertfs(ObjectStore*)+0x1a) [0x78d8fa]
>>>>>  5: (main()+0x2237) [0x730837]
>>>>>  6: (__libc_start_main()+0xf5) [0x7f12236bdb45]
>>>>>  7: /usr/bin/ceph-osd() [0x734479]
>>>>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>>>
>>>>> --- logging levels ---
>>>>> [...]
>>>>>   log_file /var/log/ceph/ceph-osd.2.log
>>>>> --- end dump of recent events ---
>>>>> 2014-07-09 20:44:15.010090 7f12256b67c0 -1 *** Caught signal (Aborted) **
>>>>>  in thread 7f12256b67c0
>>>>>  ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
>>>>>  1: /usr/bin/ceph-osd() [0xaac562]
>>>>>  2: (()+0xf880) [0x7f1224e48880]
>>>>>  3: (gsignal()+0x39) [0x7f12236d13a9]
>>>>>  4: (abort()+0x148) [0x7f12236d44c8]
>>>>>  5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f1223fbe5e5]
>>>>>  6: (()+0x5e746) [0x7f1223fbc746]
>>>>>  7: (()+0x5e773) [0x7f1223fbc773]
>>>>>  8: (()+0x5e9b2) [0x7f1223fbc9b2]
>>>>>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x40a) [0xb85b6a]
>>>>>  10: (FileJournal::read_entry(ceph::buffer::list&, unsigned long&, bool*)+0x467) [0xa8d497]
>>>>>  11: (JournalingObjectStore::journal_replay(unsigned long)+0x22e) [0x9dfebe]
>>>>>  12: (FileStore::mount()+0x32c9) [0x9b7939]
>>>>>  13: (OSD::do_convertfs(ObjectStore*)+0x1a) [0x78d8fa]
>>>>>  14: (main()+0x2237) [0x730837]
>>>>>  15: (__libc_start_main()+0xf5) [0x7f12236bdb45]
>>>>>  16: /usr/bin/ceph-osd() [0x734479]
>>>>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>>>
>>>>> --- begin dump of recent events ---
>>>>>      0> 2014-07-09 20:44:15.010090 7f12256b67c0 -1 *** Caught signal (Aborted) **
>>>>>  in thread 7f12256b67c0
>>>>> [...]
>>>>>
>>>>> --- logging levels ---
>>>>> [...]
>>>>>   log_file /var/log/ceph/ceph-osd.2.log
>>>>> --- end dump of recent events ---
>>>>>
>>>>> So I thought maybe upgrading to 0.82 would give it a better chance of fixing things, so I did. Now not only do those OSDs fail (osd.1 is up, but with only 14M of memory, so I assume it is broken too), the MDS fails as well.
>>>>>
>>>>> # /usr/bin/ceph-mds -i MDS1 --pid-file /var/run/ceph/mds.MDS1.pid -c /etc/ceph/ceph.conf --cluster ceph -f --debug-mds=20 --debug-journaler=10
>>>>> starting mds.MDS1 at :/0
>>>>> mds/MDLog.cc: In function 'void MDLog::_replay_thread()' thread 7f8e07c21700 time 2014-07-09 21:01:10.190965
>>>>> mds/MDLog.cc: 815: FAILED assert(journaler->is_readable())
>>>>>  ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
>>>>>  1: (MDLog::_replay_thread()+0x197b) [0x85a3cb]
>>>>>  2: (MDLog::ReplayThread::entry()+0xd) [0x66466d]
>>>>>  3: (()+0x8062) [0x7f8e0fe3f062]
>>>>>  4: (clone()+0x6d) [0x7f8e0ebd3a3d]
>>>>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>>> 2014-07-09 21:01:10.192936 7f8e07c21700 -1 mds/MDLog.cc: In function 'void MDLog::_replay_thread()' thread 7f8e07c21700 time 2014-07-09 21:01:10.190965
>>>>> mds/MDLog.cc: 815: FAILED assert(journaler->is_readable())
>>>>> [...]
>>>>>      0> 2014-07-09 21:01:10.192936 7f8e07c21700 -1 mds/MDLog.cc: In function 'void MDLog::_replay_thread()' thread 7f8e07c21700 time 2014-07-09 21:01:10.190965
>>>>> mds/MDLog.cc: 815: FAILED assert(journaler->is_readable())
>>>>> [...]
>>>>>
>>>>> terminate called after throwing an instance of 'ceph::FailedAssertion'
>>>>> *** Caught signal (Aborted) **
>>>>>  in thread 7f8e07c21700
>>>>>  ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
>>>>>  1: /usr/bin/ceph-mds() [0x8d81f2]
>>>>>  2: (()+0xf880) [0x7f8e0fe46880]
>>>>>  3: (gsignal()+0x39) [0x7f8e0eb233a9]
>>>>>  4: (abort()+0x148) [0x7f8e0eb264c8]
>>>>>  5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f8e0f4105e5]
>>>>>  6: (()+0x5e746) [0x7f8e0f40e746]
>>>>>  7: (()+0x5e773) [0x7f8e0f40e773]
>>>>>  8: (()+0x5e9b2) [0x7f8e0f40e9b2]
>>>>>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x40a) [0x9ab5da]
>>>>>  10: (MDLog::_replay_thread()+0x197b) [0x85a3cb]
>>>>>  11: (MDLog::ReplayThread::entry()+0xd) [0x66466d]
>>>>>  12: (()+0x8062) [0x7f8e0fe3f062]
>>>>>  13: (clone()+0x6d) [0x7f8e0ebd3a3d]
>>>>> 2014-07-09 21:01:10.201968 7f8e07c21700 -1 *** Caught signal (Aborted) **
>>>>>  in thread 7f8e07c21700
>>>>> [...]
>>>>>
>>>>> Aborted
>>>>> root at MDS1:/var/log/ceph# /usr/bin/ceph-mds -i MDS1 --pid-file /var/run/ceph/mds.MDS1.pid -c /etc/ceph/ceph.conf --cluster ceph -f --debug-mds=20 --debug-journaler=10
>>>>> starting mds.MDS1 at :/0
>>>>> mds/MDLog.cc: In function 'void MDLog::_replay_thread()' thread 7fb7f7b83700 time 2014-07-09 23:21:43.383304
>>>>> mds/MDLog.cc: 815: FAILED assert(journaler->is_readable())
>>>>>  ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
>>>>>  1: (MDLog::_replay_thread()+0x197b) [0x85a3cb]
>>>>>  2: (MDLog::ReplayThread::entry()+0xd) [0x66466d]
>>>>>  3: (()+0x8062) [0x7fb7ffda1062]
>>>>>  4: (clone()+0x6d) [0x7fb7feb35a3d]
>>>>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>>> 2014-07-09 23:21:43.385274 7fb7f7b83700 -1 mds/MDLog.cc: In function 'void MDLog::_replay_thread()' thread 7fb7f7b83700 time 2014-07-09 23:21:43.383304
>>>>> mds/MDLog.cc: 815: FAILED assert(journaler->is_readable())
>>>>> [...]
>>>>>      0> 2014-07-09 23:21:43.385274 7fb7f7b83700 -1 mds/MDLog.cc: In function 'void MDLog::_replay_thread()' thread 7fb7f7b83700 time 2014-07-09 23:21:43.383304
>>>>> mds/MDLog.cc: 815: FAILED assert(journaler->is_readable())
>>>>> [...]
>>>>> terminate called after throwing an instance of 'ceph::FailedAssertion'
>>>>> *** Caught signal (Aborted) **
>>>>>  in thread 7fb7f7b83700
>>>>> [...]
>>>>> Aborted
>>>>>
>>>>> It felt like OSD1 was trashed, so I removed osd.0, osd.1, and osd.2.
>>>>>
>>>>> I am still seeing the following, and cannot get the MDS up:
>>>>>
>>>>> HEALTH_ERR 154 pgs degraded; 38 pgs inconsistent; 192 pgs stuck unclean; recovery 1374024/3513098 objects degraded (39.111%); 1374 scrub errors; mds cluster is degraded; mds MDS1 is laggy
>>>>>
>>>>> Is there something I can try to bring this file system up again? =P I would like to access some of that data again. Let me know if you need any additional info. I was running Debian kernel 3.13.1 for the first part, then 3.14.1 when I upgraded Ceph to 0.82.
>>>>>
>>>>> Regards,
>>>>> Hong
>>>>>
>>>>> _______________________________________________
>>>>> ceph-users mailing list
>>>>> ceph-users at lists.ceph.com
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
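The "journal is corrupt" failure quoted above ("Unable to read past sequence 2157634 but header indicates the journal has committed up through 2157670") means the OSD aborts while replaying a torn journal at mount time. If the underlying filestore is otherwise intact, one lossy, last-resort step sometimes tried in this situation is to discard and recreate the journal. This is only a sketch, not a recommendation: it throws away any updates that were journaled but not yet applied, and the OSD id and init-script style below are taken from the logs in this thread.

```shell
# LAST RESORT, lossy: recreating a corrupt journal discards any writes
# that were only in the journal. Stop the affected OSD first.
/etc/init.d/ceph stop osd.2

# Try a clean flush first; on a corrupt journal this will likely abort
# with the same FileJournal::read_entry assert seen above.
ceph-osd -i 2 --flush-journal

# Recreate an empty journal for osd.2, then restart it and let the
# cluster scrub/backfill to catch the store up.
ceph-osd -i 2 --mkjournal
/etc/init.d/ceph start osd.2
```

Whether this is safe depends on how far the filestore itself lags the journal, so it is worth waiting for advice from the list before running it on the only surviving copy of the data.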