It is for a valid pool; however, the up and acting sets for 2.14 both show OSDs 8 & 7. I'll take a look at 7 & 8 and see if they are good.
If so, its presence on osd.3 could be an artifact of a previous topology, and I could mv it off osd.3.
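Roughly the sort of thing I have in mind, in case anyone sees a problem with it (default deployment paths and sysvinit service syntax assumed here; a matching 2.14_TEMP dir, if one exists, would get the same treatment):

service ceph stop osd.3
mv /var/lib/ceph/osd/ceph-3/current/2.14_head \
   /var/lib/ceph/osd/ceph-3/stray.2.14_head
service ceph start osd.3

Keeping the directory on the same mount makes the mv a cheap rename rather than a copy, and it stays recoverable until I'm sure 7 & 8 are healthy.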
Thanks very much for the assistance!
Berant
On Tuesday, May 19, 2015, Samuel Just <sjust@xxxxxxxxxx> wrote:
If 2.14 is part of a non-existent pool, you should be able to rename it out of current/ in the osd directory to prevent the osd from seeing it on startup.
-Sam
----- Original Message -----
From: "Berant Lemmenes" <berant@xxxxxxxxxxxx>
To: "Samuel Just" <sjust@xxxxxxxxxx>
Cc: ceph-users@xxxxxxxxxxxxxx
Sent: Tuesday, May 19, 2015 12:58:30 PM
Subject: Re: OSD unable to start (giant -> hammer)
Hello,
So here are the steps I performed and where I sit now.
Step 1) Used 'ceph-objectstore-tool list' to create a list of all PGs not
associated with the three pools (rbd, data, metadata) actually in use on
this cluster.
Step 2) Did a 'ceph-objectstore-tool remove' of those PGs.
Then, when starting the OSD, it complained about PGs that were NOT in the
'ceph-objectstore-tool list' output but WERE present on the filesystem of
the OSD in question.
Step 3) Iterated over all of the PGs that were on disk with
'ceph-objectstore-tool info' and made a list of all PGs that returned ENOENT.
Step 4) Used 'ceph-objectstore-tool remove' to remove all of those as well.
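For anyone following along, steps 1-4 amounted to roughly the sketch below
(the pool IDs 0/1/2 for data/metadata/rbd are the stock defaults and an
assumption on my part, as is my reading of the hammer-era tool flags, so
double-check both before running anything):

OSD=/var/lib/ceph/osd/ceph-3
JOURNAL=$OSD/journal
# Steps 1 & 2: list the PGs, keep those outside pools 0/1/2, remove them.
ceph-objectstore-tool --data-path $OSD --journal-path $JOURNAL \
    --op list-pgs | grep -Ev '^[012]\.' > /tmp/stale-pgs
while read pg; do
    ceph-objectstore-tool --data-path $OSD --journal-path $JOURNAL \
        --pgid "$pg" --op remove
done < /tmp/stale-pgs
# Steps 3 & 4: for PG dirs still on disk, remove any whose 'info' hits ENOENT.
for dir in $OSD/current/*_head; do
    pg=$(basename "$dir" _head)
    ceph-objectstore-tool --data-path $OSD --journal-path $JOURNAL \
        --pgid "$pg" --op info > /dev/null ||
    ceph-objectstore-tool --data-path $OSD --journal-path $JOURNAL \
        --pgid "$pg" --op remove
done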
Now when starting osd.3 I get an 'unable to load metadata' error for a PG
that, according to 'ceph pg 2.14 query', is not present (and shouldn't be)
on osd.3. Shown below with OSD debugging at 20:
<snip>
-23> 2015-05-19 15:15:12.712036 7fb079a20780 20 read_log 39533'174051
(39533'174050) modify 49277412/rb.0.100f.2ae8944a.000000029945/head//2 by
client.18119.0:2811937 2015-05-18 07:18:42.859501
-22> 2015-05-19 15:15:12.712066 7fb079a20780 20 read_log 39533'174052
(39533'174051) modify 49277412/rb.0.100f.2ae8944a.000000029945/head//2 by
client.18119.0:2812374 2015-05-18 07:33:21.973157
-21> 2015-05-19 15:15:12.712096 7fb079a20780 20 read_log 39533'174053
(39533'174052) modify 49277412/rb.0.100f.2ae8944a.000000029945/head//2 by
client.18119.0:2812861 2015-05-18 07:48:23.098343
-20> 2015-05-19 15:15:12.712127 7fb079a20780 20 read_log 39533'174054
(39533'174053) modify 49277412/rb.0.100f.2ae8944a.000000029945/head//2 by
client.18119.0:2813371 2015-05-18 08:03:54.226512
-19> 2015-05-19 15:15:12.712157 7fb079a20780 20 read_log 39533'174055
(39533'174054) modify 49277412/rb.0.100f.2ae8944a.000000029945/head//2 by
client.18119.0:2813922 2015-05-18 08:18:20.351421
-18> 2015-05-19 15:15:12.712187 7fb079a20780 20 read_log 39533'174056
(39533'174055) modify 49277412/rb.0.100f.2ae8944a.000000029945/head//2 by
client.18119.0:2814396 2015-05-18 08:33:56.476035
-17> 2015-05-19 15:15:12.712221 7fb079a20780 20 read_log 39533'174057
(39533'174056) modify 49277412/rb.0.100f.2ae8944a.000000029945/head//2 by
client.18119.0:2814971 2015-05-18 08:48:22.605674
-16> 2015-05-19 15:15:12.712252 7fb079a20780 20 read_log 39533'174058
(39533'174057) modify 49277412/rb.0.100f.2ae8944a.000000029945/head//2 by
client.18119.0:2815407 2015-05-18 09:02:48.720181
-15> 2015-05-19 15:15:12.712282 7fb079a20780 20 read_log 39533'174059
(39533'174058) modify 49277412/rb.0.100f.2ae8944a.000000029945/head//2 by
client.18119.0:2815434 2015-05-18 09:03:43.727839
-14> 2015-05-19 15:15:12.712312 7fb079a20780 20 read_log 39533'174060
(39533'174059) modify 49277412/rb.0.100f.2ae8944a.000000029945/head//2 by
client.18119.0:2815889 2015-05-18 09:17:49.846406
-13> 2015-05-19 15:15:12.712342 7fb079a20780 20 read_log 39533'174061
(39533'174060) modify 49277412/rb.0.100f.2ae8944a.000000029945/head//2 by
client.18119.0:2816358 2015-05-18 09:32:50.969457
-12> 2015-05-19 15:15:12.712372 7fb079a20780 20 read_log 39533'174062
(39533'174061) modify 49277412/rb.0.100f.2ae8944a.000000029945/head//2 by
client.18119.0:2816840 2015-05-18 09:47:52.091524
-11> 2015-05-19 15:15:12.712403 7fb079a20780 20 read_log 39533'174063
(39533'174062) modify 49277412/rb.0.100f.2ae8944a.000000029945/head//2 by
client.18119.0:2816861 2015-05-18 09:48:22.096309
-10> 2015-05-19 15:15:12.712433 7fb079a20780 20 read_log 39533'174064
(39533'174063) modify 49277412/rb.0.100f.2ae8944a.000000029945/head//2 by
client.18119.0:2817714 2015-05-18 10:02:53.222749
-9> 2015-05-19 15:15:12.713130 7fb079a20780 10 read_log done
-8> 2015-05-19 15:15:12.713550 7fb079a20780 10 osd.3 pg_epoch: 39533
pg[2.12( v 39533'174064 (37945'171063,39533'174064] local-les=39529 n=101
ec=1 les/c 39529/39529 39526/39526/39526) [9,3,10] r=1 lpr=0
pi=37959-39525/7 crt=39533'174062 lcod 0'0 inactive] handle_loaded
-7> 2015-05-19 15:15:12.713570 7fb079a20780 5 osd.3 pg_epoch: 39533
pg[2.12( v 39533'174064 (37945'171063,39533'174064] local-les=39529 n=101
ec=1 les/c 39529/39529 39526/39526/39526) [9,3,10] r=1 lpr=0
pi=37959-39525/7 crt=39533'174062 lcod 0'0 inactive NOTIFY] exit Initial
0.097986 0 0.000000
-6> 2015-05-19 15:15:12.713587 7fb079a20780 5 osd.3 pg_epoch: 39533
pg[2.12( v 39533'174064 (37945'171063,39533'174064] local-les=39529 n=101
ec=1 les/c 39529/39529 39526/39526/39526) [9,3,10] r=1 lpr=0
pi=37959-39525/7 crt=39533'174062 lcod 0'0 inactive NOTIFY] enter Reset
-5> 2015-05-19 15:15:12.713601 7fb079a20780 20 osd.3 pg_epoch: 39533
pg[2.12( v 39533'174064 (37945'171063,39533'174064] local-les=39529 n=101
ec=1 les/c 39529/39529 39526/39526/39526) [9,3,10] r=1 lpr=0
pi=37959-39525/7 crt=39533'174062 lcod 0'0 inactive NOTIFY]
set_last_peering_reset 39533
-4> 2015-05-19 15:15:12.713614 7fb079a20780 10 osd.3 pg_epoch: 39533
pg[2.12( v 39533'174064 (37945'171063,39533'174064] local-les=39529 n=101
ec=1 les/c 39529/39529 39526/39526/39526) [9,3,10] r=1 lpr=39533
pi=37959-39525/7 crt=39533'174062 lcod 0'0 inactive NOTIFY] Clearing
blocked outgoing recovery messages
-3> 2015-05-19 15:15:12.713629 7fb079a20780 10 osd.3 pg_epoch: 39533
pg[2.12( v 39533'174064 (37945'171063,39533'174064] local-les=39529 n=101
ec=1 les/c 39529/39529 39526/39526/39526) [9,3,10] r=1 lpr=39533
pi=37959-39525/7 crt=39533'174062 lcod 0'0 inactive NOTIFY] Not blocking
outgoing recovery messages
-2> 2015-05-19 15:15:12.713643 7fb079a20780 10 osd.3 39533 load_pgs
loaded pg[2.12( v 39533'174064 (37945'171063,39533'174064] local-les=39529
n=101 ec=1 les/c 39529/39529 39526/39526/39526) [9,3,10] r=1 lpr=39533
pi=37959-39525/7 crt=39533'174062 lcod 0'0 inactive NOTIFY]
log((37945'171063,39533'174064], crt=39533'174062)
-1> 2015-05-19 15:15:12.713658 7fb079a20780 10 osd.3 39533 pgid 2.14
coll 2.14_head
0> 2015-05-19 15:15:12.716475 7fb079a20780 -1 osd/PG.cc: In function
'static epoch_t PG::peek_map_epoch(ObjectStore*, spg_t, ceph::bufferlist*)'
thread 7fb079a20780 time 2015-05-19 15:15:12.715425
osd/PG.cc: 2860: FAILED assert(0 == "unable to open pg metadata")
ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x7f) [0xb1784f]
2: (PG::peek_map_epoch(ObjectStore*, spg_t, ceph::buffer::list*)+0xb28)
[0x793dd8]
3: (OSD::load_pgs()+0x147f) [0x683dff]
4: (OSD::init()+0x1448) [0x6930b8]
5: (main()+0x26b9) [0x62fd89]
6: (__libc_start_main()+0xed) [0x7fb07767876d]
7: ceph-osd() [0x635679]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
to interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_replay
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
20/20 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 keyvaluestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 xio
-2/-2 (syslog threshold)
99/99 (stderr threshold)
max_recent 10000
max_new 1000
log_file
--- end dump of recent events ---
terminate called after throwing an instance of 'ceph::FailedAssertion'
*** Caught signal (Aborted) **
in thread 7fb079a20780
ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
1: ceph-osd() [0xa1fe55]
2: (()+0xfcb0) [0x7fb078a60cb0]
3: (gsignal()+0x35) [0x7fb07768d0d5]
4: (abort()+0x17b) [0x7fb07769083b]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fb077fde69d]
6: (()+0xb5846) [0x7fb077fdc846]
7: (()+0xb5873) [0x7fb077fdc873]
8: (()+0xb596e) [0x7fb077fdc96e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x259) [0xb17a29]
10: (PG::peek_map_epoch(ObjectStore*, spg_t, ceph::buffer::list*)+0xb28)
[0x793dd8]
11: (OSD::load_pgs()+0x147f) [0x683dff]
12: (OSD::init()+0x1448) [0x6930b8]
13: (main()+0x26b9) [0x62fd89]
14: (__libc_start_main()+0xed) [0x7fb07767876d]
15: ceph-osd() [0x635679]
2015-05-19 15:15:12.812704 7fb079a20780 -1 *** Caught signal (Aborted) **
in thread 7fb079a20780
ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
1: ceph-osd() [0xa1fe55]
2: (()+0xfcb0) [0x7fb078a60cb0]
3: (gsignal()+0x35) [0x7fb07768d0d5]
4: (abort()+0x17b) [0x7fb07769083b]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fb077fde69d]
6: (()+0xb5846) [0x7fb077fdc846]
7: (()+0xb5873) [0x7fb077fdc873]
8: (()+0xb596e) [0x7fb077fdc96e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x259) [0xb17a29]
10: (PG::peek_map_epoch(ObjectStore*, spg_t, ceph::buffer::list*)+0xb28)
[0x793dd8]
11: (OSD::load_pgs()+0x147f) [0x683dff]
12: (OSD::init()+0x1448) [0x6930b8]
13: (main()+0x26b9) [0x62fd89]
14: (__libc_start_main()+0xed) [0x7fb07767876d]
15: ceph-osd() [0x635679]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
to interpret this.
--- begin dump of recent events ---
0> 2015-05-19 15:15:12.812704 7fb079a20780 -1 *** Caught signal
(Aborted) **
in thread 7fb079a20780
ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
1: ceph-osd() [0xa1fe55]
2: (()+0xfcb0) [0x7fb078a60cb0]
3: (gsignal()+0x35) [0x7fb07768d0d5]
4: (abort()+0x17b) [0x7fb07769083b]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fb077fde69d]
6: (()+0xb5846) [0x7fb077fdc846]
7: (()+0xb5873) [0x7fb077fdc873]
8: (()+0xb596e) [0x7fb077fdc96e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x259) [0xb17a29]
10: (PG::peek_map_epoch(ObjectStore*, spg_t, ceph::buffer::list*)+0xb28)
[0x793dd8]
11: (OSD::load_pgs()+0x147f) [0x683dff]
12: (OSD::init()+0x1448) [0x6930b8]
13: (main()+0x26b9) [0x62fd89]
14: (__libc_start_main()+0xed) [0x7fb07767876d]
15: ceph-osd() [0x635679]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
to interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_replay
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
20/20 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 keyvaluestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 xio
-2/-2 (syslog threshold)
99/99 (stderr threshold)
max_recent 10000
max_new 1000
log_file
--- end dump of recent events ---
Here is the PG info for 2.14:
ceph pg 2.14 query
{ "state": "active+undersized+degraded",
"snap_trimq": "[]",
"epoch": 39556,
"up": [
8,
7],
"acting": [
8,
7],
"actingbackfill": [
"7",
"8"],
"info": { "pgid": "2.14",
"last_update": "39533'175859",
"last_complete": "39533'175859",
"log_tail": "36964'172858",
"last_user_version": 175859,
"last_backfill": "MAX",
"purged_snaps": "[]",
"history": { "epoch_created": 1,
"last_epoch_started": 39536,
"last_epoch_clean": 39536,
"last_epoch_split": 0,
"same_up_since": 39534,
"same_interval_since": 39534,
"same_primary_since": 39527,
"last_scrub": "39533'175859",
"last_scrub_stamp": "2015-05-18 05:23:02.952523",
"last_deep_scrub": "39533'175859",
"last_deep_scrub_stamp": "2015-05-18 05:23:02.952523",
"last_clean_scrub_stamp": "2015-05-18 05:23:02.952523"},
"stats": { "version": "39533'175859",
"reported_seq": "281883",
"reported_epoch": "39556",
"state": "active+undersized+degraded",
"last_fresh": "2015-05-19 06:41:09.002111",
"last_change": "2015-05-18 10:19:22.277851",
"last_active": "2015-05-19 06:41:09.002111",
"last_clean": "2015-05-18 06:41:38.906417",
"last_became_active": "2013-05-07 04:23:31.972742",
"last_unstale": "2015-05-19 06:41:09.002111",
"last_undegraded": "2015-05-18 10:18:37.449550",
"last_fullsized": "2015-05-18 10:18:37.449550",
"mapping_epoch": 39527,
"log_start": "36964'172858",
"ondisk_log_start": "36964'172858",
"created": 1,
"last_epoch_clean": 39536,
"parent": "0.0",
"parent_split_bits": 0,
"last_scrub": "39533'175859",
"last_scrub_stamp": "2015-05-18 05:23:02.952523",
"last_deep_scrub": "39533'175859",
"last_deep_scrub_stamp": "2015-05-18 05:23:02.952523",
"last_clean_scrub_stamp": "2015-05-18 05:23:02.952523",
"log_size": 3001,
"ondisk_log_size": 3001,
"stats_invalid": "0",
"stat_sum": { "num_bytes": 441982976,
"num_objects": 106,
"num_object_clones": 0,
"num_object_copies": 318,
"num_objects_missing_on_primary": 0,
"num_objects_degraded": 106,
"num_objects_misplaced": 0,
"num_objects_unfound": 0,
"num_objects_dirty": 11,
"num_whiteouts": 0,
"num_read": 61399,
"num_read_kb": 1285319,
"num_write": 135192,
"num_write_kb": 2422029,
"num_scrub_errors": 0,
"num_shallow_scrub_errors": 0,
"num_deep_scrub_errors": 0,
"num_objects_recovered": 79,
"num_bytes_recovered": 329883648,
"num_keys_recovered": 0,
"num_objects_omap": 0,
"num_objects_hit_set_archive": 0,
"num_bytes_hit_set_archive": 0},
"stat_cat_sum": {},
"up": [
8,
7],
"acting": [
8,
7],
"blocked_by": [],
"up_primary": 8,
"acting_primary": 8},
"empty": 0,
"dne": 0,
"incomplete": 0,
"last_epoch_started": 39536,
"hit_set_history": { "current_last_update": "0'0",
"current_last_stamp": "0.000000",
"current_info": { "begin": "0.000000",
"end": "0.000000",
"version": "0'0"},
"history": []}},
"peer_info": [
{ "peer": "7",
"pgid": "2.14",
"last_update": "39533'175859",
"last_complete": "39533'175859",
"log_tail": "36964'172858",
"last_user_version": 175859,
"last_backfill": "MAX",
"purged_snaps": "[]",
"history": { "epoch_created": 1,
"last_epoch_started": 39536,
"last_epoch_clean": 39536,
"last_epoch_split": 0,
"same_up_since": 39534,
"same_interval_since": 39534,
"same_primary_since": 39527,
"last_scrub": "39533'175859",
"last_scrub_stamp": "2015-05-18 05:23:02.952523",
"last_deep_scrub": "39533'175859",
"last_deep_scrub_stamp": "2015-05-18 05:23:02.952523",
"last_clean_scrub_stamp": "2015-05-18 05:23:02.952523"},
"stats": { "version": "39533'175858",
"reported_seq": "281598",
"reported_epoch": "39533",
"state": "active+clean",
"last_fresh": "2015-05-13 21:58:43.553887",
"last_change": "2015-05-12 22:50:16.011917",
"last_active": "2015-05-13 21:58:43.553887",
"last_clean": "2015-05-13 21:58:43.553887",
"last_became_active": "2013-05-07 04:23:31.972742",
"last_unstale": "2015-05-13 21:58:43.553887",
"last_undegraded": "2015-05-13 21:58:43.553887",
"last_fullsized": "2015-05-13 21:58:43.553887",
"mapping_epoch": 39527,
"log_start": "36964'172857",
"ondisk_log_start": "36964'172857",
"created": 1,
"last_epoch_clean": 39529,
"parent": "0.0",
"parent_split_bits": 0,
"last_scrub": "39533'175857",
"last_scrub_stamp": "2015-05-12 22:50:16.011867",
"last_deep_scrub": "39533'175856",
"last_deep_scrub_stamp": "2015-05-10 10:30:24.933431",
"last_clean_scrub_stamp": "2015-05-12 22:50:16.011867",
"log_size": 3001,
"ondisk_log_size": 3001,
"stats_invalid": "0",
"stat_sum": { "num_bytes": 441982976,
"num_objects": 106,
"num_object_clones": 0,
"num_object_copies": 315,
"num_objects_missing_on_primary": 0,
"num_objects_degraded": 0,
"num_objects_misplaced": 0,
"num_objects_unfound": 0,
"num_objects_dirty": 11,
"num_whiteouts": 0,
"num_read": 61157,
"num_read_kb": 1281187,
"num_write": 135192,
"num_write_kb": 2422029,
"num_scrub_errors": 0,
"num_shallow_scrub_errors": 0,
"num_deep_scrub_errors": 0,
"num_objects_recovered": 79,
"num_bytes_recovered": 329883648,
"num_keys_recovered": 0,
"num_objects_omap": 0,
"num_objects_hit_set_archive": 0,
"num_bytes_hit_set_archive": 0},
"stat_cat_sum": {},
"up": [
8,
7],
"acting": [
8,
7],
"blocked_by": [],
"up_primary": 8,
"acting_primary": 8},
"empty": 0,
"dne": 0,
"incomplete": 0,
"last_epoch_started": 39536,
"hit_set_history": { "current_last_update": "0'0",
"current_last_stamp": "0.000000",
"current_info": { "begin": "0.000000",
"end": "0.000000",
"version": "0'0"},
"history": []}}],
"recovery_state": [
{ "name": "Started\/Primary\/Active",
"enter_time": "2015-05-18 10:18:37.449561",
"might_have_unfound": [],
"recovery_progress": { "backfill_targets": [],
"waiting_on_backfill": [],
"last_backfill_started": "0\/\/0\/\/-1",
"backfill_info": { "begin": "0\/\/0\/\/-1",
"end": "0\/\/0\/\/-1",
"objects": []},
"peer_backfill_info": [],
"backfills_in_flight": [],
"recovering": [],
"pg_backend": { "pull_from_peer": [],
"pushing": []}},
"scrub": { "scrubber.epoch_start": "39527",
"scrubber.active": 0,
"scrubber.block_writes": 0,
"scrubber.waiting_on": 0,
"scrubber.waiting_on_whom": []}},
{ "name": "Started",
"enter_time": "2015-05-18 10:18:05.335040"}],
"agent_state": {}}
On Mon, May 18, 2015 at 2:34 PM, Berant Lemmenes <berant@xxxxxxxxxxxx>
wrote:
> Sam,
>
> Thanks for taking a look. It does seem to fit my issue. Would just
> removing the 5.0_head directory be appropriate, or would using
> ceph-objectstore-tool be better?
>
> Thanks,
> Berant
>
> On Mon, May 18, 2015 at 1:47 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>
>> You have most likely hit http://tracker.ceph.com/issues/11429. There
>> are some workarounds in the bugs marked as duplicates of that bug, or you
>> can wait for the next hammer point release.
>> -Sam
>>
>> ----- Original Message -----
>> From: "Berant Lemmenes" <berant@xxxxxxxxxxxx>
>> To: ceph-users@xxxxxxxxxxxxxx
>> Sent: Monday, May 18, 2015 10:24:38 AM
>> Subject: OSD unable to start (giant -> hammer)
>>
>> Hello all,
>>
>> I've encountered a problem when upgrading my single-node home cluster
>> from giant to hammer, and I would greatly appreciate any insight.
>>
>> I upgraded the packages as normal, then restarted the mon and, once it
>> came back, restarted the first OSD (osd.3). However, that OSD now fails
>> to start, crashing with the following failed assertion:
>>
>>
>>
>> osd/OSD.h: 716: FAILED assert(ret)
>>
>> ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
>>
>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x7f) [0xb1784f]
>>
>> 2: (OSD::load_pgs()+0x277b) [0x6850fb]
>>
>> 3: (OSD::init()+0x1448) [0x6930b8]
>>
>> 4: (main()+0x26b9) [0x62fd89]
>>
>> 5: (__libc_start_main()+0xed) [0x7f2345bc976d]
>>
>> 6: ceph-osd() [0x635679]
>>
>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
>> to interpret this.
>>
>>
>>
>>
>> --- logging levels ---
>>
>> 0/ 5 none
>>
>> 0/ 1 lockdep
>>
>> 0/ 1 context
>>
>> 1/ 1 crush
>>
>> 1/ 5 mds
>>
>> 1/ 5 mds_balancer
>>
>> 1/ 5 mds_locker
>>
>> 1/ 5 mds_log
>>
>> 1/ 5 mds_log_expire
>>
>> 1/ 5 mds_migrator
>>
>> 0/ 1 buffer
>>
>> 0/ 1 timer
>>
>> 0/ 1 filer
>>
>> 0/ 1 striper
>>
>> 0/ 1 objecter
>>
>> 0/ 5 rados
>>
>> 0/ 5 rbd
>>
>> 0/ 5 rbd_replay
>>
>> 0/ 5 journaler
>>
>> 0/ 5 objectcacher
>>
>> 0/ 5 client
>>
>> 0/ 5 osd
>>
>> 0/ 5 optracker
>>
>> 0/ 5 objclass
>>
>> 1/ 3 filestore
>>
>> 1/ 3 keyvaluestore
>>
>> 1/ 3 journal
>>
>> 0/ 5 ms
>>
>> 1/ 5 mon
>>
>> 0/10 monc
>>
>> 1/ 5 paxos
>>
>> 0/ 5 tp
>>
>> 1/ 5 auth
>>
>> 1/ 5 crypto
>>
>> 1/ 1 finisher
>>
>> 1/ 5 heartbeatmap
>>
>> 1/ 5 perfcounter
>>
>> 1/ 5 rgw
>>
>> 1/10 civetweb
>>
>> 1/ 5 javaclient
>>
>> 1/ 5 asok
>>
>> 1/ 1 throttle
>>
>> 0/ 0 refs
>>
>> 1/ 5 xio
>>
>> -2/-2 (syslog threshold)
>>
>> 99/99 (stderr threshold)
>>
>> max_recent 10000
>>
>> max_new 1000
>>
>> log_file
>>
>> --- end dump of recent events ---
>>
>> terminate called after throwing an instance of 'ceph::FailedAssertion'
>>
>> *** Caught signal (Aborted) **
>>
>> in thread 7f2347f71780
>>
>> ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
>>
>> 1: ceph-osd() [0xa1fe55]
>>
>> 2: (()+0xfcb0) [0x7f2346fb1cb0]
>>
>> 3: (gsignal()+0x35) [0x7f2345bde0d5]
>>
>> 4: (abort()+0x17b) [0x7f2345be183b]
>>
>> 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f234652f69d]
>>
>> 6: (()+0xb5846) [0x7f234652d846]
>>
>> 7: (()+0xb5873) [0x7f234652d873]
>>
>> 8: (()+0xb596e) [0x7f234652d96e]
>>
>> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x259) [0xb17a29]
>>
>> 10: (OSD::load_pgs()+0x277b) [0x6850fb]
>>
>> 11: (OSD::init()+0x1448) [0x6930b8]
>>
>> 12: (main()+0x26b9) [0x62fd89]
>>
>> 13: (__libc_start_main()+0xed) [0x7f2345bc976d]
>>
>> 14: ceph-osd() [0x635679]
>>
>> 2015-05-18 13:02:33.643064 7f2347f71780 -1 *** Caught signal (Aborted) **
>>
>> in thread 7f2347f71780
>>
>>
>>
>>
>> ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
>>
>> 1: ceph-osd() [0xa1fe55]
>>
>> 2: (()+0xfcb0) [0x7f2346fb1cb0]
>>
>> 3: (gsignal()+0x35) [0x7f2345bde0d5]
>>
>> 4: (abort()+0x17b) [0x7f2345be183b]
>>
>> 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f234652f69d]
>>
>> 6: (()+0xb5846) [0x7f234652d846]
>>
>> 7: (()+0xb5873) [0x7f234652d873]
>>
>> 8: (()+0xb596e) [0x7f234652d96e]
>>
>> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x259) [0xb17a29]
>>
>> 10: (OSD::load_pgs()+0x277b) [0x6850fb]
>>
>> 11: (OSD::init()+0x1448) [0x6930b8]
>>
>> 12: (main()+0x26b9) [0x62fd89]
>>
>> 13: (__libc_start_main()+0xed) [0x7f2345bc976d]
>>
>> 14: ceph-osd() [0x635679]
>>
>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
>> to interpret this.
>>
>>
>>
>>
>> --- begin dump of recent events ---
>>
>> 0> 2015-05-18 13:02:33.643064 7f2347f71780 -1 *** Caught signal (Aborted)
>> **
>>
>> in thread 7f2347f71780
>>
>>
>>
>>
>> ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
>>
>> 1: ceph-osd() [0xa1fe55]
>>
>> 2: (()+0xfcb0) [0x7f2346fb1cb0]
>>
>> 3: (gsignal()+0x35) [0x7f2345bde0d5]
>>
>> 4: (abort()+0x17b) [0x7f2345be183b]
>>
>> 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f234652f69d]
>>
>> 6: (()+0xb5846) [0x7f234652d846]
>>
>> 7: (()+0xb5873) [0x7f234652d873]
>>
>> 8: (()+0xb596e) [0x7f234652d96e]
>>
>> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x259) [0xb17a29]
>>
>> 10: (OSD::load_pgs()+0x277b) [0x6850fb]
>>
>> 11: (OSD::init()+0x1448) [0x6930b8]
>>
>> 12: (main()+0x26b9) [0x62fd89]
>>
>> 13: (__libc_start_main()+0xed) [0x7f2345bc976d]
>>
>> 14: ceph-osd() [0x635679]
>>
>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
>> to interpret this.
>>
>>
>>
>>
>> --- logging levels ---
>>
>> 0/ 5 none
>>
>> 0/ 1 lockdep
>>
>> 0/ 1 context
>>
>> 1/ 1 crush
>>
>> 1/ 5 mds
>>
>> 1/ 5 mds_balancer
>>
>> 1/ 5 mds_locker
>>
>> 1/ 5 mds_log
>>
>> 1/ 5 mds_log_expire
>>
>> 1/ 5 mds_migrator
>>
>> 0/ 1 buffer
>>
>> 0/ 1 timer
>>
>> 0/ 1 filer
>>
>> 0/ 1 striper
>>
>> 0/ 1 objecter
>>
>> 0/ 5 rados
>>
>> 0/ 5 rbd
>>
>> 0/ 5 rbd_replay
>>
>> 0/ 5 journaler
>>
>> 0/ 5 objectcacher
>>
>> 0/ 5 client
>>
>> 0/ 5 osd
>>
>> 0/ 5 optracker
>>
>> 0/ 5 objclass
>>
>> 1/ 3 filestore
>>
>> 1/ 3 keyvaluestore
>>
>> 1/ 3 journal
>>
>> 0/ 5 ms
>>
>> 1/ 5 mon
>>
>> 0/10 monc
>>
>> 1/ 5 paxos
>>
>> 0/ 5 tp
>>
>> 1/ 5 auth
>>
>> 1/ 5 crypto
>>
>> 1/ 1 finisher
>>
>> 1/ 5 heartbeatmap
>>
>> 1/ 5 perfcounter
>>
>> 1/ 5 rgw
>>
>> 1/10 civetweb
>>
>> 1/ 5 javaclient
>>
>> 1/ 5 asok
>>
>> 1/ 1 throttle
>>
>> 0/ 0 refs
>>
>> 1/ 5 xio
>>
>> -2/-2 (syslog threshold)
>>
>> 99/99 (stderr threshold)
>>
>> max_recent 10000
>>
>> max_new 1000
>>
>> log_file
>>
>> --- end dump of recent events ---
>>
>>
>> I've included a 'ceph osd dump' here:
>> http://pastebin.com/RKbaY7nv
>>
>> ceph osd tree:
>>
>>
>> ceph osd tree
>>
>> ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
>>
>> -1 24.14000 root default
>>
>> -3 0 rack unknownrack
>>
>> -2 0 host ceph-test
>>
>> -4 24.14000 host ceph01
>>
>> 0 1.50000 osd.0 down 0 1.00000
>>
>> 2 1.50000 osd.2 down 0 1.00000
>>
>> 3 1.50000 osd.3 down 1.00000 1.00000
>>
>> 5 2.00000 osd.5 up 1.00000 1.00000
>>
>> 6 2.00000 osd.6 up 1.00000 1.00000
>>
>> 7 2.00000 osd.7 up 1.00000 1.00000
>>
>> 8 2.00000 osd.8 up 1.00000 1.00000
>>
>> 9 2.00000 osd.9 up 1.00000 1.00000
>>
>> 10 2.00000 osd.10 up 1.00000 1.00000
>>
>> 4 4.00000 osd.4 up 1.00000 1.00000
>>
>> 1 3.64000 osd.1 up 1.00000 1.00000
>>
>>
>>
>>
>> Note that osd.0 and osd.2 were down prior to the upgrade and the cluster
>> was healthy (these are failed disks that have been out for some time,
>> just not yet removed from CRUSH).
>>
>> I've also included a log with OSD debugging set to 20 here:
>>
>> https://dl.dropboxusercontent.com/u/1043493/osd.3.log.gz
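>>
>> (For anyone reproducing: OSD debugging at that level can be turned on
>> either with "debug osd = 20/20" in ceph.conf or with a one-off
>> command-line override, something like the following -- standard Ceph
>> config-override syntax, so treat the exact spelling as an assumption:
>>
>> ceph-osd -i 3 -f --debug-osd 20/20
>> )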
>>
>>
>> Looking through that file, it appears the last PG it loads successfully
>> is 2.3f6; it then moves on to 5.0:
>>
>> -3> 2015-05-18 12:25:24.292091 7f6f407f9780 10 osd.3 39533 load_pgs
>> loaded pg[2.3f6( v 39533'289849 (37945'286848,39533'289849] local-les=39532
>> n=99 ec=1 les/c 39532/39532 39531/39531/39523) [5,4,3] r=2 lpr=39533
>> pi=34961-39530/34 crt=39533'289846 lcod 0'0 inactive NOTIFY]
>> log((37945'286848,39533'289849], crt=39533'289846)
>>
>> -2> 2015-05-18 12:25:24.292100 7f6f407f9780 10 osd.3 39533 pgid 5.0 coll
>> 5.0_head
>>
>> -1> 2015-05-18 12:25:24.570188 7f6f407f9780 20 osd.3 0 get_map 34144 -
>> loading and decoding 0x411fd80
>>
>> 0> 2015-05-18 12:26:02.758914 7f6f407f9780 -1 osd/OSD.h: In function
>> 'OSDMapRef OSDService::get_map(epoch_t)' thread 7f6f407f9780 time
>> 2015-05-18 12:25:24.620468
>>
>>
>>
>> osd/OSD.h: 716: FAILED assert(ret)
>>
>> [snip]
>>
>> Yet I don't see 5.0 in a pg dump.
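>>
>> (For anyone wanting to double-check: a grep along these lines comes back
>> empty for pool 5 on this cluster --
>>
>> ceph pg dump | grep '^5\.'
>>
>> -- so the 5.0 collection seems to exist only on disk.)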
>>
>>
>>
>>
>> Thanks in advance!
>>
>> Berant
>>
>>
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>