Re: OSD unable to start (giant -> hammer)

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



If 2.14 is part of a non-existent pool, you should be able to rename it out of current/ in the osd directory to prevent the osd from seeing it on startup.
-Sam

----- Original Message -----
From: "Berant Lemmenes" <berant@xxxxxxxxxxxx>
To: "Samuel Just" <sjust@xxxxxxxxxx>
Cc: ceph-users@xxxxxxxxxxxxxx
Sent: Tuesday, May 19, 2015 12:58:30 PM
Subject: Re:  OSD unable to start (giant -> hammer)

Hello,

So here are the steps I performed and where I sit now.

Step 1) Using 'ceph-objectstore-tool list' to create a list of all PGs not
associated with the 3 pools (rbd, data, metadata) that are actually in use
on this cluster.

Step 2) I then did a 'ceph-objectstore-tool remove' of those PGs

Then when starting the OSD it would complain about PGs that were NOT in the
list of 'ceph-objectstore-tool list' but WERE present on the filesystem of
the OSD in question.

Step 3) Iterating over all of the PGs that were on disk and using
'ceph-objectstore-tool info' I made a list of all PGs that returned ENOENT,

Step 4) 'ceph-objectstore-tool remove' to remove all those as well.

Now when starting osd.3 I get an "unable to load metadata' error for a PG
that according to 'ceph pg 2.14 query' is not present (and shouldn't be) on
osd.3. Shown below with OSD debugging at 20:

<snip>

   -23> 2015-05-19 15:15:12.712036 7fb079a20780 20 read_log 39533'174051
(39533'174050) modify   49277412/rb.0.100f.2ae8944a.000000029945/head//2 by
client.18119.0:2811937 2015-05-18 07:18:42.859501

   -22> 2015-05-19 15:15:12.712066 7fb079a20780 20 read_log 39533'174052
(39533'174051) modify   49277412/rb.0.100f.2ae8944a.000000029945/head//2 by
client.18119.0:2812374 2015-05-18 07:33:21.973157

   -21> 2015-05-19 15:15:12.712096 7fb079a20780 20 read_log 39533'174053
(39533'174052) modify   49277412/rb.0.100f.2ae8944a.000000029945/head//2 by
client.18119.0:2812861 2015-05-18 07:48:23.098343

   -20> 2015-05-19 15:15:12.712127 7fb079a20780 20 read_log 39533'174054
(39533'174053) modify   49277412/rb.0.100f.2ae8944a.000000029945/head//2 by
client.18119.0:2813371 2015-05-18 08:03:54.226512

   -19> 2015-05-19 15:15:12.712157 7fb079a20780 20 read_log 39533'174055
(39533'174054) modify   49277412/rb.0.100f.2ae8944a.000000029945/head//2 by
client.18119.0:2813922 2015-05-18 08:18:20.351421

   -18> 2015-05-19 15:15:12.712187 7fb079a20780 20 read_log 39533'174056
(39533'174055) modify   49277412/rb.0.100f.2ae8944a.000000029945/head//2 by
client.18119.0:2814396 2015-05-18 08:33:56.476035

   -17> 2015-05-19 15:15:12.712221 7fb079a20780 20 read_log 39533'174057
(39533'174056) modify   49277412/rb.0.100f.2ae8944a.000000029945/head//2 by
client.18119.0:2814971 2015-05-18 08:48:22.605674

   -16> 2015-05-19 15:15:12.712252 7fb079a20780 20 read_log 39533'174058
(39533'174057) modify   49277412/rb.0.100f.2ae8944a.000000029945/head//2 by
client.18119.0:2815407 2015-05-18 09:02:48.720181

   -15> 2015-05-19 15:15:12.712282 7fb079a20780 20 read_log 39533'174059
(39533'174058) modify   49277412/rb.0.100f.2ae8944a.000000029945/head//2 by
client.18119.0:2815434 2015-05-18 09:03:43.727839

   -14> 2015-05-19 15:15:12.712312 7fb079a20780 20 read_log 39533'174060
(39533'174059) modify   49277412/rb.0.100f.2ae8944a.000000029945/head//2 by
client.18119.0:2815889 2015-05-18 09:17:49.846406

   -13> 2015-05-19 15:15:12.712342 7fb079a20780 20 read_log 39533'174061
(39533'174060) modify   49277412/rb.0.100f.2ae8944a.000000029945/head//2 by
client.18119.0:2816358 2015-05-18 09:32:50.969457

   -12> 2015-05-19 15:15:12.712372 7fb079a20780 20 read_log 39533'174062
(39533'174061) modify   49277412/rb.0.100f.2ae8944a.000000029945/head//2 by
client.18119.0:2816840 2015-05-18 09:47:52.091524

   -11> 2015-05-19 15:15:12.712403 7fb079a20780 20 read_log 39533'174063
(39533'174062) modify   49277412/rb.0.100f.2ae8944a.000000029945/head//2 by
client.18119.0:2816861 2015-05-18 09:48:22.096309

   -10> 2015-05-19 15:15:12.712433 7fb079a20780 20 read_log 39533'174064
(39533'174063) modify   49277412/rb.0.100f.2ae8944a.000000029945/head//2 by
client.18119.0:2817714 2015-05-18 10:02:53.222749

    -9> 2015-05-19 15:15:12.713130 7fb079a20780 10 read_log done

    -8> 2015-05-19 15:15:12.713550 7fb079a20780 10 osd.3 pg_epoch: 39533
pg[2.12( v 39533'174064 (37945'171063,39533'174064] local-les=39529 n=101
ec=1 les/c 39529/39529 39526/39526/39526) [9,3,10] r=1 lpr=0
pi=37959-39525/7 crt=39533'174062 lcod 0'0 inactive] handle_loaded

    -7> 2015-05-19 15:15:12.713570 7fb079a20780  5 osd.3 pg_epoch: 39533
pg[2.12( v 39533'174064 (37945'171063,39533'174064] local-les=39529 n=101
ec=1 les/c 39529/39529 39526/39526/39526) [9,3,10] r=1 lpr=0
pi=37959-39525/7 crt=39533'174062 lcod 0'0 inactive NOTIFY] exit Initial
0.097986 0 0.000000

    -6> 2015-05-19 15:15:12.713587 7fb079a20780  5 osd.3 pg_epoch: 39533
pg[2.12( v 39533'174064 (37945'171063,39533'174064] local-les=39529 n=101
ec=1 les/c 39529/39529 39526/39526/39526) [9,3,10] r=1 lpr=0
pi=37959-39525/7 crt=39533'174062 lcod 0'0 inactive NOTIFY] enter Reset

    -5> 2015-05-19 15:15:12.713601 7fb079a20780 20 osd.3 pg_epoch: 39533
pg[2.12( v 39533'174064 (37945'171063,39533'174064] local-les=39529 n=101
ec=1 les/c 39529/39529 39526/39526/39526) [9,3,10] r=1 lpr=0
pi=37959-39525/7 crt=39533'174062 lcod 0'0 inactive NOTIFY]
set_last_peering_reset 39533

    -4> 2015-05-19 15:15:12.713614 7fb079a20780 10 osd.3 pg_epoch: 39533
pg[2.12( v 39533'174064 (37945'171063,39533'174064] local-les=39529 n=101
ec=1 les/c 39529/39529 39526/39526/39526) [9,3,10] r=1 lpr=39533
pi=37959-39525/7 crt=39533'174062 lcod 0'0 inactive NOTIFY] Clearing
blocked outgoing recovery messages

    -3> 2015-05-19 15:15:12.713629 7fb079a20780 10 osd.3 pg_epoch: 39533
pg[2.12( v 39533'174064 (37945'171063,39533'174064] local-les=39529 n=101
ec=1 les/c 39529/39529 39526/39526/39526) [9,3,10] r=1 lpr=39533
pi=37959-39525/7 crt=39533'174062 lcod 0'0 inactive NOTIFY] Not blocking
outgoing recovery messages

    -2> 2015-05-19 15:15:12.713643 7fb079a20780 10 osd.3 39533 load_pgs
loaded pg[2.12( v 39533'174064 (37945'171063,39533'174064] local-les=39529
n=101 ec=1 les/c 39529/39529 39526/39526/39526) [9,3,10] r=1 lpr=39533
pi=37959-39525/7 crt=39533'174062 lcod 0'0 inactive NOTIFY]
log((37945'171063,39533'174064], crt=39533'174062)

    -1> 2015-05-19 15:15:12.713658 7fb079a20780 10 osd.3 39533 pgid 2.14
coll 2.14_head

     0> 2015-05-19 15:15:12.716475 7fb079a20780 -1 osd/PG.cc: In function
'static epoch_t PG::peek_map_epoch(ObjectStore*, spg_t, ceph::bufferlist*)'
thread 7fb079a20780 time 2015-05-19 15:15:12.715425

osd/PG.cc: 2860: FAILED assert(0 == "unable to open pg metadata")


 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)

 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x7f) [0xb1784f]

 2: (PG::peek_map_epoch(ObjectStore*, spg_t, ceph::buffer::list*)+0xb28)
[0x793dd8]

 3: (OSD::load_pgs()+0x147f) [0x683dff]

 4: (OSD::init()+0x1448) [0x6930b8]

 5: (main()+0x26b9) [0x62fd89]

 6: (__libc_start_main()+0xed) [0x7fb07767876d]

 7: ceph-osd() [0x635679]

 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
to interpret this.


--- logging levels ---

   0/ 5 none

   0/ 1 lockdep

   0/ 1 context

   1/ 1 crush

   1/ 5 mds

   1/ 5 mds_balancer

   1/ 5 mds_locker

   1/ 5 mds_log

   1/ 5 mds_log_expire

   1/ 5 mds_migrator

   0/ 1 buffer

   0/ 1 timer

   0/ 1 filer

   0/ 1 striper

   0/ 1 objecter

   0/ 5 rados

   0/ 5 rbd

   0/ 5 rbd_replay

   0/ 5 journaler

   0/ 5 objectcacher

   0/ 5 client

  20/20 osd

   0/ 5 optracker

   0/ 5 objclass

   1/ 3 filestore

   1/ 3 keyvaluestore

   1/ 3 journal

   0/ 5 ms

   1/ 5 mon

   0/10 monc

   1/ 5 paxos

   0/ 5 tp

   1/ 5 auth

   1/ 5 crypto

   1/ 1 finisher

   1/ 5 heartbeatmap

   1/ 5 perfcounter

   1/ 5 rgw

   1/10 civetweb

   1/ 5 javaclient

   1/ 5 asok

   1/ 1 throttle

   0/ 0 refs

   1/ 5 xio

  -2/-2 (syslog threshold)

  99/99 (stderr threshold)

  max_recent     10000

  max_new         1000

  log_file

--- end dump of recent events ---

terminate called after throwing an instance of 'ceph::FailedAssertion'

*** Caught signal (Aborted) **

 in thread 7fb079a20780

 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)

 1: ceph-osd() [0xa1fe55]

 2: (()+0xfcb0) [0x7fb078a60cb0]

 3: (gsignal()+0x35) [0x7fb07768d0d5]

 4: (abort()+0x17b) [0x7fb07769083b]

 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fb077fde69d]

 6: (()+0xb5846) [0x7fb077fdc846]

 7: (()+0xb5873) [0x7fb077fdc873]

 8: (()+0xb596e) [0x7fb077fdc96e]

 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x259) [0xb17a29]

 10: (PG::peek_map_epoch(ObjectStore*, spg_t, ceph::buffer::list*)+0xb28)
[0x793dd8]

 11: (OSD::load_pgs()+0x147f) [0x683dff]

 12: (OSD::init()+0x1448) [0x6930b8]

 13: (main()+0x26b9) [0x62fd89]

 14: (__libc_start_main()+0xed) [0x7fb07767876d]

 15: ceph-osd() [0x635679]

2015-05-19 15:15:12.812704 7fb079a20780 -1 *** Caught signal (Aborted) **

 in thread 7fb079a20780


 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)

 1: ceph-osd() [0xa1fe55]

 2: (()+0xfcb0) [0x7fb078a60cb0]

 3: (gsignal()+0x35) [0x7fb07768d0d5]

 4: (abort()+0x17b) [0x7fb07769083b]

 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fb077fde69d]

 6: (()+0xb5846) [0x7fb077fdc846]

 7: (()+0xb5873) [0x7fb077fdc873]

 8: (()+0xb596e) [0x7fb077fdc96e]

 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x259) [0xb17a29]

 10: (PG::peek_map_epoch(ObjectStore*, spg_t, ceph::buffer::list*)+0xb28)
[0x793dd8]

 11: (OSD::load_pgs()+0x147f) [0x683dff]

 12: (OSD::init()+0x1448) [0x6930b8]

 13: (main()+0x26b9) [0x62fd89]

 14: (__libc_start_main()+0xed) [0x7fb07767876d]

 15: ceph-osd() [0x635679]

 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
to interpret this.


--- begin dump of recent events ---

     0> 2015-05-19 15:15:12.812704 7fb079a20780 -1 *** Caught signal
(Aborted) **

 in thread 7fb079a20780


 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)

 1: ceph-osd() [0xa1fe55]

 2: (()+0xfcb0) [0x7fb078a60cb0]

 3: (gsignal()+0x35) [0x7fb07768d0d5]

 4: (abort()+0x17b) [0x7fb07769083b]

 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fb077fde69d]

 6: (()+0xb5846) [0x7fb077fdc846]

 7: (()+0xb5873) [0x7fb077fdc873]

 8: (()+0xb596e) [0x7fb077fdc96e]

 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x259) [0xb17a29]

 10: (PG::peek_map_epoch(ObjectStore*, spg_t, ceph::buffer::list*)+0xb28)
[0x793dd8]

 11: (OSD::load_pgs()+0x147f) [0x683dff]

 12: (OSD::init()+0x1448) [0x6930b8]

 13: (main()+0x26b9) [0x62fd89]

 14: (__libc_start_main()+0xed) [0x7fb07767876d]

 15: ceph-osd() [0x635679]

 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
to interpret this.


--- logging levels ---

   0/ 5 none

   0/ 1 lockdep

   0/ 1 context

   1/ 1 crush

   1/ 5 mds

   1/ 5 mds_balancer

   1/ 5 mds_locker

   1/ 5 mds_log

   1/ 5 mds_log_expire

   1/ 5 mds_migrator

   0/ 1 buffer

   0/ 1 timer

   0/ 1 filer

   0/ 1 striper

   0/ 1 objecter

   0/ 5 rados

   0/ 5 rbd

   0/ 5 rbd_replay

   0/ 5 journaler

   0/ 5 objectcacher

   0/ 5 client

  20/20 osd

   0/ 5 optracker

   0/ 5 objclass

   1/ 3 filestore

   1/ 3 keyvaluestore

   1/ 3 journal

   0/ 5 ms

   1/ 5 mon

   0/10 monc

   1/ 5 paxos

   0/ 5 tp

   1/ 5 auth

   1/ 5 crypto

   1/ 1 finisher

   1/ 5 heartbeatmap

   1/ 5 perfcounter

   1/ 5 rgw

   1/10 civetweb

   1/ 5 javaclient

   1/ 5 asok

   1/ 1 throttle

   0/ 0 refs

   1/ 5 xio

  -2/-2 (syslog threshold)

  99/99 (stderr threshold)

  max_recent     10000

  max_new         1000

  log_file

--- end dump of recent events ---


Here is the PG info for 2.14

ceph pg 2.14 query

{ "state": "active+undersized+degraded",

  "snap_trimq": "[]",

  "epoch": 39556,

  "up": [

        8,

        7],

  "acting": [

        8,

        7],

  "actingbackfill": [

        "7",

        "8"],

  "info": { "pgid": "2.14",

      "last_update": "39533'175859",

      "last_complete": "39533'175859",

      "log_tail": "36964'172858",

      "last_user_version": 175859,

      "last_backfill": "MAX",

      "purged_snaps": "[]",

      "history": { "epoch_created": 1,

          "last_epoch_started": 39536,

          "last_epoch_clean": 39536,

          "last_epoch_split": 0,

          "same_up_since": 39534,

          "same_interval_since": 39534,

          "same_primary_since": 39527,

          "last_scrub": "39533'175859",

          "last_scrub_stamp": "2015-05-18 05:23:02.952523",

          "last_deep_scrub": "39533'175859",

          "last_deep_scrub_stamp": "2015-05-18 05:23:02.952523",

          "last_clean_scrub_stamp": "2015-05-18 05:23:02.952523"},

      "stats": { "version": "39533'175859",

          "reported_seq": "281883",

          "reported_epoch": "39556",

          "state": "active+undersized+degraded",

          "last_fresh": "2015-05-19 06:41:09.002111",

          "last_change": "2015-05-18 10:19:22.277851",

          "last_active": "2015-05-19 06:41:09.002111",

          "last_clean": "2015-05-18 06:41:38.906417",

          "last_became_active": "2013-05-07 04:23:31.972742",

          "last_unstale": "2015-05-19 06:41:09.002111",

          "last_undegraded": "2015-05-18 10:18:37.449550",

          "last_fullsized": "2015-05-18 10:18:37.449550",

          "mapping_epoch": 39527,

          "log_start": "36964'172858",

          "ondisk_log_start": "36964'172858",

          "created": 1,

          "last_epoch_clean": 39536,

          "parent": "0.0",

          "parent_split_bits": 0,

          "last_scrub": "39533'175859",

          "last_scrub_stamp": "2015-05-18 05:23:02.952523",

          "last_deep_scrub": "39533'175859",

          "last_deep_scrub_stamp": "2015-05-18 05:23:02.952523",

          "last_clean_scrub_stamp": "2015-05-18 05:23:02.952523",

          "log_size": 3001,

          "ondisk_log_size": 3001,

          "stats_invalid": "0",

          "stat_sum": { "num_bytes": 441982976,

              "num_objects": 106,

              "num_object_clones": 0,

              "num_object_copies": 318,

              "num_objects_missing_on_primary": 0,

              "num_objects_degraded": 106,

              "num_objects_misplaced": 0,

              "num_objects_unfound": 0,

              "num_objects_dirty": 11,

              "num_whiteouts": 0,

              "num_read": 61399,

              "num_read_kb": 1285319,

              "num_write": 135192,

              "num_write_kb": 2422029,

              "num_scrub_errors": 0,

              "num_shallow_scrub_errors": 0,

              "num_deep_scrub_errors": 0,

              "num_objects_recovered": 79,

              "num_bytes_recovered": 329883648,

              "num_keys_recovered": 0,

              "num_objects_omap": 0,

              "num_objects_hit_set_archive": 0,

              "num_bytes_hit_set_archive": 0},

          "stat_cat_sum": {},

          "up": [

                8,

                7],

          "acting": [

                8,

                7],

          "blocked_by": [],

          "up_primary": 8,

          "acting_primary": 8},

      "empty": 0,

      "dne": 0,

      "incomplete": 0,

      "last_epoch_started": 39536,

      "hit_set_history": { "current_last_update": "0'0",

          "current_last_stamp": "0.000000",

          "current_info": { "begin": "0.000000",

              "end": "0.000000",

              "version": "0'0"},

          "history": []}},

  "peer_info": [

        { "peer": "7",

          "pgid": "2.14",

          "last_update": "39533'175859",

          "last_complete": "39533'175859",

          "log_tail": "36964'172858",

          "last_user_version": 175859,

          "last_backfill": "MAX",

          "purged_snaps": "[]",

          "history": { "epoch_created": 1,

              "last_epoch_started": 39536,

              "last_epoch_clean": 39536,

              "last_epoch_split": 0,

              "same_up_since": 39534,

              "same_interval_since": 39534,

              "same_primary_since": 39527,

              "last_scrub": "39533'175859",

              "last_scrub_stamp": "2015-05-18 05:23:02.952523",

              "last_deep_scrub": "39533'175859",

              "last_deep_scrub_stamp": "2015-05-18 05:23:02.952523",

              "last_clean_scrub_stamp": "2015-05-18 05:23:02.952523"},

          "stats": { "version": "39533'175858",

              "reported_seq": "281598",

              "reported_epoch": "39533",

              "state": "active+clean",

              "last_fresh": "2015-05-13 21:58:43.553887",

              "last_change": "2015-05-12 22:50:16.011917",

              "last_active": "2015-05-13 21:58:43.553887",

              "last_clean": "2015-05-13 21:58:43.553887",

              "last_became_active": "2013-05-07 04:23:31.972742",

              "last_unstale": "2015-05-13 21:58:43.553887",

              "last_undegraded": "2015-05-13 21:58:43.553887",

              "last_fullsized": "2015-05-13 21:58:43.553887",

              "mapping_epoch": 39527,

              "log_start": "36964'172857",

              "ondisk_log_start": "36964'172857",

              "created": 1,

              "last_epoch_clean": 39529,

              "parent": "0.0",

              "parent_split_bits": 0,

              "last_scrub": "39533'175857",

              "last_scrub_stamp": "2015-05-12 22:50:16.011867",

              "last_deep_scrub": "39533'175856",

              "last_deep_scrub_stamp": "2015-05-10 10:30:24.933431",

              "last_clean_scrub_stamp": "2015-05-12 22:50:16.011867",

              "log_size": 3001,

              "ondisk_log_size": 3001,

              "stats_invalid": "0",

              "stat_sum": { "num_bytes": 441982976,

                  "num_objects": 106,

                  "num_object_clones": 0,

                  "num_object_copies": 315,

                  "num_objects_missing_on_primary": 0,

                  "num_objects_degraded": 0,

                  "num_objects_misplaced": 0,

                  "num_objects_unfound": 0,

                  "num_objects_dirty": 11,

                  "num_whiteouts": 0,

                  "num_read": 61157,

                  "num_read_kb": 1281187,

                  "num_write": 135192,

                  "num_write_kb": 2422029,

                  "num_scrub_errors": 0,

                  "num_shallow_scrub_errors": 0,

                  "num_deep_scrub_errors": 0,

                  "num_objects_recovered": 79,

                  "num_bytes_recovered": 329883648,

                  "num_keys_recovered": 0,

                  "num_objects_omap": 0,

                  "num_objects_hit_set_archive": 0,

                  "num_bytes_hit_set_archive": 0},

              "stat_cat_sum": {},

              "up": [

                    8,

                    7],

              "acting": [

                    8,

                    7],

              "blocked_by": [],

              "up_primary": 8,

              "acting_primary": 8},

          "empty": 0,

          "dne": 0,

          "incomplete": 0,

          "last_epoch_started": 39536,

          "hit_set_history": { "current_last_update": "0'0",

              "current_last_stamp": "0.000000",

              "current_info": { "begin": "0.000000",

                  "end": "0.000000",

                  "version": "0'0"},

              "history": []}}],

  "recovery_state": [

        { "name": "Started\/Primary\/Active",

          "enter_time": "2015-05-18 10:18:37.449561",

          "might_have_unfound": [],

          "recovery_progress": { "backfill_targets": [],

              "waiting_on_backfill": [],

              "last_backfill_started": "0\/\/0\/\/-1",

              "backfill_info": { "begin": "0\/\/0\/\/-1",

                  "end": "0\/\/0\/\/-1",

                  "objects": []},

              "peer_backfill_info": [],

              "backfills_in_flight": [],

              "recovering": [],

              "pg_backend": { "pull_from_peer": [],

                  "pushing": []}},

          "scrub": { "scrubber.epoch_start": "39527",

              "scrubber.active": 0,

              "scrubber.block_writes": 0,

              "scrubber.waiting_on": 0,

              "scrubber.waiting_on_whom": []}},

        { "name": "Started",

          "enter_time": "2015-05-18 10:18:05.335040"}],

  "agent_state": {}}

On Mon, May 18, 2015 at 2:34 PM, Berant Lemmenes <berant@xxxxxxxxxxxx>
wrote:

> Sam,
>
> Thanks for taking a look. It does seem to fit my issue. Would just
> removing the 5.0_head directory be appropriate or would using
> ceph-objectstore-tool be better?
>
> Thanks,
> Berant
>
> On Mon, May 18, 2015 at 1:47 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>
>> You have most likely hit http://tracker.ceph.com/issues/11429.  There
>> are some workarounds in the bugs marked as duplicates of that bug, or you
>> can wait for the next hammer point release.
>> -Sam
>>
>> ----- Original Message -----
>> From: "Berant Lemmenes" <berant@xxxxxxxxxxxx>
>> To: ceph-users@xxxxxxxxxxxxxx
>> Sent: Monday, May 18, 2015 10:24:38 AM
>> Subject:  OSD unable to start (giant -> hammer)
>>
>> Hello all,
>>
>> I've encountered a problem when upgrading my single node home cluster
>> from giant to hammer, and I would greatly appreciate any insight.
>>
>> I upgraded the packages like normal, then proceeded to restart the mon
>> and once that came back restarted the first OSD (osd.3). However it
>> subsequently won't start and crashes with the following failed assertion:
>>
>>
>>
>> osd/OSD.h: 716: FAILED assert(ret)
>>
>> ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
>>
>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x7f) [0xb1784f]
>>
>> 2: (OSD::load_pgs()+0x277b) [0x6850fb]
>>
>> 3: (OSD::init()+0x1448) [0x6930b8]
>>
>> 4: (main()+0x26b9) [0x62fd89]
>>
>> 5: (__libc_start_main()+0xed) [0x7f2345bc976d]
>>
>> 6: ceph-osd() [0x635679]
>>
>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
>> to interpret this.
>>
>>
>>
>>
>> --- logging levels ---
>>
>> 0/ 5 none
>>
>> 0/ 1 lockdep
>>
>> 0/ 1 context
>>
>> 1/ 1 crush
>>
>> 1/ 5 mds
>>
>> 1/ 5 mds_balancer
>>
>> 1/ 5 mds_locker
>>
>> 1/ 5 mds_log
>>
>> 1/ 5 mds_log_expire
>>
>> 1/ 5 mds_migrator
>>
>> 0/ 1 buffer
>>
>> 0/ 1 timer
>>
>> 0/ 1 filer
>>
>> 0/ 1 striper
>>
>> 0/ 1 objecter
>>
>> 0/ 5 rados
>>
>> 0/ 5 rbd
>>
>> 0/ 5 rbd_replay
>>
>> 0/ 5 journaler
>>
>> 0/ 5 objectcacher
>>
>> 0/ 5 client
>>
>> 0/ 5 osd
>>
>> 0/ 5 optracker
>>
>> 0/ 5 objclass
>>
>> 1/ 3 filestore
>>
>> 1/ 3 keyvaluestore
>>
>> 1/ 3 journal
>>
>> 0/ 5 ms
>>
>> 1/ 5 mon
>>
>> 0/10 monc
>>
>> 1/ 5 paxos
>>
>> 0/ 5 tp
>>
>> 1/ 5 auth
>>
>> 1/ 5 crypto
>>
>> 1/ 1 finisher
>>
>> 1/ 5 heartbeatmap
>>
>> 1/ 5 perfcounter
>>
>> 1/ 5 rgw
>>
>> 1/10 civetweb
>>
>> 1/ 5 javaclient
>>
>> 1/ 5 asok
>>
>> 1/ 1 throttle
>>
>> 0/ 0 refs
>>
>> 1/ 5 xio
>>
>> -2/-2 (syslog threshold)
>>
>> 99/99 (stderr threshold)
>>
>> max_recent 10000
>>
>> max_new 1000
>>
>> log_file
>>
>> --- end dump of recent events ---
>>
>> terminate called after throwing an instance of 'ceph::FailedAssertion'
>>
>> *** Caught signal (Aborted) **
>>
>> in thread 7f2347f71780
>>
>> ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
>>
>> 1: ceph-osd() [0xa1fe55]
>>
>> 2: (()+0xfcb0) [0x7f2346fb1cb0]
>>
>> 3: (gsignal()+0x35) [0x7f2345bde0d5]
>>
>> 4: (abort()+0x17b) [0x7f2345be183b]
>>
>> 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f234652f69d]
>>
>> 6: (()+0xb5846) [0x7f234652d846]
>>
>> 7: (()+0xb5873) [0x7f234652d873]
>>
>> 8: (()+0xb596e) [0x7f234652d96e]
>>
>> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x259) [0xb17a29]
>>
>> 10: (OSD::load_pgs()+0x277b) [0x6850fb]
>>
>> 11: (OSD::init()+0x1448) [0x6930b8]
>>
>> 12: (main()+0x26b9) [0x62fd89]
>>
>> 13: (__libc_start_main()+0xed) [0x7f2345bc976d]
>>
>> 14: ceph-osd() [0x635679]
>>
>> 2015-05-18 13:02:33.643064 7f2347f71780 -1 *** Caught signal (Aborted) **
>>
>> in thread 7f2347f71780
>>
>>
>>
>>
>> ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
>>
>> 1: ceph-osd() [0xa1fe55]
>>
>> 2: (()+0xfcb0) [0x7f2346fb1cb0]
>>
>> 3: (gsignal()+0x35) [0x7f2345bde0d5]
>>
>> 4: (abort()+0x17b) [0x7f2345be183b]
>>
>> 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f234652f69d]
>>
>> 6: (()+0xb5846) [0x7f234652d846]
>>
>> 7: (()+0xb5873) [0x7f234652d873]
>>
>> 8: (()+0xb596e) [0x7f234652d96e]
>>
>> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x259) [0xb17a29]
>>
>> 10: (OSD::load_pgs()+0x277b) [0x6850fb]
>>
>> 11: (OSD::init()+0x1448) [0x6930b8]
>>
>> 12: (main()+0x26b9) [0x62fd89]
>>
>> 13: (__libc_start_main()+0xed) [0x7f2345bc976d]
>>
>> 14: ceph-osd() [0x635679]
>>
>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
>> to interpret this.
>>
>>
>>
>>
>> --- begin dump of recent events ---
>>
>> 0> 2015-05-18 13:02:33.643064 7f2347f71780 -1 *** Caught signal (Aborted)
>> **
>>
>> in thread 7f2347f71780
>>
>>
>>
>>
>> ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
>>
>> 1: ceph-osd() [0xa1fe55]
>>
>> 2: (()+0xfcb0) [0x7f2346fb1cb0]
>>
>> 3: (gsignal()+0x35) [0x7f2345bde0d5]
>>
>> 4: (abort()+0x17b) [0x7f2345be183b]
>>
>> 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f234652f69d]
>>
>> 6: (()+0xb5846) [0x7f234652d846]
>>
>> 7: (()+0xb5873) [0x7f234652d873]
>>
>> 8: (()+0xb596e) [0x7f234652d96e]
>>
>> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x259) [0xb17a29]
>>
>> 10: (OSD::load_pgs()+0x277b) [0x6850fb]
>>
>> 11: (OSD::init()+0x1448) [0x6930b8]
>>
>> 12: (main()+0x26b9) [0x62fd89]
>>
>> 13: (__libc_start_main()+0xed) [0x7f2345bc976d]
>>
>> 14: ceph-osd() [0x635679]
>>
>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
>> to interpret this.
>>
>>
>>
>>
>> --- logging levels ---
>>
>> 0/ 5 none
>>
>> 0/ 1 lockdep
>>
>> 0/ 1 context
>>
>> 1/ 1 crush
>>
>> 1/ 5 mds
>>
>> 1/ 5 mds_balancer
>>
>> 1/ 5 mds_locker
>>
>> 1/ 5 mds_log
>>
>> 1/ 5 mds_log_expire
>>
>> 1/ 5 mds_migrator
>>
>> 0/ 1 buffer
>>
>> 0/ 1 timer
>>
>> 0/ 1 filer
>>
>> 0/ 1 striper
>>
>> 0/ 1 objecter
>>
>> 0/ 5 rados
>>
>> 0/ 5 rbd
>>
>> 0/ 5 rbd_replay
>>
>> 0/ 5 journaler
>>
>> 0/ 5 objectcacher
>>
>> 0/ 5 client
>>
>> 0/ 5 osd
>>
>> 0/ 5 optracker
>>
>> 0/ 5 objclass
>>
>> 1/ 3 filestore
>>
>> 1/ 3 keyvaluestore
>>
>> 1/ 3 journal
>>
>> 0/ 5 ms
>>
>> 1/ 5 mon
>>
>> 0/10 monc
>>
>> 1/ 5 paxos
>>
>> 0/ 5 tp
>>
>> 1/ 5 auth
>>
>> 1/ 5 crypto
>>
>> 1/ 1 finisher
>>
>> 1/ 5 heartbeatmap
>>
>> 1/ 5 perfcounter
>>
>> 1/ 5 rgw
>>
>> 1/10 civetweb
>>
>> 1/ 5 javaclient
>>
>> 1/ 5 asok
>>
>> 1/ 1 throttle
>>
>> 0/ 0 refs
>>
>> 1/ 5 xio
>>
>> -2/-2 (syslog threshold)
>>
>> 99/99 (stderr threshold)
>>
>> max_recent 10000
>>
>> max_new 1000
>>
>> log_file
>>
>> --- end dump of recent events ---
>>
>>
>> I've included a 'ceph osd dump' here:
>> http://pastebin.com/RKbaY7nv
>>
>> ceph osd tree:
>>
>>
>> ceph osd tree
>>
>> ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
>>
>> -1 24.14000 root default
>>
>> -3 0 rack unknownrack
>>
>> -2 0 host ceph-test
>>
>> -4 24.14000 host ceph01
>>
>> 0 1.50000 osd.0 down 0 1.00000
>>
>> 2 1.50000 osd.2 down 0 1.00000
>>
>> 3 1.50000 osd.3 down 1.00000 1.00000
>>
>> 5 2.00000 osd.5 up 1.00000 1.00000
>>
>> 6 2.00000 osd.6 up 1.00000 1.00000
>>
>> 7 2.00000 osd.7 up 1.00000 1.00000
>>
>> 8 2.00000 osd.8 up 1.00000 1.00000
>>
>> 9 2.00000 osd.9 up 1.00000 1.00000
>>
>> 10 2.00000 osd.10 up 1.00000 1.00000
>>
>> 4 4.00000 osd.4 up 1.00000 1.00000
>>
>> 1 3.64000 osd.1 up 1.00000 1.00000
>>
>>
>>
>>
>> Note that osd.0 and osd.2 were down prior to the upgrade and the cluster
>> was healthy (these are failed disks that have been out for some time just
>> not removed from CRUSH.
>>
>> I've also included a log with OSD debugging set to 20 here:
>>
>> https://dl.dropboxusercontent.com/u/1043493/osd.3.log.gz
>>
>>
>> Looking through that file, it appears the last pg that it loads
>> successfully is 2.3f6 then it moves to 5.0
>>
>> -3> 2015-05-18 12:25:24.292091 7f6f407f9780 10 osd.3 39533 load_pgs
>> loaded pg[2.3f6( v 39533'289849 (37945'286848,39533'289849] local-les=39532
>> n=99 ec=1 les/c 39532/39532 39531/39531/39523) [5,4,3] r=2 lpr=39533
>> pi=34961-39530/34 crt=39533'289846 lcod 0'0 inactive NOTIFY]
>> log((37945'286848,39533'289849], crt=39533'289846)
>>
>> -2> 2015-05-18 12:25:24.292100 7f6f407f9780 10 osd.3 39533 pgid 5.0 coll
>> 5.0_head
>>
>> -1> 2015-05-18 12:25:24.570188 7f6f407f9780 20 osd.3 0 get_map 34144 -
>> loading and decoding 0x411fd80
>>
>> 0> 2015-05-18 12:26:02.758914 7f6f407f9780 -1 osd/OSD.h: In function
>> 'OSDMapRef OSDService::get_map(epoch_t)' thread 7f6f407f9780 time
>> 2015-05-18 12:25:24.620468
>>
>>
>>
>> osd/OSD.h: 716: FAILED assert(ret)
>>
>> [snip]
>>
>> Which I don't see 5.0 in a pg dump.
>>
>>
>>
>>
>> Thanks in advance!
>>
>> Berant
>>
>>
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




[Index of Archives]     [Information on CEPH]     [Linux Filesystem Development]     [Ceph Development]     [Ceph Large]     [Ceph Dev]     [Linux USB Development]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [xfs]


  Powered by Linux