Re: osd fails to start, rbd hangs

On 11/06/2015 09:25 PM, Gregory Farnum wrote:
> http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/
> 
> :)
> 

Thanks. I spent the better part of the last two days trying to follow
the advice to "... start that ceph-osd and things will recover.", but
did not succeed in reviving the crashed osd :(
I do not understand the message the osd is giving, since the files
appear to be there:

beta ~ # ls -lrt /var/lib/ceph/osd/ceph-2/
total 1048656
-rw-r--r-- 1 root root         37 Oct 26 16:25 fsid
-rw-r--r-- 1 root root          4 Oct 26 16:25 store_version
-rw-r--r-- 1 root root         53 Oct 26 16:25 superblock
-rw-r--r-- 1 root root         21 Oct 26 16:25 magic
-rw-r--r-- 1 root root          2 Oct 26 16:25 whoami
-rw-r--r-- 1 root root         37 Oct 26 16:25 ceph_fsid
-rw-r--r-- 1 root root          6 Oct 26 16:25 ready
-rw------- 1 root root         56 Oct 26 16:25 keyring
drwxr-xr-x 1 root root        752 Oct 26 16:47 snap_16793
drwxr-xr-x 1 root root        752 Oct 26 16:47 snap_16773
drwxr-xr-x 1 root root        230 Oct 30 01:01 snap_242352
drwxr-xr-x 1 root root        230 Oct 30 01:01 snap_242378
-rw-r--r-- 1 root root 1073741824 Oct 30 01:02 journal
drwxr-xr-x 1 root root        256 Nov  6 21:55 current

and current is listed as a subvolume as well:

btrfs subvolume list /var/lib/ceph/osd/ceph-2/
ID 8005 gen 8336 top level 5 path snap_242352
ID 8006 gen 8467 top level 5 path snap_242378
ID 8070 gen 8468 top level 5 path current

Still, the osd complains "current/ missing entirely (unusual, but
okay)" and then fails to mount the object store.
Is this something where I should give up on the osd completely, mark it
as lost and try to go on from there?
The machine on which the osd runs did not have any other issues; only
the osd apparently self-destructed ~3.5 days after it was added.
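
To double-check what is actually on disk, something like the following
could be used. It is only a sketch: it assumes that FileStore records its
committed op sequence in a file named commit_op_seq inside current/ and
inside each snap_<N> directory (an assumption, not something the log
states), and the function names are made up for illustration. It looks
only at directory names and does not check whether a directory really is
a btrfs subvolume, so it may list more snaps than the osd itself considers.

```python
import re
from pathlib import Path

# Snapshot directories are named snap_<sequence number>, as in the listing above.
SNAP_RE = re.compile(r"^snap_(\d+)$")


def list_snap_seqs(osd_dir):
    """Return the sorted sequence numbers of all snap_<N> directories."""
    seqs = []
    for entry in Path(osd_dir).iterdir():
        m = SNAP_RE.match(entry.name)
        if m and entry.is_dir():
            seqs.append(int(m.group(1)))
    return sorted(seqs)


def read_op_seq(osd_dir, subdir="current"):
    """Read <subdir>/commit_op_seq (assumed file name); None if missing or unreadable."""
    path = Path(osd_dir) / subdir / "commit_op_seq"
    try:
        text = path.read_text().strip()
    except OSError:
        return None
    return int(text) if text.isdigit() else None


def summarize(osd_dir):
    """Collect the snapshot numbers and the op sequence found in each location."""
    seqs = list_snap_seqs(osd_dir)
    return {
        "snaps": seqs,
        "newest_snap": seqs[-1] if seqs else None,
        "current_op_seq": read_op_seq(osd_dir),
        "snap_op_seqs": {s: read_op_seq(osd_dir, f"snap_{s}") for s in seqs},
    }
```

Calling summarize("/var/lib/ceph/osd/ceph-2") would show whether current/
and the snapshots carry a usable sequence number at all; a value of None
for current_op_seq would at least be consistent with the "op seq is 0"
complaint.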

Or is the recovery of the osd simple (enough) and I just missed the
point somewhere? ;)

thanks in advance
	Philipp

The log of an attempted start of the osd continues to give:

2015-11-06 21:41:53.213174 7f44755a77c0  0 ceph version 0.94.3
(95cefea9fd9ab740263bf8bb4796fd864d9afe2b), process ceph-osd, pid 3751
2015-11-06 21:41:53.254418 7f44755a77c0 10
filestore(/var/lib/ceph/osd/ceph-2) dump_stop
2015-11-06 21:41:53.275694 7f44755a77c0 10
ErasureCodePluginSelectJerasure: load: jerasure_sse4
2015-11-06 21:41:53.291133 7f44755a77c0 10 load: jerasure load: lrc
2015-11-06 21:41:53.291543 7f44755a77c0  5
filestore(/var/lib/ceph/osd/ceph-2) test_mount basedir
/var/lib/ceph/osd/ceph-2 journal /var/lib/ceph/osd/ceph-2/journal
2015-11-06 21:41:53.292043 7f44755a77c0  2 osd.2 0 mounting
/var/lib/ceph/osd/ceph-2 /var/lib/ceph/osd/ceph-2/journal
2015-11-06 21:41:53.292152 7f44755a77c0  5
filestore(/var/lib/ceph/osd/ceph-2) basedir /var/lib/ceph/osd/ceph-2
journal /var/lib/ceph/osd/ceph-2/journal
2015-11-06 21:41:53.292216 7f44755a77c0 10
filestore(/var/lib/ceph/osd/ceph-2) mount fsid is
2662df9c-fd60-425c-ac89-4fe07a2a1b2f
2015-11-06 21:41:53.292412 7f44755a77c0  0
filestore(/var/lib/ceph/osd/ceph-2) backend btrfs (magic 0x9123683e)
2015-11-06 21:41:59.753329 7f44755a77c0  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-2) detect_features:
FIEMAP ioctl is supported and appears to work
2015-11-06 21:41:59.753395 7f44755a77c0  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-2) detect_features:
FIEMAP ioctl is disabled via 'filestore fiemap' config option
2015-11-06 21:42:00.968438 7f44755a77c0  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-2) detect_features:
syncfs(2) syscall fully supported (by glibc and kernel)
2015-11-06 21:42:00.969431 7f44755a77c0  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-2) detect_feature:
CLONE_RANGE ioctl is supported
2015-11-06 21:42:03.033742 7f44755a77c0  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-2) detect_feature:
SNAP_CREATE is supported
2015-11-06 21:42:03.034262 7f44755a77c0  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-2) detect_feature:
SNAP_DESTROY is supported
2015-11-06 21:42:03.042168 7f44755a77c0  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-2) detect_feature:
START_SYNC is supported (transid 8453)
2015-11-06 21:42:04.144516 7f44755a77c0  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-2) detect_feature:
WAIT_SYNC is supported
2015-11-06 21:42:04.309323 7f44755a77c0  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-2) detect_feature:
SNAP_CREATE_V2 is supported
2015-11-06 21:42:04.310562 7f44755a77c0 10
filestore(/var/lib/ceph/osd/ceph-2)  current/ missing entirely (unusual,
but okay)
2015-11-06 21:42:04.310686 7f44755a77c0 10
filestore(/var/lib/ceph/osd/ceph-2)  most recent snap from
<242352,242378> is 242378
2015-11-06 21:42:04.310763 7f44755a77c0 10
filestore(/var/lib/ceph/osd/ceph-2) mount rolling back to consistent
snap 242378
2015-11-06 21:42:04.310812 7f44755a77c0 10
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-2) rollback_to: to
'snap_242378'
2015-11-06 21:42:06.384894 7f44755a77c0  5
filestore(/var/lib/ceph/osd/ceph-2) mount op_seq is 0
2015-11-06 21:42:06.384968 7f44755a77c0 -1
filestore(/var/lib/ceph/osd/ceph-2) mount initial op seq is 0; something
is wrong
2015-11-06 21:42:06.385027 7f44755a77c0 -1 osd.2 0 OSD:init: unable to
mount object store
2015-11-06 21:42:06.385076 7f44755a77c0 -1  ** ERROR: osd init failed:
(22) Invalid argument
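
Since the mail client wraps the log lines, the actual failures are easy to
miss. A small sketch for pulling them out: it assumes every entry starts
with a timestamp, followed by a hex thread id and a numeric level, and
that negative levels mark errors, which is how the lines above look. The
function names are hypothetical.

```python
import re

# An entry starts with "YYYY-MM-DD HH:MM:SS.ffffff "; anything else is a wrapped continuation.
ENTRY_START = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+ ")
ENTRY = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"(?P<thread>[0-9a-f]+)\s+(?P<level>-?\d+)\s+(?P<msg>.*)$"
)


def join_wrapped(lines):
    """Glue continuation lines back onto the entry they belong to."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if ENTRY_START.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += " " + line
    return entries


def errors(lines):
    """Return (timestamp, message) for every entry with a negative level."""
    result = []
    for entry in join_wrapped(lines):
        m = ENTRY.match(entry)
        if m and int(m.group("level")) < 0:
            result.append((m.group("ts"), m.group("msg")))
    return result
```

Feeding it the lines of ceph-osd.2.log, for example
errors(open("/var/log/ceph/ceph-osd.2.log")), leaves only the entries
logged at a negative level, such as the three -1 entries at the end of
the log above.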



> On Friday, November 6, 2015, Philipp Schwaha <philipp@xxxxxxxxxxx> wrote:
> 
>     Hi,
> 
>     I have an issue with my (small) ceph cluster after an osd failed.
>     ceph -s reports the following:
>         cluster 2752438a-a33e-4df4-b9ec-beae32d00aad
>          health HEALTH_WARN
>                 31 pgs down
>                 31 pgs peering
>                 31 pgs stuck inactive
>                 31 pgs stuck unclean
>          monmap e1: 1 mons at {0=192.168.19.13:6789/0}
>                 election epoch 1, quorum 0 0
>          osdmap e138: 3 osds: 2 up, 2 in
>           pgmap v77979: 64 pgs, 1 pools, 844 GB data, 211 kobjects
>                 1290 GB used, 8021 GB / 9315 GB avail
>                       33 active+clean
>                       31 down+peering
> 
>     I am now unable to map the rbd image; the command will just time out.
>     The log is at the end of the message.
> 
>     Is there a way to recover the osd / the ceph cluster from this?
> 
>     thanks in advance
>             Philipp
> 
> 
> 
>         -2> 2015-10-30 01:04:59.689116 7f4bb741e700  1 heartbeat_map
>     is_healthy 'OSD::osd_tp thread 0x7f4ba13cd700' had timed out after 15
>         -1> 2015-10-30 01:04:59.689140 7f4bb741e700  1 heartbeat_map
>     is_healthy 'OSD::osd_tp thread 0x7f4ba13cd700' had suicide timed out
>     after 150
>          0> 2015-10-30 01:04:59.906546 7f4bb741e700 -1
>     common/HeartbeatMap.cc: In function 'bool
>     ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*,
>     time_t)' thread 7f4bb741e700 time 2015-10-30 01:04:59.689176
>     common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
> 
>      ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
>      1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>     const*)+0x77) [0xb12457]
>      2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*,
>     long)+0x119) [0xa47179]
>      3: (ceph::HeartbeatMap::is_healthy()+0xd6) [0xa47b76]
>      4: (ceph::HeartbeatMap::check_touch_file()+0x18) [0xa48258]
>      5: (CephContextServiceThread::entry()+0x164) [0xb21974]
>      6: (()+0x76f5) [0x7f4bbdb0c6f5]
>      7: (__clone()+0x6d) [0x7f4bbc09cedd]
>      NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>     needed to interpret this.
> 
>     --- logging levels ---
>        0/ 5 none
>        0/ 1 lockdep
>        0/ 1 context
>        1/ 1 crush
>        1/ 5 mds
>        1/ 5 mds_balancer
>        1/ 5 mds_locker
>        1/ 5 mds_log
>        1/ 5 mds_log_expire
>        1/ 5 mds_migrator
>        0/ 1 buffer
>        0/ 1 timer
>        0/ 1 filer
>        0/ 1 striper
>        0/ 1 objecter
>        0/ 5 rados
>        0/ 5 rbd
>        0/ 5 rbd_replay
>        0/ 5 journaler
>        0/ 5 objectcacher
>        0/ 5 client
>        0/ 5 osd
>        0/ 5 optracker
>        0/ 5 objclass
>        1/ 3 filestore
>        1/ 3 keyvaluestore
>        1/ 3 journal
>        0/ 5 ms
>        1/ 5 mon
>        0/10 monc
>        1/ 5 paxos
>        0/ 5 tp
>        1/ 5 auth
>        1/ 5 crypto
>        1/ 1 finisher
>        1/ 5 heartbeatmap
>        1/ 5 perfcounter
>        1/ 5 rgw
>        1/10 civetweb
>        1/ 5 javaclient
>        1/ 5 asok
>        1/ 1 throttle
>        0/ 0 refs
>        1/ 5 xio
>       -2/-2 (syslog threshold)
>       -1/-1 (stderr threshold)
>       max_recent     10000
>       max_new         1000
>       log_file /var/log/ceph/ceph-osd.2.log
>     --- end dump of recent events ---
>     2015-10-30 01:05:00.193324 7f4bb741e700 -1 *** Caught signal
>     (Aborted) **
>      in thread 7f4bb741e700
> 
>      ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
>      1: /usr/bin/ceph-osd() [0xa11c84]
>      2: (()+0x10690) [0x7f4bbdb15690]
>      3: (gsignal()+0x37) [0x7f4bbbfe63c7]
>      4: (abort()+0x16a) [0x7f4bbbfe77fa]
>      5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f4bbc8c7d45]
>      6: (()+0x5dda7) [0x7f4bbc8c5da7]
>      7: (()+0x5ddf2) [0x7f4bbc8c5df2]
>      8: (()+0x5e008) [0x7f4bbc8c6008]
>      9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>     const*)+0x252) [0xb12632]
>      10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*,
>     long)+0x119) [0xa47179]
>      11: (ceph::HeartbeatMap::is_healthy()+0xd6) [0xa47b76]
>      12: (ceph::HeartbeatMap::check_touch_file()+0x18) [0xa48258]
>      13: (CephContextServiceThread::entry()+0x164) [0xb21974]
>      14: (()+0x76f5) [0x7f4bbdb0c6f5]
>      15: (__clone()+0x6d) [0x7f4bbc09cedd]
>      NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>     needed to interpret this.
> 
>     --- begin dump of recent events ---
>          0> 2015-10-30 01:05:00.193324 7f4bb741e700 -1 *** Caught signal
>     (Aborted) **
>      in thread 7f4bb741e700
> 
>      ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
>      1: /usr/bin/ceph-osd() [0xa11c84]
>      2: (()+0x10690) [0x7f4bbdb15690]
>      3: (gsignal()+0x37) [0x7f4bbbfe63c7]
>      4: (abort()+0x16a) [0x7f4bbbfe77fa]
>      5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f4bbc8c7d45]
>      6: (()+0x5dda7) [0x7f4bbc8c5da7]
>      7: (()+0x5ddf2) [0x7f4bbc8c5df2]
>      8: (()+0x5e008) [0x7f4bbc8c6008]
>      9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>     const*)+0x252) [0xb12632]
>      10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*,
>     long)+0x119) [0xa47179]
>      11: (ceph::HeartbeatMap::is_healthy()+0xd6) [0xa47b76]
>      12: (ceph::HeartbeatMap::check_touch_file()+0x18) [0xa48258]
>      13: (CephContextServiceThread::entry()+0x164) [0xb21974]
>      14: (()+0x76f5) [0x7f4bbdb0c6f5]
>      15: (__clone()+0x6d) [0x7f4bbc09cedd]
>      NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>     needed to interpret this.
> 
>     --- logging levels ---
>        0/ 5 none
>        0/ 1 lockdep
>        0/ 1 context
>        1/ 1 crush
>        1/ 5 mds
>        1/ 5 mds_balancer
>        1/ 5 mds_locker
>        1/ 5 mds_log
>        1/ 5 mds_log_expire
>        1/ 5 mds_migrator
>        0/ 1 buffer
>        0/ 1 timer
>        0/ 1 filer
>        0/ 1 striper
>        0/ 1 objecter
>        0/ 5 rados
>        0/ 5 rbd
>        0/ 5 rbd_replay
>        0/ 5 journaler
>        0/ 5 objectcacher
>        0/ 5 client
>        0/ 5 osd
>        0/ 5 optracker
>        0/ 5 objclass
>        1/ 3 filestore
>        1/ 3 keyvaluestore
>        1/ 3 journal
>        0/ 5 ms
>        1/ 5 mon
>        0/10 monc
>        1/ 5 paxos
>        0/ 5 tp
>        1/ 5 auth
>        1/ 5 crypto
>        1/ 1 finisher
>        1/ 5 heartbeatmap
>        1/ 5 perfcounter
>        1/ 5 rgw
>        1/10 civetweb
>        1/ 5 javaclient
>        1/ 5 asok
>        1/ 1 throttle
>        0/ 0 refs
>        1/ 5 xio
>       -2/-2 (syslog threshold)
>       -1/-1 (stderr threshold)
>       max_recent     10000
>       max_new         1000
>       log_file /var/log/ceph/ceph-osd.2.log
>     --- end dump of recent events ---
>     2015-10-30 01:07:00.920675 7f0ed0d067c0  0 ceph version 0.94.3
>     (95cefea9fd9ab740263bf8bb4796fd864d9afe2b), process ceph-osd, pid 14210
>     2015-10-30 01:07:01.096259 7f0ed0d067c0  0
>     filestore(/var/lib/ceph/osd/ceph-2) backend btrfs (magic 0x9123683e)
>     2015-10-30 01:07:01.099472 7f0ed0d067c0  0
>     genericfilestorebackend(/var/lib/ceph/osd/ceph-2) detect_features:
>     FIEMAP ioctl is supported and appears to work
>     2015-10-30 01:07:01.099511 7f0ed0d067c0  0
>     genericfilestorebackend(/var/lib/ceph/osd/ceph-2) detect_features:
>     FIEMAP ioctl is disabled via 'filestore fiemap' config option
>     2015-10-30 01:07:02.681342 7f0ed0d067c0  0
>     genericfilestorebackend(/var/lib/ceph/osd/ceph-2) detect_features:
>     syncfs(2) syscall fully supported (by glibc and kernel)
>     2015-10-30 01:07:02.682285 7f0ed0d067c0  0
>     btrfsfilestorebackend(/var/lib/ceph/osd/ceph-2) detect_feature:
>     CLONE_RANGE ioctl is supported
>     2015-10-30 01:07:04.508905 7f0ed0d067c0  0
>     btrfsfilestorebackend(/var/lib/ceph/osd/ceph-2)
>     detect_feature: SNAP_CREATE is supported
>     2015-10-30 01:07:04.509418 7f0ed0d067c0  0
>     btrfsfilestorebackend(/var/lib/ceph/osd/ceph-2)
>     detect_feature: SNAP_DESTROY is supported
>     2015-10-30 01:07:04.518728 7f0ed0d067c0  0
>     btrfsfilestorebackend(/var/lib/ceph/osd/ceph-2) detect_feature:
>     START_SYNC is supported (transid 8343)
>     2015-10-30 01:07:05.524109 7f0ed0d067c0  0
>     btrfsfilestorebackend(/var/lib/ceph/osd/ceph-2) detect_feature:
>     WAIT_SYNC is supported
>     2015-10-30 01:07:05.705014 7f0ed0d067c0  0
>     btrfsfilestorebackend(/var/lib/ceph/osd/ceph-2) detect_feature:
>     SNAP_CREATE_V2 is supported
>     2015-10-30 01:07:06.051275 7f0ed0d067c0  0
>     btrfsfilestorebackend(/var/lib/ceph/osd/ceph-2) rollback_to: error
>     removing old current subvol: (1) Operation not permitted
>     2015-10-30 01:07:07.655679 7f0ed0d067c0 -1
>     filestore(/var/lib/ceph/osd/ceph-2) mount initial op seq is 0; something
>     is wrong
>     2015-10-30 01:07:07.655801 7f0ed0d067c0 -1 osd.2 0 OSD:init: unable to
>     mount object store
>     2015-10-30 01:07:07.655821 7f0ed0d067c0 -1  ** ERROR: osd init
>     failed: (22) Invalid argument
> 
>     _______________________________________________
>     ceph-users mailing list
>     ceph-users@xxxxxxxxxxxxxx
>     http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
