Hello,

I had another blackout with Ceph today. It seems that Ceph OSDs crash from time to time and are then unable to recover. I have 3 OSDs down now: one removed from the cluster and two down because I'm unable to recover them.
We really need a recovery tool. It's not normal that an OSD breaks and there's no way to recover it. Is there any way to do it?
The last one shows this:
] enter Reset
-12> 2017-11-25 20:34:19.548891 7f6e5dc158c0 5 osd.4
pg_epoch: 9686 pg[0.34(unlocked)] enter Initial
-11> 2017-11-25 20:34:19.548983 7f6e5dc158c0 5 osd.4
pg_epoch: 9686 pg[0.34( empty local-les=9685 n=0 ec=404 les/c/f
9685/9685/0 9684/9684/9684) [4,0] r=0 lpr=0 crt=0'0 mlcod 0'0
inactive NIBBLEWISE] exit Initial 0.000091 0 0.000000
-10> 2017-11-25 20:34:19.548994 7f6e5dc158c0 5 osd.4
pg_epoch: 9686 pg[0.34( empty local-les=9685 n=0 ec=404 les/c/f
9685/9685/0 9684/9684/9684) [4,0] r=0 lpr=0 crt=0'0 mlcod 0'0
inactive NIBBLEWISE] enter Reset
-9> 2017-11-25 20:34:19.549166 7f6e5dc158c0 5 osd.4
pg_epoch: 9686 pg[10.36(unlocked)] enter Initial
-8> 2017-11-25 20:34:19.566781 7f6e5dc158c0 5 osd.4
pg_epoch: 9686 pg[10.36( v 9686'7301894
(9686'7298879,9686'7301894] local-les=9685 n=534 ec=419 les/c/f
9685/9686/0 9684/9684/9684) [4,0] r=0 lpr=0 crt=9686'7301894
lcod 0'0 mlcod 0'0 inactive NIBBLEWISE] exit Initial 0.017614 0
0.000000
-7> 2017-11-25 20:34:19.566811 7f6e5dc158c0 5 osd.4
pg_epoch: 9686 pg[10.36( v 9686'7301894
(9686'7298879,9686'7301894] local-les=9685 n=534 ec=419 les/c/f
9685/9686/0 9684/9684/9684) [4,0] r=0 lpr=0 crt=9686'7301894
lcod 0'0 mlcod 0'0 inactive NIBBLEWISE] enter Reset
-6> 2017-11-25 20:34:19.585411 7f6e5dc158c0 5 osd.4
pg_epoch: 9686 pg[8.5c(unlocked)] enter Initial
-5> 2017-11-25 20:34:19.602888 7f6e5dc158c0 5 osd.4
pg_epoch: 9686 pg[8.5c( empty local-les=9685 n=0 ec=348 les/c/f
9685/9685/0 9684/9684/9684) [4,0] r=0 lpr=0 crt=0'0 mlcod 0'0
inactive NIBBLEWISE] exit Initial 0.017478 0 0.000000
-4> 2017-11-25 20:34:19.602912 7f6e5dc158c0 5 osd.4
pg_epoch: 9686 pg[8.5c( empty local-les=9685 n=0 ec=348 les/c/f
9685/9685/0 9684/9684/9684) [4,0] r=0 lpr=0 crt=0'0 mlcod 0'0
inactive NIBBLEWISE] enter Reset
-3> 2017-11-25 20:34:19.603082 7f6e5dc158c0 5 osd.4
pg_epoch: 9686 pg[9.10(unlocked)] enter Initial
-2> 2017-11-25 20:34:19.615456 7f6e5dc158c0 5 osd.4
pg_epoch: 9686 pg[9.10( v 9686'2322547
(9031'2319518,9686'2322547] local-les=9685 n=261 ec=417 les/c/f
9685/9685/0 9684/9684/9684) [4,0] r=0 lpr=0 crt=9686'2322547
lcod 0'0 mlcod 0'0 inactive NIBBLEWISE] exit Initial 0.012373 0
0.000000
-1> 2017-11-25 20:34:19.615481 7f6e5dc158c0 5 osd.4
pg_epoch: 9686 pg[9.10( v 9686'2322547
(9031'2319518,9686'2322547] local-les=9685 n=261 ec=417 les/c/f
9685/9685/0 9684/9684/9684) [4,0] r=0 lpr=0 crt=9686'2322547
lcod 0'0 mlcod 0'0 inactive NIBBLEWISE] enter Reset
0> 2017-11-25 20:34:19.617400 7f6e5dc158c0 -1 osd/PG.cc:
In function 'static int PG::peek_map_epoch(ObjectStore*, spg_t,
epoch_t*, ceph::bufferlist*)' thread 7f6e5dc158c0 time
2017-11-25 20:34:19.615633
osd/PG.cc: 3025: FAILED assert(values.size() == 2)
ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
1: (ceph::__ceph_assert_fail(char const*, char const*, int,
char const*)+0x80) [0x5562d318d790]
2: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*,
ceph::buffer::list*)+0x661) [0x5562d2b4b601]
3: (OSD::load_pgs()+0x75a) [0x5562d2a9f8aa]
4: (OSD::init()+0x2026) [0x5562d2aaaca6]
5: (main()+0x2ef1) [0x5562d2a1c301]
6: (__libc_start_main()+0xf0) [0x7f6e5aa75830]
7: (_start()+0x29) [0x5562d2a5db09]
NOTE: a copy of the executable, or `objdump -rdS
<executable>` is needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 xio
1/ 5 compressor
1/ 5 newstore
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
1/ 5 kinetic
1/ 5 fuse
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.4.log
--- end dump of recent events ---
2017-11-25 20:34:19.622559 7f6e5dc158c0 -1 *** Caught signal
(Aborted) **
in thread 7f6e5dc158c0 thread_name:ceph-osd
ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
1: (()+0x98653e) [0x5562d308d53e]
2: (()+0x11390) [0x7f6e5caee390]
3: (gsignal()+0x38) [0x7f6e5aa8a428]
4: (abort()+0x16a) [0x7f6e5aa8c02a]
5: (ceph::__ceph_assert_fail(char const*, char const*, int,
char const*)+0x26b) [0x5562d318d97b]
6: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*,
ceph::buffer::list*)+0x661) [0x5562d2b4b601]
7: (OSD::load_pgs()+0x75a) [0x5562d2a9f8aa]
8: (OSD::init()+0x2026) [0x5562d2aaaca6]
9: (main()+0x2ef1) [0x5562d2a1c301]
10: (__libc_start_main()+0xf0) [0x7f6e5aa75830]
11: (_start()+0x29) [0x5562d2a5db09]
NOTE: a copy of the executable, or `objdump -rdS
<executable>` is needed to interpret this.
--- begin dump of recent events ---
0> 2017-11-25 20:34:19.622559 7f6e5dc158c0 -1 *** Caught
signal (Aborted) **
in thread 7f6e5dc158c0 thread_name:ceph-osd
ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
1: (()+0x98653e) [0x5562d308d53e]
2: (()+0x11390) [0x7f6e5caee390]
3: (gsignal()+0x38) [0x7f6e5aa8a428]
4: (abort()+0x16a) [0x7f6e5aa8c02a]
5: (ceph::__ceph_assert_fail(char const*, char const*, int,
char const*)+0x26b) [0x5562d318d97b]
6: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*,
ceph::buffer::list*)+0x661) [0x5562d2b4b601]
7: (OSD::load_pgs()+0x75a) [0x5562d2a9f8aa]
8: (OSD::init()+0x2026) [0x5562d2aaaca6]
9: (main()+0x2ef1) [0x5562d2a1c301]
10: (__libc_start_main()+0xf0) [0x7f6e5aa75830]
11: (_start()+0x29) [0x5562d2a5db09]
NOTE: a copy of the executable, or `objdump -rdS
<executable>` is needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 xio
1/ 5 compressor
1/ 5 newstore
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
1/ 5 kinetic
1/ 5 fuse
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.4.log
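
In case it helps, the only workaround I can think of so far is to export and then remove the PG that trips the assert in PG::peek_map_epoch() with ceph-objectstore-tool, so the OSD can at least start again. This is only a sketch and untested on my side; the OSD data path and the pg id (9.10, the last PG entering Initial before the assert above) are guesses for my setup:

```shell
# Untested sketch -- paths and pg id are assumptions for my cluster (osd.4, filestore).
# The OSD daemon must be stopped before running ceph-objectstore-tool.
OSD_PATH=/var/lib/ceph/osd/ceph-4
JOURNAL="$OSD_PATH/journal"
PGID=9.10   # last PG entering Initial before the assert in the log above

# Back up the suspect PG before touching anything.
ceph-objectstore-tool --data-path "$OSD_PATH" --journal-path "$JOURNAL" \
    --pgid "$PGID" --op export --file "/root/pg-$PGID.export"

# Remove it so OSD::load_pgs() no longer hits the corrupted PG metadata.
ceph-objectstore-tool --data-path "$OSD_PATH" --journal-path "$JOURNAL" \
    --pgid "$PGID" --op remove
```

If the cluster still has enough replicas, the PG should backfill onto the OSD once it rejoins; otherwise the export could in principle be imported elsewhere with --op import. But I'd really like to hear if there's a supported way to do this.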