Hi,

Yes, nice. Until all your OSDs fail and you don't know what else to try. Looking at the failure rates, it will happen very soon. I want to recover them. I'm describing what I tried in another mail; let's see if someone can help me.

I'm not doing anything special. I just look at my cluster from time to time, only to find that something else has failed. I will try hard to recover from this situation.
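(By "looking at my cluster" I mean nothing more elaborate than the usual health checks. A minimal sketch with the stock ceph CLI; the pool name rbd is only an example, substitute your own:

    ceph -s                          # overall health and recovery progress
    ceph health detail               # which PGs are degraded/down and why
    ceph osd tree                    # which OSDs are up/down and where they sit
    ceph osd pool get rbd size       # replica count of a pool
    ceph osd pool get rbd min_size   # copies required to keep serving I/O

None of these change cluster state; they only report it.)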
Thank you.

On 26/11/17 16:13, Marc Roos wrote:
If I am not mistaken, the whole idea with the 3 replicas is that you have enough copies to recover from a failed OSD. In my tests this seems to go fine automatically. Are you doing something that is not advised?

-----Original Message-----
From: Gonzalo Aguilar Delgado [mailto:gaguilar@xxxxxxxxxxxxxxxxxx]
Sent: Saturday 25 November 2017 20:44
To: 'ceph-users'
Subject: Another OSD broken today. How can I recover it?

Hello,

I had another blackout with ceph today. It seems that ceph OSDs fail from time to time and are unable to recover. I have 3 OSDs down now: 1 removed from the cluster and 2 down because I'm unable to recover them.

We really need a recovery tool. It's not normal that an OSD breaks and there's no way to recover it. Is there any way to do it?

The last one shows this:

] enter Reset
   -12> 2017-11-25 20:34:19.548891 7f6e5dc158c0  5 osd.4 pg_epoch: 9686 pg[0.34(unlocked)] enter Initial
   -11> 2017-11-25 20:34:19.548983 7f6e5dc158c0  5 osd.4 pg_epoch: 9686 pg[0.34( empty local-les=9685 n=0 ec=404 les/c/f 9685/9685/0 9684/9684/9684) [4,0] r=0 lpr=0 crt=0'0 mlcod 0'0 inactive NIBBLEWISE] exit Initial 0.000091 0 0.000000
   -10> 2017-11-25 20:34:19.548994 7f6e5dc158c0  5 osd.4 pg_epoch: 9686 pg[0.34( empty local-les=9685 n=0 ec=404 les/c/f 9685/9685/0 9684/9684/9684) [4,0] r=0 lpr=0 crt=0'0 mlcod 0'0 inactive NIBBLEWISE] enter Reset
    -9> 2017-11-25 20:34:19.549166 7f6e5dc158c0  5 osd.4 pg_epoch: 9686 pg[10.36(unlocked)] enter Initial
    -8> 2017-11-25 20:34:19.566781 7f6e5dc158c0  5 osd.4 pg_epoch: 9686 pg[10.36( v 9686'7301894 (9686'7298879,9686'7301894] local-les=9685 n=534 ec=419 les/c/f 9685/9686/0 9684/9684/9684) [4,0] r=0 lpr=0 crt=9686'7301894 lcod 0'0 mlcod 0'0 inactive NIBBLEWISE] exit Initial 0.017614 0 0.000000
    -7> 2017-11-25 20:34:19.566811 7f6e5dc158c0  5 osd.4 pg_epoch: 9686 pg[10.36( v 9686'7301894 (9686'7298879,9686'7301894] local-les=9685 n=534 ec=419 les/c/f 9685/9686/0 9684/9684/9684) [4,0] r=0 lpr=0 crt=9686'7301894 lcod 0'0 mlcod 0'0 inactive NIBBLEWISE] enter Reset
    -6> 2017-11-25 20:34:19.585411 7f6e5dc158c0  5 osd.4 pg_epoch: 9686 pg[8.5c(unlocked)] enter Initial
    -5> 2017-11-25 20:34:19.602888 7f6e5dc158c0  5 osd.4 pg_epoch: 9686 pg[8.5c( empty local-les=9685 n=0 ec=348 les/c/f 9685/9685/0 9684/9684/9684) [4,0] r=0 lpr=0 crt=0'0 mlcod 0'0 inactive NIBBLEWISE] exit Initial 0.017478 0 0.000000
    -4> 2017-11-25 20:34:19.602912 7f6e5dc158c0  5 osd.4 pg_epoch: 9686 pg[8.5c( empty local-les=9685 n=0 ec=348 les/c/f 9685/9685/0 9684/9684/9684) [4,0] r=0 lpr=0 crt=0'0 mlcod 0'0 inactive NIBBLEWISE] enter Reset
    -3> 2017-11-25 20:34:19.603082 7f6e5dc158c0  5 osd.4 pg_epoch: 9686 pg[9.10(unlocked)] enter Initial
    -2> 2017-11-25 20:34:19.615456 7f6e5dc158c0  5 osd.4 pg_epoch: 9686 pg[9.10( v 9686'2322547 (9031'2319518,9686'2322547] local-les=9685 n=261 ec=417 les/c/f 9685/9685/0 9684/9684/9684) [4,0] r=0 lpr=0 crt=9686'2322547 lcod 0'0 mlcod 0'0 inactive NIBBLEWISE] exit Initial 0.012373 0 0.000000
    -1> 2017-11-25 20:34:19.615481 7f6e5dc158c0  5 osd.4 pg_epoch: 9686 pg[9.10( v 9686'2322547 (9031'2319518,9686'2322547] local-les=9685 n=261 ec=417 les/c/f 9685/9685/0 9684/9684/9684) [4,0] r=0 lpr=0 crt=9686'2322547 lcod 0'0 mlcod 0'0 inactive NIBBLEWISE] enter Reset
     0> 2017-11-25 20:34:19.617400 7f6e5dc158c0 -1 osd/PG.cc: In function 'static int PG::peek_map_epoch(ObjectStore*, spg_t, epoch_t*, ceph::bufferlist*)' thread 7f6e5dc158c0 time 2017-11-25 20:34:19.615633
osd/PG.cc: 3025: FAILED assert(values.size() == 2)

 ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x5562d318d790]
 2: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*, ceph::buffer::list*)+0x661) [0x5562d2b4b601]
 3: (OSD::load_pgs()+0x75a) [0x5562d2a9f8aa]
 4: (OSD::init()+0x2026) [0x5562d2aaaca6]
 5: (main()+0x2ef1) [0x5562d2a1c301]
 6: (__libc_start_main()+0xf0) [0x7f6e5aa75830]
 7: (_start()+0x29) [0x5562d2a5db09]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 newstore
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   1/ 5 kinetic
   1/ 5 fuse
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent 10000
  max_new 1000
  log_file /var/log/ceph/ceph-osd.4.log
--- end dump of recent events ---

2017-11-25 20:34:19.622559 7f6e5dc158c0 -1 *** Caught signal (Aborted) **
 in thread 7f6e5dc158c0 thread_name:ceph-osd

 ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
 1: (()+0x98653e) [0x5562d308d53e]
 2: (()+0x11390) [0x7f6e5caee390]
 3: (gsignal()+0x38) [0x7f6e5aa8a428]
 4: (abort()+0x16a) [0x7f6e5aa8c02a]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x26b) [0x5562d318d97b]
 6: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*, ceph::buffer::list*)+0x661) [0x5562d2b4b601]
 7: (OSD::load_pgs()+0x75a) [0x5562d2a9f8aa]
 8: (OSD::init()+0x2026) [0x5562d2aaaca6]
 9: (main()+0x2ef1) [0x5562d2a1c301]
 10: (__libc_start_main()+0xf0) [0x7f6e5aa75830]
 11: (_start()+0x29) [0x5562d2a5db09]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
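A note on the crash itself: the assert that kills osd.4, FAILED assert(values.size() == 2) in PG::peek_map_epoch(), fires while the OSD is loading its PGs at startup, when the per-PG epoch metadata it expects to read back from the store is incomplete. In other words, one damaged PG is keeping the whole OSD from booting. If only a single PG is affected and it has a healthy replica elsewhere (the log shows acting set [4,0], so osd.0 should hold a copy), one unsupported workaround people try is to export and then remove that PG with ceph-objectstore-tool so the OSD can start again. This is a sketch only: it assumes a filestore OSD mounted at /var/lib/ceph/osd/ceph-4, and it guesses that pg 9.10, the last one touched in the log, is the broken one. Image the disk before touching anything:

    # with the OSD stopped, list the PGs present on this OSD's store
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-4 \
        --journal-path /var/lib/ceph/osd/ceph-4/journal --op list-pgs

    # export the suspect PG to a file first, so nothing is lost
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-4 \
        --journal-path /var/lib/ceph/osd/ceph-4/journal \
        --pgid 9.10 --op export --file /root/pg.9.10.export

    # then remove it from this OSD; after a restart the PG can
    # backfill from its surviving replica
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-4 \
        --journal-path /var/lib/ceph/osd/ceph-4/journal \
        --pgid 9.10 --op remove

Whether this is safe depends entirely on the state of the other replicas, so treat it as a last resort, not a recovery tool.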
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com