Hi,

Yes, nice. Until all your OSDs fail and you don't know what else to try. Looking at the failure rates, it will happen very soon. I want to recover them. I'm describing what I tried in another mail; let's see if someone can help me.

I'm not doing anything special. I just look at my cluster from time to time, only to find that something else has failed. I will try hard to recover from this situation.
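(By "looking at my cluster" I mean nothing more elaborate than the usual health checks. A minimal sketch with the stock ceph CLI; the pool name rbd is only an example, substitute your own:

    ceph -s                          # overall health and recovery progress
    ceph health detail               # which PGs are degraded/down and why
    ceph osd tree                    # which OSDs are up/down and where they sit
    ceph osd pool get rbd size       # replica count of a pool
    ceph osd pool get rbd min_size   # copies required to keep serving I/O

None of these change cluster state; they only report it.)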
Thank you.

On 26/11/17 16:13, Marc Roos wrote:
If I am not mistaken, the whole idea with the 3 replicas is that you have enough copies to recover from a failed OSD. In my tests this seems to go fine automatically. Are you doing something that is not advised?

-----Original Message-----
From: Gonzalo Aguilar Delgado [mailto:gaguilar@xxxxxxxxxxxxxxxxxx]
Sent: Saturday 25 November 2017 20:44
To: 'ceph-users'
Subject: Another OSD broken today. How can I recover it?

Hello,

I had another blackout with ceph today. It seems that ceph OSDs fail from time to time and are unable to recover. I have 3 OSDs down now: 1 removed from the cluster and 2 down because I'm unable to recover them.

We really need a recovery tool. It's not normal that an OSD breaks and there's no way to recover it. Is there any way to do it?

The last one shows this:

] enter Reset
   -12> 2017-11-25 20:34:19.548891 7f6e5dc158c0  5 osd.4 pg_epoch: 9686 pg[0.34(unlocked)] enter Initial
   -11> 2017-11-25 20:34:19.548983 7f6e5dc158c0  5 osd.4 pg_epoch: 9686 pg[0.34( empty local-les=9685 n=0 ec=404 les/c/f 9685/9685/0 9684/9684/9684) [4,0] r=0 lpr=0 crt=0'0 mlcod 0'0 inactive NIBBLEWISE] exit Initial 0.000091 0 0.000000
   -10> 2017-11-25 20:34:19.548994 7f6e5dc158c0  5 osd.4 pg_epoch: 9686 pg[0.34( empty local-les=9685 n=0 ec=404 les/c/f 9685/9685/0 9684/9684/9684) [4,0] r=0 lpr=0 crt=0'0 mlcod 0'0 inactive NIBBLEWISE] enter Reset
    -9> 2017-11-25 20:34:19.549166 7f6e5dc158c0  5 osd.4 pg_epoch: 9686 pg[10.36(unlocked)] enter Initial
    -8> 2017-11-25 20:34:19.566781 7f6e5dc158c0  5 osd.4 pg_epoch: 9686 pg[10.36( v 9686'7301894 (9686'7298879,9686'7301894] local-les=9685 n=534 ec=419 les/c/f 9685/9686/0 9684/9684/9684) [4,0] r=0 lpr=0 crt=9686'7301894 lcod 0'0 mlcod 0'0 inactive NIBBLEWISE] exit Initial 0.017614 0 0.000000
    -7> 2017-11-25 20:34:19.566811 7f6e5dc158c0  5 osd.4 pg_epoch: 9686 pg[10.36( v 9686'7301894 (9686'7298879,9686'7301894] local-les=9685 n=534 ec=419 les/c/f 9685/9686/0 9684/9684/9684) [4,0] r=0 lpr=0 crt=9686'7301894 lcod 0'0 mlcod 0'0 inactive NIBBLEWISE] enter Reset
    -6> 2017-11-25 20:34:19.585411 7f6e5dc158c0  5 osd.4 pg_epoch: 9686 pg[8.5c(unlocked)] enter Initial
    -5> 2017-11-25 20:34:19.602888 7f6e5dc158c0  5 osd.4 pg_epoch: 9686 pg[8.5c( empty local-les=9685 n=0 ec=348 les/c/f 9685/9685/0 9684/9684/9684) [4,0] r=0 lpr=0 crt=0'0 mlcod 0'0 inactive NIBBLEWISE] exit Initial 0.017478 0 0.000000
    -4> 2017-11-25 20:34:19.602912 7f6e5dc158c0  5 osd.4 pg_epoch: 9686 pg[8.5c( empty local-les=9685 n=0 ec=348 les/c/f 9685/9685/0 9684/9684/9684) [4,0] r=0 lpr=0 crt=0'0 mlcod 0'0 inactive NIBBLEWISE] enter Reset
    -3> 2017-11-25 20:34:19.603082 7f6e5dc158c0  5 osd.4 pg_epoch: 9686 pg[9.10(unlocked)] enter Initial
    -2> 2017-11-25 20:34:19.615456 7f6e5dc158c0  5 osd.4 pg_epoch: 9686 pg[9.10( v 9686'2322547 (9031'2319518,9686'2322547] local-les=9685 n=261 ec=417 les/c/f 9685/9685/0 9684/9684/9684) [4,0] r=0 lpr=0 crt=9686'2322547 lcod 0'0 mlcod 0'0 inactive NIBBLEWISE] exit Initial 0.012373 0 0.000000
    -1> 2017-11-25 20:34:19.615481 7f6e5dc158c0  5 osd.4 pg_epoch: 9686 pg[9.10( v 9686'2322547 (9031'2319518,9686'2322547] local-les=9685 n=261 ec=417 les/c/f 9685/9685/0 9684/9684/9684) [4,0] r=0 lpr=0 crt=9686'2322547 lcod 0'0 mlcod 0'0 inactive NIBBLEWISE] enter Reset
     0> 2017-11-25 20:34:19.617400 7f6e5dc158c0 -1 osd/PG.cc: In function 'static int PG::peek_map_epoch(ObjectStore*, spg_t, epoch_t*, ceph::bufferlist*)' thread 7f6e5dc158c0 time 2017-11-25 20:34:19.615633
osd/PG.cc: 3025: FAILED assert(values.size() == 2)

 ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x5562d318d790]
 2: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*, ceph::buffer::list*)+0x661) [0x5562d2b4b601]
 3: (OSD::load_pgs()+0x75a) [0x5562d2a9f8aa]
 4: (OSD::init()+0x2026) [0x5562d2aaaca6]
 5: (main()+0x2ef1) [0x5562d2a1c301]
 6: (__libc_start_main()+0xf0) [0x7f6e5aa75830]
 7: (_start()+0x29) [0x5562d2a5db09]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 newstore
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   1/ 5 kinetic
   1/ 5 fuse
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent 10000
  max_new 1000
  log_file /var/log/ceph/ceph-osd.4.log
--- end dump of recent events ---

2017-11-25 20:34:19.622559 7f6e5dc158c0 -1 *** Caught signal (Aborted) **
 in thread 7f6e5dc158c0 thread_name:ceph-osd

 ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
 1: (()+0x98653e) [0x5562d308d53e]
 2: (()+0x11390) [0x7f6e5caee390]
 3: (gsignal()+0x38) [0x7f6e5aa8a428]
 4: (abort()+0x16a) [0x7f6e5aa8c02a]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x26b) [0x5562d318d97b]
 6: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*, ceph::buffer::list*)+0x661) [0x5562d2b4b601]
 7: (OSD::load_pgs()+0x75a) [0x5562d2a9f8aa]
 8: (OSD::init()+0x2026) [0x5562d2aaaca6]
 9: (main()+0x2ef1) [0x5562d2a1c301]
 10: (__libc_start_main()+0xf0) [0x7f6e5aa75830]
 11: (_start()+0x29) [0x5562d2a5db09]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
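A note on the crash itself: the assert that kills osd.4, FAILED assert(values.size() == 2) in PG::peek_map_epoch(), fires while the OSD is loading its PGs at startup, when the per-PG epoch metadata it expects to read back from the store is incomplete. In other words, one damaged PG is keeping the whole OSD from booting. If only a single PG is affected and it has a healthy replica elsewhere (the log shows acting set [4,0], so osd.0 should hold a copy), one unsupported workaround people try is to export and then remove that PG with ceph-objectstore-tool so the OSD can start again. This is a sketch only: it assumes a filestore OSD mounted at /var/lib/ceph/osd/ceph-4, and it guesses that pg 9.10, the last one touched in the log, is the broken one. Image the disk before touching anything:

    # with the OSD stopped, list the PGs present on this OSD's store
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-4 \
        --journal-path /var/lib/ceph/osd/ceph-4/journal --op list-pgs

    # export the suspect PG to a file first, so nothing is lost
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-4 \
        --journal-path /var/lib/ceph/osd/ceph-4/journal \
        --pgid 9.10 --op export --file /root/pg.9.10.export

    # then remove it from this OSD; after a restart the PG can
    # backfill from its surviving replica
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-4 \
        --journal-path /var/lib/ceph/osd/ceph-4/journal \
        --pgid 9.10 --op remove

Whether this is safe depends entirely on the state of the other replicas, so treat it as a last resort, not a recovery tool.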
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com