It looks like a LevelDB problem: the backtrace aborts while LevelDB replays its log during FileStore::mount(). Could you just remove this OSD and add a new one to get the cluster healthy first?

On Wed, Aug 12, 2015 at 1:31 AM, Gerd Jakobovitsch <gerd@xxxxxxxxxxxxx> wrote:
>
> Dear all,
>
> I run a Ceph system with 4 nodes and ~80 OSDs using XFS, currently at 75%
> usage, running Firefly. On Friday I upgraded it from 0.80.8 to 0.80.10, and
> since then several OSDs have crashed and never recovered: trying to start
> one ends in a crash, as shown below.
>
> Is this problem known? Is there any configuration that should be checked?
> Is there any way to recover these OSDs without losing all data?
>
> After I marked the OSD as lost, I was left with one incomplete, inactive PG.
> Is there any way to recover it? The data still exists on the crashed OSDs.
>
> Regards.
>
> [(12:58:13) root@spcsnp3 ~]# service ceph start osd.7
> === osd.7 ===
> 2015-08-11 12:58:21.003876 7f17ed52b700 1 monclient(hunting): found mon.spcsmp2
> 2015-08-11 12:58:21.003915 7f17ef493700 5 monclient: authenticate success, global_id 206010466
> create-or-move updated item name 'osd.7' weight 3.64 at location {host=spcsnp3,root=default} to crush map
> Starting Ceph osd.7 on spcsnp3...
> 2015-08-11 12:58:21.279878 7f200fa8f780 0 ceph version 0.80.10 (ea6c958c38df1216bf95c927f143d8b13c4a9e70), process ceph-osd, pid 31918
> starting osd.7 at :/0 osd_data /var/lib/ceph/osd/ceph-7 /var/lib/ceph/osd/ceph-7/journal
> [(12:58:21) root@spcsnp3 ~]# 2015-08-11 12:58:21.348094 7f200fa8f780 10 filestore(/var/lib/ceph/osd/ceph-7) dump_stop
> 2015-08-11 12:58:21.348291 7f200fa8f780 5 filestore(/var/lib/ceph/osd/ceph-7) basedir /var/lib/ceph/osd/ceph-7 journal /var/lib/ceph/osd/ceph-7/journal
> 2015-08-11 12:58:21.348326 7f200fa8f780 10 filestore(/var/lib/ceph/osd/ceph-7) mount fsid is 54c136da-c51c-4799-b2dc-b7988982ee00
> 2015-08-11 12:58:21.349010 7f200fa8f780 0 filestore(/var/lib/ceph/osd/ceph-7) mount detected xfs (libxfs)
> 2015-08-11 12:58:21.349026 7f200fa8f780 1 filestore(/var/lib/ceph/osd/ceph-7) disabling 'filestore replica fadvise' due to known issues with fadvise(DONTNEED) on xfs
> 2015-08-11 12:58:21.353277 7f200fa8f780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features: FIEMAP ioctl is supported and appears to work
> 2015-08-11 12:58:21.353302 7f200fa8f780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
> 2015-08-11 12:58:21.362106 7f200fa8f780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features: syscall(SYS_syncfs, fd) fully supported
> 2015-08-11 12:58:21.362195 7f200fa8f780 0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_feature: extsize is disabled by conf
> 2015-08-11 12:58:21.362701 7f200fa8f780 5 filestore(/var/lib/ceph/osd/ceph-7) mount op_seq is 35490995
> 2015-08-11 12:58:59.383179 7f200fa8f780 -1 *** Caught signal (Aborted) **
> in thread 7f200fa8f780
>
> ceph version 0.80.10 (ea6c958c38df1216bf95c927f143d8b13c4a9e70)
> 1: /usr/bin/ceph-osd() [0xab7562]
> 2: (()+0xf0a0) [0x7f200efcd0a0]
> 3: (gsignal()+0x35) [0x7f200db3f165]
> 4: (abort()+0x180) [0x7f200db423e0]
> 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f200e39589d]
> 6: (()+0x63996) [0x7f200e393996]
> 7: (()+0x639c3) [0x7f200e3939c3]
> 8: (()+0x63bee) [0x7f200e393bee]
> 9: (tc_new()+0x48e) [0x7f200f213aee]
> 10: (std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&)+0x59) [0x7f200e3ef999]
> 11: (std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned long)+0x28) [0x7f200e3f0708]
> 12: (std::string::reserve(unsigned long)+0x30) [0x7f200e3f07f0]
> 13: (std::string::append(char const*, unsigned long)+0xb5) [0x7f200e3f0ab5]
> 14: (leveldb::log::Reader::ReadRecord(leveldb::Slice*, std::string*)+0x2a2) [0x7f200f46ffa2]
> 15: (leveldb::DBImpl::RecoverLogFile(unsigned long, leveldb::VersionEdit*, unsigned long*)+0x180) [0x7f200f468360]
> 16: (leveldb::DBImpl::Recover(leveldb::VersionEdit*)+0x5c2) [0x7f200f46adf2]
> 17: (leveldb::DB::Open(leveldb::Options const&, std::string const&, leveldb::DB**)+0xff) [0x7f200f46b11f]
> 18: (LevelDBStore::do_open(std::ostream&, bool)+0xd8) [0xa123a8]
> 19: (FileStore::mount()+0x18e0) [0x9b7080]
> 20: (OSD::do_convertfs(ObjectStore*)+0x1a) [0x78f52a]
> 21: (main()+0x2234) [0x7331c4]
> 22: (__libc_start_main()+0xfd) [0x7f200db2bead]
> 23: /usr/bin/ceph-osd() [0x736e99]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> --- begin dump of recent events ---
> -66> 2015-08-11 12:58:21.277524 7f200fa8f780 5 asok(0x2800230) register_command perfcounters_dump hook 0x27f0010
> -65> 2015-08-11 12:58:21.277552 7f200fa8f780 5 asok(0x2800230) register_command 1 hook 0x27f0010
> -64> 2015-08-11 12:58:21.277556 7f200fa8f780 5 asok(0x2800230) register_command perf dump hook 0x27f0010
> -63> 2015-08-11 12:58:21.277561 7f200fa8f780 5 asok(0x2800230) register_command perfcounters_schema hook 0x27f0010
> -62> 2015-08-11 12:58:21.277564 7f200fa8f780 5 asok(0x2800230) register_command 2 hook 0x27f0010
> -61> 2015-08-11 12:58:21.277566 7f200fa8f780 5 asok(0x2800230) register_command perf schema hook 0x27f0010
> -60> 2015-08-11 12:58:21.277569 7f200fa8f780 5 asok(0x2800230) register_command config show hook 0x27f0010
> -59> 2015-08-11 12:58:21.277573 7f200fa8f780 5 asok(0x2800230) register_command config set hook 0x27f0010
> -58> 2015-08-11 12:58:21.277575 7f200fa8f780 5 asok(0x2800230) register_command config get hook 0x27f0010
> -57> 2015-08-11 12:58:21.277578 7f200fa8f780 5 asok(0x2800230) register_command log flush hook 0x27f0010
> -56> 2015-08-11 12:58:21.277581 7f200fa8f780 5 asok(0x2800230) register_command log dump hook 0x27f0010
> -55> 2015-08-11 12:58:21.277583 7f200fa8f780 5 asok(0x2800230) register_command log reopen hook 0x27f0010
> -54> 2015-08-11 12:58:21.279878 7f200fa8f780 0 ceph version 0.80.10 (ea6c958c38df1216bf95c927f143d8b13c4a9e70), process ceph-osd, pid 31918
> -53> 2015-08-11 12:58:21.345764 7f200fa8f780 1 -- 10.17.0.7:0/0 learned my addr 10.17.0.7:0/0
> -52> 2015-08-11 12:58:21.345778 7f200fa8f780 1 accepter.accepter.bind my_inst.addr is 10.17.0.7:6813/31918 need_addr=0
> -51> 2015-08-11 12:58:21.345792 7f200fa8f780 1 -- 10.18.0.7:0/0 learned my addr 10.18.0.7:0/0
> -50> 2015-08-11 12:58:21.345795 7f200fa8f780 1 accepter.accepter.bind my_inst.addr is 10.18.0.7:6808/31918 need_addr=0
> -49> 2015-08-11 12:58:21.345805 7f200fa8f780 1 -- 10.18.0.7:0/0 learned my addr 10.18.0.7:0/0
> -48> 2015-08-11 12:58:21.345809 7f200fa8f780 1 accepter.accepter.bind my_inst.addr is 10.18.0.7:6809/31918 need_addr=0
> -47> 2015-08-11 12:58:21.345827 7f200fa8f780 1 -- 10.17.0.7:0/0 learned my addr 10.17.0.7:0/0
> -46> 2015-08-11 12:58:21.345830 7f200fa8f780 1 accepter.accepter.bind my_inst.addr is 10.17.0.7:6824/31918 need_addr=0
> -45> 2015-08-11 12:58:21.345847 7f200fa8f780 1 -- 10.17.0.7:0/0 learned my addr 10.17.0.7:0/0
> -44> 2015-08-11 12:58:21.345851 7f200fa8f780 1 accepter.accepter.bind my_inst.addr is 10.17.0.7:6825/31918 need_addr=0
> -43> 2015-08-11 12:58:21.346156 7f200fa8f780 1 finished global_init_daemonize
> -42> 2015-08-11 12:58:21.348094 7f200fa8f780 10 filestore(/var/lib/ceph/osd/ceph-7) dump_stop
> -41> 2015-08-11 12:58:21.348119 7f200fa8f780 5 asok(0x2800230) init /var/run/ceph/ceph-osd.7.asok
> -40> 2015-08-11 12:58:21.348134 7f200fa8f780 5 asok(0x2800230) bind_and_listen /var/run/ceph/ceph-osd.7.asok
> -39> 2015-08-11 12:58:21.348232 7f200fa8f780 5 asok(0x2800230) register_command 0 hook 0x27ee0b0
> -38> 2015-08-11 12:58:21.348242 7f200fa8f780 5 asok(0x2800230) register_command version hook 0x27ee0b0
> -37> 2015-08-11 12:58:21.348246 7f200fa8f780 5 asok(0x2800230) register_command git_version hook 0x27ee0b0
> -36> 2015-08-11 12:58:21.348250 7f200fa8f780 5 asok(0x2800230) register_command help hook 0x27f00b0
> -35> 2015-08-11 12:58:21.348254 7f200fa8f780 5 asok(0x2800230) register_command get_command_descriptions hook 0x27f0150
> -34> 2015-08-11 12:58:21.348278 7f200b749700 5 asok(0x2800230) entry start
> -33> 2015-08-11 12:58:21.348291 7f200fa8f780 5 filestore(/var/lib/ceph/osd/ceph-7) basedir /var/lib/ceph/osd/ceph-7 journal /var/lib/ceph/osd/ceph-7/journal
> -32> 2015-08-11 12:58:21.348326 7f200fa8f780 10 filestore(/var/lib/ceph/osd/ceph-7) mount fsid is 54c136da-c51c-4799-b2dc-b7988982ee00
> -31> 2015-08-11 12:58:21.349010 7f200fa8f780 0 filestore(/var/lib/ceph/osd/ceph-7) mount detected xfs (libxfs)
> -30> 2015-08-11 12:58:21.349026 7f200fa8f780 1 filestore(/var/lib/ceph/osd/ceph-7) disabling 'filestore replica fadvise' due to known issues with fadvise(DONTNEED) on xfs
> -29> 2015-08-11 12:58:21.353277 7f200fa8f780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features: FIEMAP ioctl is supported and appears to work
> -28> 2015-08-11 12:58:21.353302 7f200fa8f780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
> -27> 2015-08-11 12:58:21.362106 7f200fa8f780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features: syscall(SYS_syncfs, fd) fully supported
> -26> 2015-08-11 12:58:21.362195 7f200fa8f780 0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_feature: extsize is disabled by conf
> -25> 2015-08-11 12:58:21.362701 7f200fa8f780 5 filestore(/var/lib/ceph/osd/ceph-7) mount op_seq is 35490995
> -24> 2015-08-11 12:58:24.458593 7f200b749700 5 asok(0x2800230) AdminSocket: request 'get_command_descriptions' '' to 0x27f0150 returned 1164 bytes
> -23> 2015-08-11 12:58:24.462824 7f200b749700 1 do_command 'config get' 'format:json var:fsid
> -22> 2015-08-11 12:58:24.462850 7f200b749700 1 do_command 'config get' 'format:json var:fsid result is 47 bytes
> -21> 2015-08-11 12:58:24.462853 7f200b749700 5 asok(0x2800230) AdminSocket: request 'config get' '' to 0x27f0010 returned 47 bytes
> -20> 2015-08-11 12:58:24.463194 7f200b749700 5 asok(0x2800230) AdminSocket: request 'get_command_descriptions' '' to 0x27f0150 returned 1164 bytes
> -19> 2015-08-11 12:58:24.467886 7f200b749700 5 asok(0x2800230) AdminSocket: request 'version' '' to 0x27ee0b0 returned 21 bytes
> -18> 2015-08-11 12:58:34.118231 7f200b749700 5 asok(0x2800230) AdminSocket: request 'get_command_descriptions' '' to 0x27f0150 returned 1164 bytes
> -17> 2015-08-11 12:58:34.122484 7f200b749700 1 do_command 'config get' 'format:json var:fsid
> -16> 2015-08-11 12:58:34.122503 7f200b749700 1 do_command 'config get' 'format:json var:fsid result is 47 bytes
> -15> 2015-08-11 12:58:34.122506 7f200b749700 5 asok(0x2800230) AdminSocket: request 'config get' '' to 0x27f0010 returned 47 bytes
> -14> 2015-08-11 12:58:34.122739 7f200b749700 5 asok(0x2800230) AdminSocket: request 'get_command_descriptions' '' to 0x27f0150 returned 1164 bytes
> -13> 2015-08-11 12:58:34.125503 7f200b749700 5 asok(0x2800230) AdminSocket: request 'version' '' to 0x27ee0b0 returned 21 bytes
> -12> 2015-08-11 12:58:44.136424 7f200b749700 5 asok(0x2800230) AdminSocket: request 'get_command_descriptions' '' to 0x27f0150 returned 1164 bytes
> -11> 2015-08-11 12:58:44.140286 7f200b749700 1 do_command 'config get' 'format:json var:fsid
> -10> 2015-08-11 12:58:44.140304 7f200b749700 1 do_command 'config get' 'format:json var:fsid result is 47 bytes
> -9> 2015-08-11 12:58:44.140309 7f200b749700 5 asok(0x2800230) AdminSocket: request 'config get' '' to 0x27f0010 returned 47 bytes
> -8> 2015-08-11 12:58:44.140530 7f200b749700 5 asok(0x2800230) AdminSocket: request 'get_command_descriptions' '' to 0x27f0150 returned 1164 bytes
> -7> 2015-08-11 12:58:44.143236 7f200b749700 5 asok(0x2800230) AdminSocket: request 'version' '' to 0x27ee0b0 returned 21 bytes
> -6> 2015-08-11 12:58:54.493800 7f200b749700 5 asok(0x2800230) AdminSocket: request 'get_command_descriptions' '' to 0x27f0150 returned 1164 bytes
> -5> 2015-08-11 12:58:54.497564 7f200b749700 1 do_command 'config get' 'format:json var:fsid
> -4> 2015-08-11 12:58:54.497586 7f200b749700 1 do_command 'config get' 'format:json var:fsid result is 47 bytes
> -3> 2015-08-11 12:58:54.497591 7f200b749700 5 asok(0x2800230) AdminSocket: request 'config get' '' to 0x27f0010 returned 47 bytes
> -2> 2015-08-11 12:58:54.497905 7f200b749700 5 asok(0x2800230) AdminSocket: request 'get_command_descriptions' '' to 0x27f0150 returned 1164 bytes
> -1> 2015-08-11 12:58:54.500762 7f200b749700 5 asok(0x2800230) AdminSocket: request 'version' '' to 0x27ee0b0 returned 21 bytes
> 0> 2015-08-11 12:58:59.383179 7f200fa8f780 -1 *** Caught signal (Aborted) **
> in thread 7f200fa8f780
>
> ceph version 0.80.10 (ea6c958c38df1216bf95c927f143d8b13c4a9e70)
> 1: /usr/bin/ceph-osd() [0xab7562]
> 2: (()+0xf0a0) [0x7f200efcd0a0]
> 3: (gsignal()+0x35) [0x7f200db3f165]
> 4: (abort()+0x180) [0x7f200db423e0]
> 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f200e39589d]
> 6: (()+0x63996) [0x7f200e393996]
> 7: (()+0x639c3) [0x7f200e3939c3]
> 8: (()+0x63bee) [0x7f200e393bee]
> 9: (tc_new()+0x48e) [0x7f200f213aee]
> 10: (std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&)+0x59) [0x7f200e3ef999]
> 11: (std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned long)+0x28) [0x7f200e3f0708]
> 12: (std::string::reserve(unsigned long)+0x30) [0x7f200e3f07f0]
> 13: (std::string::append(char const*, unsigned long)+0xb5) [0x7f200e3f0ab5]
> 14: (leveldb::log::Reader::ReadRecord(leveldb::Slice*, std::string*)+0x2a2) [0x7f200f46ffa2]
> 15: (leveldb::DBImpl::RecoverLogFile(unsigned long, leveldb::VersionEdit*, unsigned long*)+0x180) [0x7f200f468360]
> 16: (leveldb::DBImpl::Recover(leveldb::VersionEdit*)+0x5c2) [0x7f200f46adf2]
> 17: (leveldb::DB::Open(leveldb::Options const&, std::string const&, leveldb::DB**)+0xff) [0x7f200f46b11f]
> 18: (LevelDBStore::do_open(std::ostream&, bool)+0xd8) [0xa123a8]
> 19: (FileStore::mount()+0x18e0) [0x9b7080]
> 20: (OSD::do_convertfs(ObjectStore*)+0x1a) [0x78f52a]
> 21: (main()+0x2234) [0x7331c4]
> 22: (__libc_start_main()+0xfd) [0x7f200db2bead]
> 23: /usr/bin/ceph-osd() [0x736e99]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> --- logging levels ---
> 0/ 5 none
> 0/ 1 lockdep
> 0/ 1 context
> 1/ 1 crush
> 1/ 5 mds
> 1/ 5 mds_balancer
> 1/ 5 mds_locker
> 1/ 5 mds_log
> 1/ 5 mds_log_expire
> 1/ 5 mds_migrator
> 0/ 1 buffer
> 0/ 1 timer
> 0/ 1 filer
> 0/ 1 striper
> 0/ 1 objecter
> 0/ 5 rados
> 0/ 5 rbd
> 0/ 5 journaler
> 0/ 5 objectcacher
> 0/ 5 client
> 20/ 5 osd
> 0/ 5 optracker
> 0/ 5 objclass
> 20/20 filestore
> 1/ 3 keyvaluestore
> 20/20 journal
> 0/ 5 ms
> 1/ 5 mon
> 5/20 monc
> 1/ 5 paxos
> 0/ 5 tp
> 1/ 5 auth
> 1/ 5 crypto
> 1/ 1 finisher
> 1/ 5 heartbeatmap
> 20/20 perfcounter
> 1/ 5 rgw
> 1/10 civetweb
> 1/ 5 javaclient
> 1/ 5 asok
> 1/ 1 throttle
> -2/-2 (syslog threshold)
> -1/-1 (stderr threshold)
> max_recent 10000
> max_new 1000
> log_file /var/log/ceph/ceph-osd.7.log
> --- end dump of recent events ---
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Best Regards,
Wheat
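
For anyone following the suggestion at the top of this thread, removing a dead OSD from the cluster is usually done with a sequence like the one below. This is only a minimal sketch and was not posted in the thread: the helper name remove_osd and its --run flag are made up for illustration, while the four ceph subcommands are the usual manual-removal steps. By default it only prints the commands, so nothing changes unless --run is passed. It takes the OSD out of the cluster but does not by itself recover the incomplete PG, so it is probably wise to leave the OSD's disk untouched, since the original message says the data still exists there.

```shell
#!/bin/sh
# remove_osd ID [--run]
# Hypothetical helper (not part of Ceph): prints, or with --run executes,
# the usual manual removal steps for one OSD.
remove_osd() {
    id="$1"
    mode="${2:-}"

    # Accept only a bare numeric OSD id such as 7 (not "osd.7").
    case "$id" in
        ''|*[!0-9]*)
            echo "usage: remove_osd <numeric-id> [--run]" >&2
            return 1
            ;;
    esac

    # 1. mark the OSD out so data is rebalanced away from it
    # 2. remove it from the CRUSH map
    # 3. delete its authentication key
    # 4. remove the OSD entry from the cluster
    for cmd in \
        "ceph osd out $id" \
        "ceph osd crush remove osd.$id" \
        "ceph auth del osd.$id" \
        "ceph osd rm $id"
    do
        if [ "$mode" = "--run" ]; then
            # Unquoted on purpose: the string is split into command + args.
            $cmd || return 1
        else
            echo "$cmd"
        fi
    done
}

# Dry run for the OSD from this thread: prints the four commands only.
remove_osd 7
```

Once the cluster reports healthy again, a replacement OSD can be added the same way the original ones were deployed.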