After a change in our crushmap to support different rulesets for different disk technologies, we ran into some strange behavior, such as OSDs appearing in the wrong places in the tree after we restarted them. During one of these attempts, some PGs would not synchronize no matter what we tried. At some point we brought the whole cluster down and then back up, and since then our mon services do not start.
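For context, the kind of split we mean looks roughly like the sketch below. This is only a hand-written illustration with made-up bucket and rule names to show the shape of the change, not our actual map; the devices, type definitions and host buckets it refers to are omitted:

    # Sketch only: separate CRUSH roots and rulesets per disk technology.
    # Host buckets such as host-a-hdd / host-a-ssd are assumed to be defined
    # earlier in the map; names and weights here are examples.
    root hdd {
            id -10
            alg straw
            hash 0
            item host-a-hdd weight 2.000
            item host-b-hdd weight 2.000
    }

    root ssd {
            id -20
            alg straw
            hash 0
            item host-a-ssd weight 0.500
            item host-b-ssd weight 0.500
    }

    rule hdd_rule {
            ruleset 1
            type replicated
            min_size 1
            max_size 10
            step take hdd
            step chooseleaf firstn 0 type host
            step emit
    }

    rule ssd_rule {
            ruleset 2
            type replicated
            min_size 1
            max_size 10
            step take ssd
            step chooseleaf firstn 0 type host
            step emit
    }

The intent is that each pool's crush_ruleset setting then selects one of the two trees.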
We get the following error in our logs:

2014-12-02 11:50:42.917880 7f97f0b02780 1 mon.staging-rd0-00@-1(probing).paxosservice(auth 31517..31635) refresh upgraded, format 0 -> 1
2014-12-02 11:50:42.918408 7f97f0b02780 1 -- 62.217.119.10:6789/0 messenger.start
2014-12-02 11:50:42.918555 7f97f0b02780 1 accepter.accepter.start
2014-12-02 11:50:42.918598 7f97f0b02780 0 mon.staging-rd0-00@-1(probing) e11 my rank is now 0 (was -1)
2014-12-02 11:50:42.918607 7f97f0b02780 1 -- 62.217.119.10:6789/0 mark_down_all
2014-12-02 11:50:42.918623 7f97f0b02780 1 mon.staging-rd0-00@0(probing) e11 win_standalone_election
2014-12-02 11:50:42.919188 7f97f0b02780 0 log [INF] : mon.staging-rd0-00@0 won leader election with quorum 0
2014-12-02 11:50:42.919219 7f97f0b02780 1 -- 62.217.119.10:6789/0 --> mon.0 62.217.119.10:6789/0 -- log(1 entries) v1 -- ?+0 0x1fee900
2014-12-02 11:50:42.919273 7f97e8b41700 1 -- 62.217.119.10:6789/0 <== mon.0 62.217.119.10:6789/0 0 ==== log(1 entries) v1 ==== 0+0+0 (0 0 0) 0x1fee900 con 0x1f9e420
2014-12-02 11:50:42.919333 7f97f0b02780 0 log [INF] : monmap e11: 1 mons at {staging-rd0-00=62.217.119.10:6789/0}
2014-12-02 11:50:42.919363 7f97f0b02780 1 -- 62.217.119.10:6789/0 --> mon.0 62.217.119.10:6789/0 -- log(1 entries) v1 -- ?+0 0x1fee6c0
2014-12-02 11:50:42.919485 7f97f0b02780 0 log [INF] : pgmap v10676711: 10240 pgs: 67 stale+active+remapped, 3416 stale+active+clean, 42 active+remapped, 2640 stale+incomplete, 4075 stale+active+degraded; 498 GB data, 1329 GB used, 76132 GB / 77461 GB avail; 719/2885120 objects degraded (0.025%)
2014-12-02 11:50:42.919507 7f97f0b02780 1 -- 62.217.119.10:6789/0 --> mon.0 62.217.119.10:6789/0 -- log(1 entries) v1 -- ?+0 0x1fee480
2014-12-02 11:50:42.919564 7f97f0b02780 0 log [INF] : mdsmap e1: 0/0/1 up
2014-12-02 11:50:42.919579 7f97f0b02780 1 -- 62.217.119.10:6789/0 --> mon.0 62.217.119.10:6789/0 -- log(1 entries) v1 -- ?+0 0x1fee240
2014-12-02 11:50:42.923552 7f97f0b02780 -1 *** Caught signal (Segmentation fault) **
 in thread 7f97f0b02780
 ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
 1: /usr/bin/ceph-mon() [0x90b642]
 2: (()+0xf0a0) [0x7f97f02430a0]
 3: /usr/bin/ceph-mon() [0x8872dd]
 4: (crush_do_rule()+0x39e) [0x887a7e]
 5: (CrushWrapper::do_rule(int, int, std::vector<int, std::allocator<int> >&, int, std::vector<unsigned int, std::allocator<unsigned int> > const&) const+0x88) [0x7ecb48]
 6: (OSDMap::_pg_to_osds(pg_pool_t const&, pg_t, std::vector<int, std::allocator<int> >*, int*, unsigned int*) const+0x79) [0x7daec9]
 7: (OSDMap::pg_to_raw_up(pg_t, std::vector<int, std::allocator<int> >*, int*) const+0xa4) [0x7db364]
 8: (OSDMap::remove_redundant_temporaries(CephContext*, OSDMap const&, OSDMap::Incremental*)+0x251) [0x7db661]
 9: (OSDMonitor::create_pending()+0xca) [0x6889ea]
 10: (PaxosService::_active()+0x649) [0x6683f9]
 11: (PaxosService::election_finished()+0x7c) [0x6685ec]
 12: (Monitor::win_election(unsigned int, std::set<int, std::less<int>, std::allocator<int> >&, unsigned long, MonCommand const*, int, std::set<int, std::less<int>, std::allocator<int> > const*)+0x216) [0x630396]
 13: (Monitor::win_standalone_election()+0x165) [0x6306b5]
 14: (Monitor::bootstrap()+0xa5b) [0x63122b]
 15: (Monitor::init()+0x8c) [0x63307c]
 16: (main()+0x1fec) [0x5f38ec]
 17: (__libc_start_main()+0xfd) [0x7f97eeda0ead]
 18: /usr/bin/ceph-mon() [0x5f6bb9]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
-64> 2014-12-02 11:50:42.749326 7f97f0b02780 5 asok(0x1f52d20) register_command perfcounters_dump hook 0x1f40010
-63> 2014-12-02 11:50:42.749351 7f97f0b02780 5 asok(0x1f52d20) register_command 1 hook 0x1f40010
-62> 2014-12-02 11:50:42.749356 7f97f0b02780 5 asok(0x1f52d20) register_command perf dump hook 0x1f40010
-61> 2014-12-02 11:50:42.749362 7f97f0b02780 5 asok(0x1f52d20) register_command perfcounters_schema hook 0x1f40010
-60> 2014-12-02 11:50:42.749365 7f97f0b02780 5 asok(0x1f52d20) register_command 2 hook 0x1f40010
-59> 2014-12-02 11:50:42.749368 7f97f0b02780 5 asok(0x1f52d20) register_command perf schema hook 0x1f40010
-58> 2014-12-02 11:50:42.749372 7f97f0b02780 5 asok(0x1f52d20) register_command config show hook 0x1f40010
-57> 2014-12-02 11:50:42.749375 7f97f0b02780 5 asok(0x1f52d20) register_command config set hook 0x1f40010
-56> 2014-12-02 11:50:42.749379 7f97f0b02780 5 asok(0x1f52d20) register_command config get hook 0x1f40010
-55> 2014-12-02 11:50:42.749382 7f97f0b02780 5 asok(0x1f52d20) register_command log flush hook 0x1f40010
-54> 2014-12-02 11:50:42.749397 7f97f0b02780 5 asok(0x1f52d20) register_command log dump hook 0x1f40010
-53> 2014-12-02 11:50:42.749401 7f97f0b02780 5 asok(0x1f52d20) register_command log reopen hook 0x1f40010
-52> 2014-12-02 11:50:42.752557 7f97f0b02780 0 ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3), process ceph-mon, pid 8596
-51> 2014-12-02 11:50:42.756688 7f97f0b02780 5 asok(0x1f52d20) init /var/run/ceph/ceph-mon.staging-rd0-00.asok
-50> 2014-12-02 11:50:42.756710 7f97f0b02780 5 asok(0x1f52d20) bind_and_listen /var/run/ceph/ceph-mon.staging-rd0-00.asok
-49> 2014-12-02 11:50:42.756856 7f97f0b02780 5 asok(0x1f52d20) register_command 0 hook 0x1f3e0b8
-48> 2014-12-02 11:50:42.756866 7f97f0b02780 5 asok(0x1f52d20) register_command version hook 0x1f3e0b8
-47> 2014-12-02 11:50:42.756870 7f97f0b02780 5 asok(0x1f52d20) register_command git_version hook 0x1f3e0b8
-46> 2014-12-02 11:50:42.756874 7f97f0b02780 5 asok(0x1f52d20) register_command help hook 0x1f400a0
-45> 2014-12-02 11:50:42.756878 7f97f0b02780 5 asok(0x1f52d20) register_command get_command_descriptions hook 0x1f40140
-44> 2014-12-02 11:50:42.756943 7f97ec9be700 5 asok(0x1f52d20) entry start
-43> 2014-12-02 11:50:42.772103 7f97f0b02780 0 starting mon.staging-rd0-00 rank 0 at 62.217.119.10:6789/0 mon_data /rados/mon/mon_staging-rd0-00 fsid 2c91375c-6926-4a96-a2b6-f154fbbe70d4
-42> 2014-12-02 11:50:42.772202 7f97f0b02780 1 -- 62.217.119.10:6789/0 learned my addr 62.217.119.10:6789/0
-41> 2014-12-02 11:50:42.772217 7f97f0b02780 1 accepter.accepter.bind my_inst.addr is 62.217.119.10:6789/0 need_addr=0
-40> 2014-12-02 11:50:42.772243 7f97f0b02780 5 adding auth protocol: none
-39> 2014-12-02 11:50:42.772247 7f97f0b02780 5 adding auth protocol: none
-38> 2014-12-02 11:50:42.772321 7f97f0b02780 1 mon.staging-rd0-00@-1(probing) e11 preinit fsid 2c91375c-6926-4a96-a2b6-f154fbbe70d4
-37> 2014-12-02 11:50:42.788977 7f97f0b02780 1 mon.staging-rd0-00@-1(probing).paxosservice(pgmap 10676090..10676711) refresh upgraded, format 0 -> 1
-36> 2014-12-02 11:50:42.789000 7f97f0b02780 1 mon.staging-rd0-00@-1(probing).pg v0 on_upgrade discarding in-core PGMap
-35> 2014-12-02 11:50:42.915470 7f97f0b02780 0 mon.staging-rd0-00@-1(probing).mds e1 print_map epoch 1 flags 0 created 2013-02-18 10:41:12.941121 modified 2013-02-18 10:41:12.941163 tableserver 0 root 0 session_timeout 60 session_autoclose 300 max_file_size 1099511627776 last_failure 0 last_failure_osd_epoch 0 compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object} max_mds 1 in up {} failed stopped data_pools 0 metadata_pool 1 inline_data disabled
-34> 2014-12-02 11:50:42.917880 7f97f0b02780 1 mon.staging-rd0-00@-1(probing).paxosservice(auth 31517..31635) refresh upgraded, format 0 -> 1
-33> 2014-12-02 11:50:42.918351 7f97f0b02780 5 asok(0x1f52d20) register_command mon_status hook 0x1f405f0
-32> 2014-12-02 11:50:42.918364 7f97f0b02780 5 asok(0x1f52d20) register_command quorum_status hook 0x1f405f0
-31> 2014-12-02 11:50:42.918369 7f97f0b02780 5 asok(0x1f52d20) register_command sync_force hook 0x1f405f0
-30> 2014-12-02 11:50:42.918372 7f97f0b02780 5 asok(0x1f52d20) register_command add_bootstrap_peer_hint hook 0x1f405f0
-29> 2014-12-02 11:50:42.918376 7f97f0b02780 5 asok(0x1f52d20) register_command quorum enter hook 0x1f405f0
-28> 2014-12-02 11:50:42.918379 7f97f0b02780 5 asok(0x1f52d20) register_command quorum exit hook 0x1f405f0
-27> 2014-12-02 11:50:42.918393 7f97f0b02780 1 finished global_init_daemonize
-26> 2014-12-02 11:50:42.918408 7f97f0b02780 1 -- 62.217.119.10:6789/0 messenger.start
-25> 2014-12-02 11:50:42.918555 7f97f0b02780 1 accepter.accepter.start
-24> 2014-12-02 11:50:42.918598 7f97f0b02780 0 mon.staging-rd0-00@-1(probing) e11 my rank is now 0 (was -1)
-23> 2014-12-02 11:50:42.918607 7f97f0b02780 1 -- 62.217.119.10:6789/0 mark_down_all
-22> 2014-12-02 11:50:42.918623 7f97f0b02780 1 mon.staging-rd0-00@0(probing) e11 win_standalone_election
-21> 2014-12-02 11:50:42.919188 7f97f0b02780 0 log [INF] : mon.staging-rd0-00@0 won leader election with quorum 0
-20> 2014-12-02 11:50:42.919196 7f97f0b02780 10 send_log to self
-19> 2014-12-02 11:50:42.919198 7f97f0b02780 10 log_queue is 1 last_log 1 sent 0 num 1 unsent 1 sending 1
-18> 2014-12-02 11:50:42.919202 7f97f0b02780 10 will send 2014-12-02 11:50:42.919196 mon.0 62.217.119.10:6789/0 1 : [INF] mon.staging-rd0-00@0 won leader election with quorum 0
-17> 2014-12-02 11:50:42.919219 7f97f0b02780 1 -- 62.217.119.10:6789/0 --> mon.0 62.217.119.10:6789/0 -- log(1 entries) v1 -- ?+0 0x1fee900
-16> 2014-12-02 11:50:42.919273 7f97e8b41700 1 -- 62.217.119.10:6789/0 <== mon.0 62.217.119.10:6789/0 0 ==== log(1 entries) v1 ==== 0+0+0 (0 0 0) 0x1fee900 con 0x1f9e420
-15> 2014-12-02 11:50:42.919333 7f97f0b02780 0 log [INF] : monmap e11: 1 mons at {staging-rd0-00=62.217.119.10:6789/0}
-14> 2014-12-02 11:50:42.919338 7f97f0b02780 10 send_log to self
-13> 2014-12-02 11:50:42.919340 7f97f0b02780 10 log_queue is 2 last_log 2 sent 1 num 2 unsent 1 sending 1
-12> 2014-12-02 11:50:42.919345 7f97f0b02780 10 will send 2014-12-02 11:50:42.919338 mon.0 62.217.119.10:6789/0 2 : [INF] monmap e11: 1 mons at {staging-rd0-00=62.217.119.10:6789/0}
-11> 2014-12-02 11:50:42.919363 7f97f0b02780 1 -- 62.217.119.10:6789/0 --> mon.0 62.217.119.10:6789/0 -- log(1 entries) v1 -- ?+0 0x1fee6c0
-10> 2014-12-02 11:50:42.919485 7f97f0b02780 0 log [INF] : pgmap v10676711: 10240 pgs: 67 stale+active+remapped, 3416 stale+active+clean, 42 active+remapped, 2640 stale+incomplete, 4075 stale+active+degraded; 498 GB data, 1329 GB used, 76132 GB / 77461 GB avail; 719/2885120 objects degraded (0.025%)
-9> 2014-12-02 11:50:42.919493 7f97f0b02780 10 send_log to self
-8> 2014-12-02 11:50:42.919495 7f97f0b02780 10 log_queue is 3 last_log 3 sent 2 num 3 unsent 1 sending 1
-7> 2014-12-02 11:50:42.919497 7f97f0b02780 10 will send 2014-12-02 11:50:42.919493 mon.0 62.217.119.10:6789/0 3 : [INF] pgmap v10676711: 10240 pgs: 67 stale+active+remapped, 3416 stale+active+clean, 42 active+remapped, 2640 stale+incomplete, 4075 stale+active+degraded; 498 GB data, 1329 GB used, 76132 GB / 77461 GB avail; 719/2885120 objects degraded (0.025%)
-6> 2014-12-02 11:50:42.919507 7f97f0b02780 1 -- 62.217.119.10:6789/0 --> mon.0 62.217.119.10:6789/0 -- log(1 entries) v1 -- ?+0 0x1fee480
-5> 2014-12-02 11:50:42.919564 7f97f0b02780 0 log [INF] : mdsmap e1: 0/0/1 up
-4> 2014-12-02 11:50:42.919567 7f97f0b02780 10 send_log to self
-3> 2014-12-02 11:50:42.919568 7f97f0b02780 10 log_queue is 4 last_log 4 sent 3 num 4 unsent 1 sending 1
-2> 2014-12-02 11:50:42.919570 7f97f0b02780 10 will send 2014-12-02 11:50:42.919567 mon.0 62.217.119.10:6789/0 4 : [INF] mdsmap e1: 0/0/1 up
-1> 2014-12-02 11:50:42.919579 7f97f0b02780 1 -- 62.217.119.10:6789/0 --> mon.0 62.217.119.10:6789/0 -- log(1 entries) v1 -- ?+0 0x1fee240
0> 2014-12-02 11:50:42.923552 7f97f0b02780 -1 *** Caught signal (Segmentation fault) **
 in thread 7f97f0b02780
 ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
 1: /usr/bin/ceph-mon() [0x90b642]
 2: (()+0xf0a0) [0x7f97f02430a0]
 3: /usr/bin/ceph-mon() [0x8872dd]
 4: (crush_do_rule()+0x39e) [0x887a7e]
 5: (CrushWrapper::do_rule(int, int, std::vector<int, std::allocator<int> >&, int, std::vector<unsigned int, std::allocator<unsigned int> > const&) const+0x88) [0x7ecb48]
 6: (OSDMap::_pg_to_osds(pg_pool_t const&, pg_t, std::vector<int, std::allocator<int> >*, int*, unsigned int*) const+0x79) [0x7daec9]
 7: (OSDMap::pg_to_raw_up(pg_t, std::vector<int, std::allocator<int> >*, int*) const+0xa4) [0x7db364]
 8: (OSDMap::remove_redundant_temporaries(CephContext*, OSDMap const&, OSDMap::Incremental*)+0x251) [0x7db661]
 9: (OSDMonitor::create_pending()+0xca) [0x6889ea]
 10: (PaxosService::_active()+0x649) [0x6683f9]
 11: (PaxosService::election_finished()+0x7c) [0x6685ec]
 12: (Monitor::win_election(unsigned int, std::set<int, std::less<int>, std::allocator<int> >&, unsigned long, MonCommand const*, int, std::set<int, std::less<int>, std::allocator<int> > const*)+0x216) [0x630396]
 13: (Monitor::win_standalone_election()+0x165) [0x6306b5]
 14: (Monitor::bootstrap()+0xa5b) [0x63122b]
 15: (Monitor::init()+0x8c) [0x63307c]
 16: (main()+0x1fec) [0x5f38ec]
 17: (__libc_start_main()+0xfd) [0x7f97eeda0ead]
 18: /usr/bin/ceph-mon() [0x5f6bb9]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   1/ 1 ms
   1/ 1 mon
   0/10 monc
   1/ 1 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent 10000
  max_new 1000
  log_file /var/log/ceph/ceph-mon.staging-rd0-00.log
--- end dump of recent events ---

This is a staging environment, but since we are trying to work out a procedure for a much larger migration, we need to be absolutely sure of what happened and to correct our procedures. We would like to know whether anyone has any suggestion on how to proceed and how to find out what happened to the cluster.
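For what it is worth, our working assumption is that the new crushmap itself could be inspected and dry-run offline with crushtool, roughly along the lines below. The file name and ruleset/replica numbers are placeholders for whatever map and pool settings we actually have, and we have not verified that such a dry run exercises the path that makes crush_do_rule crash:

    # Decompile the compiled map we injected (placeholder file name) to review it by eye
    crushtool -d crushmap.new.bin -o crushmap.new.txt

    # Dry-run a ruleset against the map to see how PGs would map to OSDs
    # (ruleset number and replica count are placeholders for our pools' settings)
    crushtool -i crushmap.new.bin --test --rule 1 --num-rep 2 --show-mappings
    crushtool -i crushmap.new.bin --test --rule 1 --num-rep 2 --show-bad-mappings

Does that sound like a sensible starting point, or is there a better way to examine the maps the mons are choking on?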