Hello!

Unfortunately, our single-node "cluster" with 11 OSDs is broken because some OSDs crash when they start peering. I'm on Ubuntu 18.04 with Ceph Mimic (13.2.2). The problem was triggered when RAM filled up and OSD processes then crashed because of memory allocation failures. No weird commands (e.g. force_create_pg) were used on this cluster, and it was initially set up with 13.2.1. The affected pool seems to be a replicated pool with size=3 and min_size=2 (which haven't been changed).

Crash log of osd.4 (only the crashed thread):

99424: -1577> 2018-10-04 13:40:11.024 7f3838417700 10 log is not dirty
99425: -1576> 2018-10-04 13:40:11.024 7f3838417700 10 osd.4 1433 queue_want_up_thru want 1433 <= queued 1433, currently 1426
99427: -1574> 2018-10-04 13:40:11.024 7f3838417700 20 osd.4 op_wq(3) _process 3.8 to_process <> waiting <> waiting_peering {}
99428: -1573> 2018-10-04 13:40:11.024 7f3838417700 20 osd.4 op_wq(3) _process OpQueueItem(3.8 PGPeeringEvent(epoch_sent: 1433 epoch_requested: 1433 MNotifyRec 3.8 from 2 notify: (query:1433 sent:1433 3.8( v 866'122691 (569'119300,866'122691] local-lis/les=1401/1402 n=54053 ec=126/126 lis/c 1401/859 les/c/f 1402/860/0 1433/1433/1433)) features: 0x3ffddff8ffa4fffb ([859,1432] intervals=([1213,1215] acting 0,2),([1308,1311] acting 4,10),([1401,1403] acting 2,10),([1426,1428] acting 2,4)) +create_info) prio 255 cost 10 e1433) queued
99430: -1571> 2018-10-04 13:40:11.024 7f3838417700 20 osd.4 op_wq(3) _process 3.8 to_process <OpQueueItem(3.8 PGPeeringEvent(epoch_sent: 1433 epoch_requested: 1433 MNotifyRec 3.8 from 2 notify: (query:1433 sent:1433 3.8( v 866'122691 (569'119300,866'122691] local-lis/les=1401/1402 n=54053 ec=126/126 lis/c 1401/859 les/c/f 1402/860/0 1433/1433/1433)) features: 0x3ffddff8ffa4fffb ([859,1432] intervals=([1213,1215] acting 0,2),([1308,1311] acting 4,10),([1401,1403] acting 2,10),([1426,1428] acting 2,4)) +create_info) prio 255 cost 10 e1433)> waiting <> waiting_peering {}
99433: -1568> 2018-10-04 13:40:11.024 7f3838417700 20 osd.4 op_wq(3) _process OpQueueItem(3.8 PGPeeringEvent(epoch_sent: 1433 epoch_requested: 1433 MNotifyRec 3.8 from 2 notify: (query:1433 sent:1433 3.8( v 866'122691 (569'119300,866'122691] local-lis/les=1401/1402 n=54053 ec=126/126 lis/c 1401/859 les/c/f 1402/860/0 1433/1433/1433)) features: 0x3ffddff8ffa4fffb ([859,1432] intervals=([1213,1215] acting 0,2),([1308,1311] acting 4,10),([1401,1403] acting 2,10),([1426,1428] acting 2,4)) +create_info) prio 255 cost 10 e1433) pg 0x56013bc87400
99437: -1564> 2018-10-04 13:40:11.024 7f3838417700 10 osd.4 pg_epoch: 1433 pg[3.8( v 866'127774 (866'124700,866'127774] local-lis/les=859/860 n=56570 ec=126/126 lis/c 1401/859 les/c/f 1402/860/0 1433/1433/1433) [4,2] r=0 lpr=1433 pi=[859,1433)/4 crt=866'127774 lcod 0'0 mlcod 0'0 peering mbc={}] do_peering_event: epoch_sent: 1433 epoch_requested: 1433 MNotifyRec 3.8 from 2 notify: (query:1433 sent:1433 3.8( v 866'122691 (569'119300,866'122691] local-lis/les=1401/1402 n=54053 ec=126/126 lis/c 1401/859 les/c/f 1402/860/0 1433/1433/1433)) features: 0x3ffddff8ffa4fffb ([859,1432] intervals=([1213,1215] acting 0,2),([1308,1311] acting 4,10),([1401,1403] acting 2,10),([1426,1428] acting 2,4)) +create_info
99440: -1561> 2018-10-04 13:40:11.024 7f3838417700 7 osd.4 pg_epoch: 1433 pg[3.8( v 866'127774 (866'124700,866'127774] local-lis/les=859/860 n=56570 ec=126/126 lis/c 1401/859 les/c/f 1402/860/0 1433/1433/1433) [4,2] r=0 lpr=1433 pi=[859,1433)/4 crt=866'127774 lcod 0'0 mlcod 0'0 peering mbc={}] state<Started/Primary>: handle_pg_notify from osd.2
99444: -1557> 2018-10-04 13:40:11.024 7f3838417700 10 osd.4 pg_epoch: 1433 pg[3.8( v 866'127774 (866'124700,866'127774] local-lis/les=859/860 n=56570 ec=126/126 lis/c 1401/859 les/c/f 1402/860/0 1433/1433/1433) [4,2] r=0 lpr=1433 pi=[859,1433)/4 crt=866'127774 lcod 0'0 mlcod 0'0 peering mbc={}] got dup osd.2 info 3.8( v 866'122691 (569'119300,866'122691] local-lis/les=1401/1402 n=54053 ec=126/126 lis/c 1401/859 les/c/f 1402/860/0 1433/1433/1433), identical to ours
99445: -1556> 2018-10-04 13:40:11.024 7f3838417700 10 log is not dirty
99446: -1555> 2018-10-04 13:40:11.024 7f3838417700 10 osd.4 1433 queue_want_up_thru want 1433 <= queued 1433, currently 1426
99448: -1553> 2018-10-04 13:40:11.024 7f3838417700 20 osd.4 op_wq(3) _process 3.8 to_process <> waiting <> waiting_peering {}
99450: -1551> 2018-10-04 13:40:11.024 7f3838417700 20 osd.4 op_wq(3) _process OpQueueItem(3.8 PGPeeringEvent(epoch_sent: 1433 epoch_requested: 1433 MLogRec from 2 +create_info) prio 255 cost 10 e1433) queued
99456: -1545> 2018-10-04 13:40:11.024 7f3838417700 20 osd.4 op_wq(3) _process 3.8 to_process <OpQueueItem(3.8 PGPeeringEvent(epoch_sent: 1433 epoch_requested: 1433 MLogRec from 2 +create_info) prio 255 cost 10 e1433)> waiting <> waiting_peering {}
99458: -1543> 2018-10-04 13:40:11.024 7f3838417700 20 osd.4 op_wq(3) _process OpQueueItem(3.8 PGPeeringEvent(epoch_sent: 1433 epoch_requested: 1433 MLogRec from 2 +create_info) prio 255 cost 10 e1433) pg 0x56013bc87400
99461: -1540> 2018-10-04 13:40:11.024 7f3838417700 10 osd.4 pg_epoch: 1433 pg[3.8( v 866'127774 (866'124700,866'127774] local-lis/les=859/860 n=56570 ec=126/126 lis/c 1401/859 les/c/f 1402/860/0 1433/1433/1433) [4,2] r=0 lpr=1433 pi=[859,1433)/4 crt=866'127774 lcod 0'0 mlcod 0'0 peering mbc={}] do_peering_event: epoch_sent: 1433 epoch_requested: 1433 MLogRec from 2 +create_info
99465: -1536> 2018-10-04 13:40:11.024 7f3838417700 10 osd.4 pg_epoch: 1433 pg[3.8( v 866'127774 (866'124700,866'127774] local-lis/les=859/860 n=56570 ec=126/126 lis/c 1401/859 les/c/f 1402/860/0 1433/1433/1433) [4,2] r=0 lpr=1433 pi=[859,1433)/4 crt=866'127774 lcod 0'0 mlcod 0'0 peering mbc={}] state<Started/Primary/Peering/GetLog>: GetLog: received master log from osd2
99467: -1534> 2018-10-04 13:40:11.024 7f3838417700 10 osd.4 pg_epoch: 1433 pg[3.8( v 866'127774 (866'124700,866'127774] local-lis/les=859/860 n=56570 ec=126/126 lis/c 1401/859 les/c/f 1402/860/0 1433/1433/1433) [4,2] r=0 lpr=1433 pi=[859,1433)/4 crt=866'127774 lcod 0'0 mlcod 0'0 peering mbc={}] state<Started/Primary/Peering/GetLog>: leaving GetLog
99473: -1529> 2018-10-04 13:40:11.024 7f3838417700 10 osd.4 pg_epoch: 1433 pg[3.8( v 866'127774 (866'124700,866'127774] local-lis/les=859/860 n=56570 ec=126/126 lis/c 1401/859 les/c/f 1402/860/0 1433/1433/1433) [4,2] r=0 lpr=1433 pi=[859,1433)/4 crt=866'127774 lcod 0'0 mlcod 0'0 peering mbc={}] state<Started/Primary/Peering/GetLog>: processing master log
99477: -1525> 2018-10-04 13:40:11.024 7f3838417700 10 osd.4 pg_epoch: 1433 pg[3.8( v 866'127774 (866'124700,866'127774] local-lis/les=859/860 n=56570 ec=126/126 lis/c 1401/859 les/c/f 1402/860/0 1433/1433/1433) [4,2] r=0 lpr=1433 pi=[859,1433)/4 crt=866'127774 lcod 0'0 mlcod 0'0 peering mbc={}] proc_master_log for osd.2: log((866'122691,866'122691], crt=866'122691) missing(0 may_include_deletes = 1)
99480: -1522> 2018-10-04 13:40:11.024 7f3838417700 10 merge_log log((866'122691,866'122691], crt=866'122691) from osd.2 into log((866'124700,866'127774], crt=866'127774)
101018: -1> 2018-10-04 13:40:11.028 7f3838417700 -1 /home/jj/cephsrc/ceph-13.2.2/src/osd/PGLog.cc: In function 'void PGLog::merge_log(pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)' thread 7f3838417700 time 2018-10-04 13:40:11.027763
101109:2018-10-04 13:40:11.084 7f3838417700 -1 *** Caught signal (Aborted) **
 -1> 2018-10-04 13:40:11.028 7f3838417700 -1 /home/jj/cephsrc/ceph-13.2.2/src/osd/PGLog.cc: In function 'void PGLog::merge_log(pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool
/home/jj/cephsrc/ceph-13.2.2/src/osd/PGLog.cc: 345: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)
 ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7f385abdc822]
 2: (()+0x26c9e7) [0x7f385abdc9e7]
 3: (PGLog::merge_log(pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)+0x1ad0) [0x5601185c7f50]
 4: (PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t)+0x64) [0x560118525194]
 5: (PG::proc_master_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_missing_set<false>&, pg_shard_t)+0x8b) [0x560118566abb]
 6: (PG::RecoveryState::GetLog::react(PG::RecoveryState::GotLog const&)+0xab) [0x560118566e4b]
 7: (boost::statechart::simple_state<PG::RecoveryState::GetLog, PG::RecoveryState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl
 8: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0xb3) [0x56011858
 9: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_ba
 10: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PG::RecoveryCtx*)+0x143) [0x560118573a53]
 11: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0xcf) [0x5601184b450f]
 12: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x50) [0x560118720820]
 13: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x598) [0x5601184c4148]
 14: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x46e) [0x7f385abe175e]
 15: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f385abe37e0]
 16: (()+0x76db) [0x7f38592bf6db]
 17: (clone()+0x3f) [0x7f385827188f]

So it seems the contact to osd.2 for pg 3.8 leads to the crash.
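If I read the failed assert right, the two log ranges in the merge_log line above simply do not overlap: the log osd.2 hands over ends at 866'122691, while osd.4's local log only starts at 866'124700. Here is a tiny standalone C++ sketch of that check from PGLog.cc:345, using a simplified stand-in for eversion_t (not Ceph's real type; I'm assuming the epoch'version values compare epoch-first):

#include <cstdint>
#include <iostream>
#include <tuple>

// Simplified stand-in for eversion_t: an epoch'version pair,
// compared epoch-first, then by version.
struct Ver {
    uint64_t epoch;
    uint64_t version;
    bool operator>=(const Ver &o) const {
        return std::tie(epoch, version) >= std::tie(o.epoch, o.version);
    }
};

int main() {
    // osd.4's local log from the merge_log line: (866'124700, 866'127774]
    Ver log_tail{866, 124700}, log_head{866, 127774};
    // log received from osd.2:                   (866'122691, 866'122691]
    Ver olog_tail{866, 122691}, olog_head{866, 122691};

    // The condition asserted in PGLog::merge_log (PGLog.cc:345):
    // the two logs must overlap so they can be merged.
    bool overlap = (log_head >= olog_tail) && (olog_head >= log_tail);
    std::cout << (overlap ? "logs overlap" : "no overlap -> assert fires") << "\n";
    // prints "no overlap -> assert fires", because 866'122691 < 866'124700
    return 0;
}

In other words, the log osd.4 receives from osd.2 ends before osd.4's own log even begins, so merge_log has no shared history to merge against.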
The affected pool:

pool 3 'ssn_root' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 147 flags hashpspool stripe_width 0 application cephfs

The pg dump for 3.8:

3.8 56570 0 0 0 0 0 3074 3074 down 2018-10-04 18:21:59.450788 866'127774 2554:129697 [4] 4 [4] 4 541'94186 2018-10-02 21:57:31.049984 427'34480 2018-09-22 18:09:40.485861 0

When I stop osd.2, osd.4 crashes with:

-20> 2018-10-04 18:20:20.624 7f0168bef700 10 osd.4 pg_epoch: 2540 pg[3.8( v 866'127774 (866'124700,866'127774] local-lis/les=859/860 n=56570 ec=126/126 lis/c 1401/859 les/c/f 1402/860/0 2540/2540/2537) [4,10] r=0 lpr=2540 pi=[859,2540)/4 crt=866'127774 lcod 0'0 mlcod 0'0 peering mbc={}] proc_master_log for osd.10: log((866'122691,866'122691], crt=866'122691) missing(382 may_include_deletes = 1)
-19> 2018-10-04 18:20:20.624 7f0168bef700 10 merge_log log((866'122691,866'122691], crt=866'122691) from osd.10 into log((866'124700,866'127774], crt=866'127774)
[same traceback]

So I stop osd.10 as well. Then osd.4 keeps running. pg 3.8 query says this:

"recovery_state": [
    {
        "name": "Started/Primary/Peering/Down",
        "enter_time": "2018-10-04 18:21:59.450774",
        "comment": "not enough up instances of this PG to go active"
    },
    {
        "name": "Started/Primary/Peering",
        "enter_time": "2018-10-04 18:21:59.450460",
        "past_intervals": [
            {
                "first": "859",
                "last": "2550",
                "all_participants": [
                    { "osd": 0 },
                    { "osd": 2 },
                    { "osd": 4 },
                    { "osd": 10 }
                ],
                [...]
        "probing_osds": [
            "4"
        ],
        "blocked": "peering is blocked due to down osds",
        "down_osds_we_would_probe": [
            0,
            2,
            10
        ],
        "peering_blocked_by": [
            {
                "osd": 2,
                "current_lost_at": 0,
                "comment": "starting or marking this osd lost may let us proceed"
            },
            {
                "osd": 10,
                "current_lost_at": 0,
                "comment": "starting or marking this osd lost may let us proceed"
            }
        ]

What shall I do? :)

Cheers,
JJ

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com