Hi, to answer my question and for the record: It turned out that the "device_health_metrics" pool was using PG 7.0 which had no objects after removing the pool but the PG was somehow not deleted/removed. [root@cephosd-05 ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-98 --pgid 7.0 --op list Did not show any objects. Issuing the command ( thanks Elise for sharing your experience ) on all three osds, successfully removed the PG. [root@cephosd-05 ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-98 --op export-remove --pgid 7.0 --file pg_7_0_from_osd_98.bin Exporting 7.0 info 7.0( v 9010'82 (0'0,9010'82] lb MIN (bitwise) local-lis/les=9008/9009 n=41 ec=9008/9008 lis/c 9008/9008 les/c/f 9009/9009/0 9019/9019/9019) Export successful marking collection for removal setting '_remove' omap key finish_remove_pgs 7.0_head removing 7.0 Remove successful Now I could start the three OSDs again and the cluster is HEALTHY. I hope this gets fixed soon, meanwhile one should keep this in mind and be careful when trying the ceph device monitoring and deleting the device_health_metrics pool. Best Dietmar On 5/3/19 10:09 PM, Dietmar Rieder wrote: > HI, > > I think I just hit the sam problem on Nautilus 14.2.1 > I tested the ceph device monitoring, which created a new pool > (device_health_metrics), after looking into the monitoring feature, I > turned it off again and removed the pool. This resulted int 3 OSDs down > which can not be started again since they keep crashing. > > How can I reenable the OSDs? > > I think the following is the relevant log. > > 2019-05-03 21:24:05.265 7f8e96b8a700 -1 > bluestore(/var/lib/ceph/osd/ceph-206) _txc_add_transaction error (39) > Directory not empty not handled on operation 21 (op 1, counting from 0) > 2019-05-03 21:24:05.265 7f8e96b8a700 0 > bluestore(/var/lib/ceph/osd/ceph-206) _dump_transaction transaction dump: > { > "ops": [ > { > "op_num": 0, > "op_name": "remove", > "collection": "7.0_head", > "oid": "#7:00000000::::head#" > }, > { > "op_num": 1, > "op_name": "rmcoll", > "collection": "7.0_head" > } > ] > } > > 2019-05-03 21:24:05.269 7f8e96b8a700 -1 > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2 > .1/rpm/el7/BUILD/ceph-14.2.1/src/os/bluestore/BlueStore.cc: In function > 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, > ObjectStore::Transaction*)' thread 7f8e96b8a700 tim > e 2019-05-03 21:24:05.266723 > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/os/blue > store/BlueStore.cc: 11089: abort() > > ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus > (stable) > 1: (ceph::__ceph_abort(char const*, int, char const*, std::string > const&)+0xd8) [0x55766fbd6cd0] > 2: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, > ObjectStore::Transaction*)+0x2a85) [0x5576701b6af5] > 3: > (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, > std::vector<ObjectStore::Transaction, > std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ > ptr<TrackedOp>, ThreadPool::TPHandle*)+0x526) [0x5576701b7866] > 4: > (ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, > ObjectStore::Transaction&&, boost::intrusive_ptr<TrackedOp>, > ThreadPool::TPHandle*)+0x7f) [0x55766f > d9274f] > 5: (PG::_delete_some(ObjectStore::Transaction*)+0x83d) [0x55766fdf577d] > 6: (PG::RecoveryState::Deleting::react(PG::DeleteSome const&)+0x38) > [0x55766fdf6598] > 7: (boost::statechart::simple_state<PG::RecoveryState::Deleting, > PG::RecoveryState::ToDelete, boost::mpl::list<mpl_::na, mpl_::na, > mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_: > :na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, > mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, > (boost::statechart::history_mode)0>::react_impl(boost::statec > hart::event_base const&, void const*)+0x16a) [0x55766fe355ca] > 8: > (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, > PG::RecoveryState::Initial, std::allocator<void>, > boost::statechart::null_exception_translator>::process_event(bo > ost::statechart::event_base const&)+0x5a) [0x55766fe130ca] > 9: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, > PG::RecoveryCtx*)+0x119) [0x55766fe02389] > 10: (OSD::dequeue_peering_evt(OSDShard*, PG*, > std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x1b4) > [0x55766fd3c3c4] > 11: (OSD::dequeue_delete(OSDShard*, PG*, unsigned int, > ThreadPool::TPHandle&)+0x234) [0x55766fd3c804] > 12: (OSD::ShardedOpWQ::_process(unsigned int, > ceph::heartbeat_handle_d*)+0x9f4) [0x55766fd30b44] > 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x433) > [0x55767032ae93] > 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55767032df30] > 15: (()+0x7dd5) [0x7f8eb7162dd5] > 16: (clone()+0x6d) [0x7f8eb6028ead] > > 2019-05-03 21:24:05.274 7f8e96b8a700 -1 *** Caught signal (Aborted) ** > in thread 7f8e96b8a700 thread_name:tp_osd_tp > > ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus > (stable) > 1: (()+0xf5d0) [0x7f8eb716a5d0] > 2: (gsignal()+0x37) [0x7f8eb5f61207] > 3: (abort()+0x148) [0x7f8eb5f628f8] > 4: (ceph::__ceph_abort(char const*, int, char const*, std::string > const&)+0x19c) [0x55766fbd6d94] > 5: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, > ObjectStore::Transaction*)+0x2a85) [0x5576701b6af5] > 6: > (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, > std::vector<ObjectStore::Transaction, > std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ > ptr<TrackedOp>, ThreadPool::TPHandle*)+0x526) [0x5576701b7866] > 7: > (ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, > ObjectStore::Transaction&&, boost::intrusive_ptr<TrackedOp>, > ThreadPool::TPHandle*)+0x7f) [0x55766f > d9274f] > 8: (PG::_delete_some(ObjectStore::Transaction*)+0x83d) [0x55766fdf577d] > 9: (PG::RecoveryState::Deleting::react(PG::DeleteSome const&)+0x38) > [0x55766fdf6598] > 10: (boost::statechart::simple_state<PG::RecoveryState::Deleting, > PG::RecoveryState::ToDelete, boost::mpl::list<mpl_::na, mpl_::na, > mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_ > ::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, > mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, > (boost::statechart::history_mode)0>::react_impl(boost::state > chart::event_base const&, void const*)+0x16a) [0x55766fe355ca] > 11: > (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, > PG::RecoveryState::Initial, std::allocator<void>, > boost::statechart::null_exception_translator>::process_event(b > oost::statechart::event_base const&)+0x5a) [0x55766fe130ca] > 12: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, > PG::RecoveryCtx*)+0x119) [0x55766fe02389] > 13: (OSD::dequeue_peering_evt(OSDShard*, PG*, > std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x1b4) > [0x55766fd3c3c4] > 14: (OSD::dequeue_delete(OSDShard*, PG*, unsigned int, > ThreadPool::TPHandle&)+0x234) [0x55766fd3c804] > 15: (OSD::ShardedOpWQ::_process(unsigned int, > ceph::heartbeat_handle_d*)+0x9f4) [0x55766fd30b44] > 16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x433) > [0x55767032ae93] > 17: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55767032df30] > 18: (()+0x7dd5) [0x7f8eb7162dd5] > 19: (clone()+0x6d) [0x7f8eb6028ead] > NOTE: a copy of the executable, or `objdump -rdS <executable>` is > needed to interpret this. > > > ~Best > Dietmar > > On 4/29/19 11:05 PM, Gregory Farnum wrote: >> Glad you got it working and thanks for the logs! Looks like we've seen >> this once or twice before so I added them to >> https://tracker.ceph.com/issues/38724. >> -Greg >> >> On Fri, Apr 26, 2019 at 5:52 PM Elise Burke <elise.null@xxxxxxxxx> wrote: >>> >>> Thanks for the pointer to ceph-objectstore-tool, it turns out that removing and exporting the PG from all three disks was enough to make it boot! I've exported the three copies of the bad PG, let me know if you'd like me to upload them anywhere for inspection. >>> >>> All data has been recovered (since I was originally removing the pool that contained pg 25.0 anyway) and all systems are go on my end. Ceph's architecture is very solid. >>> >>> >>> $ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-6 --op export-remove --pgid 25.0 --file pg_25_0_from_osd_6.bin >>> Exporting 25.0 info 25.0( v 7592'106 (0'0,7592'106] lb MIN (bitwise) local-lis/les=7488/7489 n=9 ec=5191/5191 lis/c 7488/7488 les/c/f 7489/7489/0 7593/7593/7593) >>> Export successful >>> marking collection for removal >>> setting '_remove' omap key >>> finish_remove_pgs 25.0_head removing 25.0 >>> Remove successful >>> >>> >>> On Fri, Apr 26, 2019 at 8:33 PM Elise Burke <elise.null@xxxxxxxxx> wrote: >>>> >>>> Using ceph-objectstore-info on PG 25.0 (which indeed, was the one I remember having the error) shows this: >>>> >>>> struct_v 10 >>>> { >>>> "pgid": "25.0", >>>> "last_update": "7592'106", >>>> "last_complete": "7592'106", >>>> "log_tail": "0'0", >>>> "last_user_version": 106, >>>> "last_backfill": "MIN", >>>> "last_backfill_bitwise": 1, >>>> "purged_snaps": [], >>>> "history": { >>>> "epoch_created": 5191, >>>> "epoch_pool_created": 5191, >>>> "last_epoch_started": 7489, >>>> "last_interval_started": 7488, >>>> "last_epoch_clean": 7489, >>>> "last_interval_clean": 7488, >>>> "last_epoch_split": 0, >>>> "last_epoch_marked_full": 0, >>>> "same_up_since": 7593, >>>> "same_interval_since": 7593, >>>> "same_primary_since": 7593, >>>> "last_scrub": "7592'106", >>>> "last_scrub_stamp": "2019-04-25 21:34:52.079721", >>>> "last_deep_scrub": "7485'70", >>>> "last_deep_scrub_stamp": "2019-04-22 10:15:40.532014", >>>> "last_clean_scrub_stamp": "2019-04-19 14:52:44.047548" >>>> }, >>>> "stats": { >>>> "version": "7592'105", >>>> "reported_seq": "2621", >>>> "reported_epoch": "7592", >>>> "state": "active+clean+inconsistent", >>>> "last_fresh": "2019-04-25 20:02:55.620028", >>>> "last_change": "2019-04-24 19:52:45.072473", >>>> "last_active": "2019-04-25 20:02:55.620028", >>>> "last_peered": "2019-04-25 20:02:55.620028", >>>> "last_clean": "2019-04-25 20:02:55.620028", >>>> "last_became_active": "2019-04-22 17:55:37.578239", >>>> "last_became_peered": "2019-04-22 17:55:37.578239", >>>> "last_unstale": "2019-04-25 20:02:55.620028", >>>> "last_undegraded": "2019-04-25 20:02:55.620028", >>>> "last_fullsized": "2019-04-25 20:02:55.620028", >>>> "mapping_epoch": 7593, >>>> "log_start": "0'0", >>>> "ondisk_log_start": "0'0", >>>> "created": 5191, >>>> "last_epoch_clean": 7489, >>>> "parent": "0.0", >>>> "parent_split_bits": 0, >>>> "last_scrub": "7592'88", >>>> "last_scrub_stamp": "2019-04-24 19:52:45.072367", >>>> "last_deep_scrub": "7485'70", >>>> "last_deep_scrub_stamp": "2019-04-22 10:15:40.532014", >>>> "last_clean_scrub_stamp": "2019-04-19 14:52:44.047548", >>>> "log_size": 105, >>>> "ondisk_log_size": 105, >>>> "stats_invalid": false, >>>> "dirty_stats_invalid": false, >>>> "omap_stats_invalid": false, >>>> "hitset_stats_invalid": false, >>>> "hitset_bytes_stats_invalid": false, >>>> "pin_stats_invalid": false, >>>> "manifest_stats_invalid": false, >>>> "snaptrimq_len": 0, >>>> "stat_sum": { >>>> "num_bytes": 0, >>>> "num_objects": 9, >>>> "num_object_clones": 0, >>>> "num_object_copies": 27, >>>> "num_objects_missing_on_primary": 0, >>>> "num_objects_missing": 0, >>>> "num_objects_degraded": 0, >>>> "num_objects_misplaced": 0, >>>> "num_objects_unfound": 0, >>>> "num_objects_dirty": 9, >>>> "num_whiteouts": 0, >>>> "num_read": 87, >>>> "num_read_kb": 87, >>>> "num_write": 98, >>>> "num_write_kb": 98, >>>> "num_scrub_errors": 0, >>>> "num_shallow_scrub_errors": 0, >>>> "num_deep_scrub_errors": 0, >>>> "num_objects_recovered": 0, >>>> "num_bytes_recovered": 0, >>>> "num_keys_recovered": 0, >>>> "num_objects_omap": 9, >>>> "num_objects_hit_set_archive": 0, >>>> "num_bytes_hit_set_archive": 0, >>>> "num_flush": 0, >>>> "num_flush_kb": 0, >>>> "num_evict": 0, >>>> "num_evict_kb": 0, >>>> "num_promote": 0, >>>> "num_flush_mode_high": 0, >>>> "num_flush_mode_low": 0, >>>> "num_evict_mode_some": 0, >>>> "num_evict_mode_full": 0, >>>> "num_objects_pinned": 0, >>>> "num_legacy_snapsets": 0, >>>> "num_large_omap_objects": 0, >>>> "num_objects_manifest": 0, >>>> "num_omap_bytes": 0, >>>> "num_omap_keys": 0 >>>> }, >>>> "up": [], >>>> "acting": [], >>>> "blocked_by": [], >>>> "up_primary": -1, >>>> "acting_primary": -1, >>>> "purged_snaps": [] >>>> }, >>>> "empty": 0, >>>> "dne": 0, >>>> "incomplete": 1, >>>> "last_epoch_started": 7489, >>>> "hit_set_history": { >>>> "current_last_update": "0'0", >>>> "history": [] >>>> } >>>> } >>>> >>>> >>>> On Fri, Apr 26, 2019 at 7:35 PM Elise Burke <elise.null@xxxxxxxxx> wrote: >>>>> >>>>> Thanks for the suggestions. I've uploaded the surprisingly large (1.5G!) log file: ceph-post-file: 2d8d22f4-580b-4b57-a13a-f49dade34ba7 >>>>> >>>>> Looks like these are the relevant lines: >>>>> >>>>> -52> 2019-04-26 19:23:05.190 7fb2657dc700 20 osd.2 op_wq(2) _process empty q, waiting >>>>> -51> 2019-04-26 19:23:05.190 7fb2667de700 20 osd.2 op_wq(0) _process OpQueueItem(25.0 PGDelete(25.0 e7648) prio 5 cost 1048576 e7648) queued >>>>> -50> 2019-04-26 19:23:05.190 7fb2717f4700 20 bluestore(/var/lib/ceph/osd/ceph-2) _kv_finalize_thread sleep >>>>> -49> 2019-04-26 19:23:05.190 7fb2667de700 20 osd.2 op_wq(0) _process 25.0 to_process <OpQueueItem(25.0 PGDelete(25.0 e7648) prio 5 cost 1048576 e7648)> waiting <> waiting_peering {} >>>>> -48> 2019-04-26 19:23:05.190 7fb2667de700 20 osd.2 op_wq(0) _process OpQueueItem(25.0 PGDelete(25.0 e7648) prio 5 cost 1048576 e7648) pg 0x7c20a00000 >>>>> -47> 2019-04-26 19:23:05.190 7fb2667de700 10 osd.2 pg_epoch: 7648 pg[25.0( v 7592'106 (0'0,7592'106] lb MIN (bitwise) local-lis/les=7488/7489 n=9 ec=5191/5191 lis/c 7488/7488 les/c/f 7489/7489/0 7593/7593/7593) [] r=-1 lpr=7648 D >>>>> ELETING pi=[7488,7593)/1 crt=7592'106 lcod 0'0 unknown NOTIFY mbc={}] do_peering_event: epoch_sent: 7648 epoch_requested: 7648 DeleteSome >>>>> -46> 2019-04-26 19:23:05.190 7fb2667de700 10 osd.2 pg_epoch: 7648 pg[25.0( v 7592'106 (0'0,7592'106] lb MIN (bitwise) local-lis/les=7488/7489 n=9 ec=5191/5191 lis/c 7488/7488 les/c/f 7489/7489/0 7593/7593/7593) [] r=-1 lpr=7648 D >>>>> ELETING pi=[7488,7593)/1 crt=7592'106 lcod 0'0 unknown NOTIFY mbc={}] _delete_some >>>>> -45> 2019-04-26 19:23:05.190 7fb2667de700 15 bluestore(/var/lib/ceph/osd/ceph-2) collection_list 25.0_head start GHMIN end GHMAX max 30 >>>>> -44> 2019-04-26 19:23:05.190 7fb2667de700 20 bluestore(/var/lib/ceph/osd/ceph-2) _collection_list range 0x7f7fffffffffffffe500000000 to 0x7f7fffffffffffffe5ffffffff and 0x7f800000000000001900000000 to 0x7f8000000000000019ffffffff >>>>> start GHMIN >>>>> -43> 2019-04-26 19:23:05.191 7fb2667de700 20 bluestore(/var/lib/ceph/osd/ceph-2) _collection_list pend 0x7f7fffffffffffffe5ffffffff >>>>> -42> 2019-04-26 19:23:05.191 7fb2667de700 20 bluestore(/var/lib/ceph/osd/ceph-2) _collection_list key 0x7f7fffffffffffffff100020'P!osdmap.7114!='0x0000000000000000ffffffffffffffff'o' >= GHMAX >>>>> -41> 2019-04-26 19:23:05.191 7fb2667de700 20 bluestore(/var/lib/ceph/osd/ceph-2) _collection_list oid #25:00000000::::head# end GHMAX >>>>> -40> 2019-04-26 19:23:05.191 7fb2667de700 20 bluestore(/var/lib/ceph/osd/ceph-2) _collection_list oid #25:b08b92bd::::head# end GHMAX >>>>> -39> 2019-04-26 19:23:05.191 7fb2667de700 20 bluestore(/var/lib/ceph/osd/ceph-2) _collection_list key 0x80800000000000000c1c00000021213dfffffffffffffffeffffffffffffffff'o' >= GHMAX >>>>> -38> 2019-04-26 19:23:05.191 7fb2667de700 10 bluestore(/var/lib/ceph/osd/ceph-2) collection_list 25.0_head start GHMAX end GHMAX max 30 = 0, ls.size() = 2, next = GHMAX >>>>> -37> 2019-04-26 19:23:05.191 7fb2667de700 20 osd.2 pg_epoch: 7648 pg[25.0( v 7592'106 (0'0,7592'106] lb MIN (bitwise) local-lis/les=7488/7489 n=9 ec=5191/5191 lis/c 7488/7488 les/c/f 7489/7489/0 7593/7593/7593) [] r=-1 lpr=7648 D >>>>> ELETING pi=[7488,7593)/1 crt=7592'106 lcod 0'0 unknown NOTIFY mbc={}] _delete_some [#25:00000000::::head#,#25:b08b92bd::::head#] >>>>> -36> 2019-04-26 19:23:05.191 7fb2667de700 20 osd.2 pg_epoch: 7648 pg[25.0( v 7592'106 (0'0,7592'106] lb MIN (bitwise) local-lis/les=7488/7489 n=9 ec=5191/5191 lis/c 7488/7488 les/c/f 7489/7489/0 7593/7593/7593) [] r=-1 lpr=7648 D >>>>> ELETING pi=[7488,7593)/1 crt=7592'106 lcod 0'0 unknown NOTIFY mbc={}] _delete_some finished >>>>> -35> 2019-04-26 19:23:05.191 7fb2667de700 10 bluestore(/var/lib/ceph/osd/ceph-2) queue_transactions ch 0x7c27d94b40 25.0_head >>>>> -34> 2019-04-26 19:23:05.191 7fb2667de700 20 bluestore(/var/lib/ceph/osd/ceph-2) _txc_create osr 0x7c274f6c60 = 0x7c2ef34c00 seq 3 >>>>> -33> 2019-04-26 19:23:05.191 7fb2667de700 15 bluestore(/var/lib/ceph/osd/ceph-2) _remove 25.0_head #25:00000000::::head# onode 0x7c20aaf200 txc 0x7c2ef34c00 >>>>> -32> 2019-04-26 19:23:05.191 7fb2667de700 15 bluestore(/var/lib/ceph/osd/ceph-2) _do_truncate 25.0_head #25:00000000::::head# 0x0 >>>>> -31> 2019-04-26 19:23:05.191 7fb2667de700 20 bluestore.onode(0x7c20aaf200).flush flush done >>>>> -30> 2019-04-26 19:23:05.191 7fb2667de700 20 bluestore(/var/lib/ceph/osd/ceph-2) _do_omap_clear remove range start: 0x0000000000b0295a'-' end: 0x0000000000b0295a'~' >>>>> -29> 2019-04-26 19:23:05.191 7fb2667de700 10 bluestore(/var/lib/ceph/osd/ceph-2) _remove 25.0_head #25:00000000::::head# = 0 >>>>> -28> 2019-04-26 19:23:05.191 7fb2667de700 15 bluestore(/var/lib/ceph/osd/ceph-2) _remove_collection 25.0_head >>>>> -27> 2019-04-26 19:23:05.191 7fb2667de700 20 bluestore.OnodeSpace(0x7c27d94c98 in 0x7c1f7e10a0) map_any >>>>> -26> 2019-04-26 19:23:05.191 7fb2667de700 20 bluestore(/var/lib/ceph/osd/ceph-2) _collection_list range 0x7f7fffffffffffffe500000000 to 0x7f7fffffffffffffe5ffffffff and 0x7f800000000000001900000000 to 0x7f8000000000000019ffffffff >>>>> start GHMIN >>>>> -25> 2019-04-26 19:23:05.192 7fb2667de700 20 bluestore(/var/lib/ceph/osd/ceph-2) _collection_list pend 0x7f7fffffffffffffe5ffffffff >>>>> -24> 2019-04-26 19:23:05.192 7fb2667de700 20 bluestore(/var/lib/ceph/osd/ceph-2) _collection_list key 0x7f7fffffffffffffff100020'P!osdmap.7114!='0x0000000000000000ffffffffffffffff'o' >= GHMAX >>>>> -23> 2019-04-26 19:23:05.192 7fb2667de700 20 bluestore(/var/lib/ceph/osd/ceph-2) _collection_list oid #25:00000000::::head# end GHMAX >>>>> -22> 2019-04-26 19:23:05.192 7fb2667de700 20 bluestore(/var/lib/ceph/osd/ceph-2) _collection_list oid #25:b08b92bd::::head# end GHMAX >>>>> -21> 2019-04-26 19:23:05.192 7fb2667de700 20 bluestore(/var/lib/ceph/osd/ceph-2) _collection_list key 0x80800000000000000c1c00000021213dfffffffffffffffeffffffffffffffff'o' >= GHMAX >>>>> -20> 2019-04-26 19:23:05.192 7fb2667de700 10 bluestore(/var/lib/ceph/osd/ceph-2) _remove_collection oid #25:00000000::::head# >>>>> -19> 2019-04-26 19:23:05.192 7fb2667de700 10 bluestore(/var/lib/ceph/osd/ceph-2) _remove_collection oid #25:b08b92bd::::head# >>>>> -18> 2019-04-26 19:23:05.192 7fb2667de700 10 bluestore(/var/lib/ceph/osd/ceph-2) _remove_collection #25:b08b92bd::::head# exists in db, not present in ram >>>>> -17> 2019-04-26 19:23:05.192 7fb2667de700 10 bluestore(/var/lib/ceph/osd/ceph-2) _remove_collection 25.0_head is non-empty >>>>> -16> 2019-04-26 19:23:05.192 7fb2667de700 10 bluestore(/var/lib/ceph/osd/ceph-2) _remove_collection 25.0_head = -39 >>>>> -15> 2019-04-26 19:23:05.192 7fb2667de700 -1 bluestore(/var/lib/ceph/osd/ceph-2) _txc_add_transaction error (39) Directory not empty not handled on operation 21 (op 1, counting from 0) >>>>> -14> 2019-04-26 19:23:05.192 7fb2667de700 0 bluestore(/var/lib/ceph/osd/ceph-2) _dump_transaction transaction dump: >>>>> { >>>>> "ops": [ >>>>> { >>>>> "op_num": 0, >>>>> "op_name": "remove", >>>>> "collection": "25.0_head", >>>>> "oid": "#25:00000000::::head#" >>>>> }, >>>>> { >>>>> "op_num": 1, >>>>> "op_name": "rmcoll", >>>>> "collection": "25.0_head" >>>>> } >>>>> >>>>> -13> 2019-04-26 19:23:05.199 7fb2667de700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.0/rpm/el7/BUILD/ceph-14.2.0/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)' thread 7fb2667de700 time 2019-04-26 19:23:05.193826 >>>>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.0/rpm/el7/BUILD/ceph-14.2.0/src/os/bluestore/BlueStore.cc: 11069: abort() >>>>> >>>>> ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) nautilus (stable) >>>>> 1: (ceph::__ceph_abort(char const*, int, char const*, std::string const&)+0xd8) [0x7c1454ee40] >>>>> 2: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x2a85) [0x7c14b2d5f5] >>>>> 3: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x526) [0x7c14b2e366] >>>>> 4: (ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ObjectStore::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x7f) [0x7c1470a81f] >>>>> 5: (PG::_delete_some(ObjectStore::Transaction*)+0x83d) [0x7c1476d70d] >>>>> 6: (PG::RecoveryState::Deleting::react(PG::DeleteSome const&)+0x38) [0x7c1476e528] >>>>> 7: (boost::statechart::simple_state<PG::RecoveryState::Deleting, PG::RecoveryState::ToDelete, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x16a) [0x7c147acc8a] >>>>> 8: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x5a) [0x7c1478a91a] >>>>> 9: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PG::RecoveryCtx*)+0x119) [0x7c14779c99] >>>>> 10: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x1b4) [0x7c146b4494] >>>>> 11: (OSD::dequeue_delete(OSDShard*, PG*, unsigned int, ThreadPool::TPHandle&)+0x234) [0x7c146b48d4] >>>>> 12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x9f4) [0x7c146a8c14] >>>>> 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x433) [0x7c14ca0f43] >>>>> 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7c14ca3fe0] >>>>> 15: (()+0x7dd5) [0x7fb284e3bdd5] >>>>> 16: (clone()+0x6d) [0x7fb283d01ead] >>>>> >>>>> -12> 2019-04-26 19:23:05.199 7fb280b24700 1 -- [v2:192.168.1.152:6800/1887364,v1:192.168.1.152:6801/1887364] <== client.244681 v1:192.168.1.50:0/337348516 6 ==== osd_op(client.244681.0:2571786 12.5es0 12.43dfa65e (undecoded) ondisk+retry+write+known_if_redirected e7648) v8 ==== 229+0+4194304 (unknown 2519985655 0 813231621) 0x7c2f3ee840 con 0x7c2f2fac00 >>>>> -11> 2019-04-26 19:23:05.199 7fb280b24700 15 osd.2 7648 enqueue_op 0x7c300f8420 prio 63 cost 4194304 latency 0.035786 epoch 7648 osd_op(client.244681.0:2571786 12.5es0 12.43dfa65e (undecoded) ondisk+retry+write+known_if_redirected e7648) v8 >>>>> -10> 2019-04-26 19:23:05.199 7fb280b24700 20 osd.2 op_wq(4) _enqueue OpQueueItem(12.5es0 PGOpItem(op=osd_op(client.244681.0:2571786 12.5es0 12.43dfa65e (undecoded) ondisk+retry+write+known_if_redirected e7648) v8) prio 63 cost 4194304 e7648) >>>>> -9> 2019-04-26 19:23:05.200 7fb2647da700 20 osd.2 op_wq(4) _process 12.5es0 to_process <> waiting <> waiting_peering {} >>>>> -8> 2019-04-26 19:23:05.200 7fb2647da700 20 osd.2 op_wq(4) _process OpQueueItem(12.5es0 PGOpItem(op=osd_op(client.244681.0:2571786 12.5es0 12.43dfa65e (undecoded) ondisk+retry+write+known_if_redirected e7648) v8) prio 63 cost 4194304 e7648) queued >>>>> -7> 2019-04-26 19:23:05.200 7fb2647da700 20 osd.2 op_wq(4) _process 12.5es0 to_process <OpQueueItem(12.5es0 PGOpItem(op=osd_op(client.244681.0:2571786 12.5es0 12.43dfa65e (undecoded) ondisk+retry+write+known_if_redirected e7648) v8) prio 63 cost 4194304 e7648)> waiting <> waiting_peering {} >>>>> -6> 2019-04-26 19:23:05.200 7fb2647da700 20 osd.2 op_wq(4) _process OpQueueItem(12.5es0 PGOpItem(op=osd_op(client.244681.0:2571786 12.5es0 12.43dfa65e (undecoded) ondisk+retry+write+known_if_redirected e7648) v8) prio 63 cost 4194304 e7648) pg 0x7c271c6000 >>>>> -5> 2019-04-26 19:23:05.200 7fb2647da700 10 osd.2 7648 dequeue_op 0x7c300f8420 prio 63 cost 4194304 latency 0.036428 osd_op(client.244681.0:2571786 12.5es0 12.43dfa65e (undecoded) ondisk+retry+write+known_if_redirected e7648) v8 pg pg[12.5es0( v 7592'250137 (5872'250020,7592'250137] local-lis/les=7488/7489 n=22756 ec=5088/733 lis/c 7488/7488 les/c/f 7489/7489/5043 7648/7648/7648) [2,2147483647,0,7,4]p2(0) r=0 lpr=7648 pi=[7488,7648)/1 crt=7592'250135 lcod 0'0 mlcod 0'0 peering mbc={}] >>>>> -4> 2019-04-26 19:23:05.200 7fb2647da700 20 osd.2 7648 share_map client.244681 v1:192.168.1.50:0/337348516 7648 >>>>> -3> 2019-04-26 19:23:05.200 7fb2647da700 20 osd.2 7648 should_share_map client.244681 v1:192.168.1.50:0/337348516 7648 >>>>> -2> 2019-04-26 19:23:05.200 7fb2647da700 10 osd.2 7648 dequeue_op 0x7c300f8420 finish >>>>> -1> 2019-04-26 19:23:05.200 7fb2647da700 20 osd.2 op_wq(4) _process empty q, waiting >>>>> 0> 2019-04-26 19:23:05.206 7fb2667de700 -1 *** Caught signal (Aborted) ** >>>>> in thread 7fb2667de700 thread_name:tp_osd_tp >>>>> >>>>> >>>>> On Fri, Apr 26, 2019 at 3:57 PM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote: >>>>>> >>>>>> You'll probably want to generate a log with "debug osd = 20" and >>>>>> "debug bluestore = 20", then share that or upload it with >>>>>> ceph-post-file, to get more useful info about which PGs are breaking >>>>>> (is it actually the ones that were supposed to delete?). >>>>>> >>>>>> If there's a particular set of PGs you need to rescue, you can also >>>>>> look at using the ceph-objectstore-tool to export them off the busted >>>>>> OSD stores and import them into OSDs that still work. >>>>>> >>>>>> >>>>>> On Fri, Apr 26, 2019 at 12:01 PM Elise Burke <elise.null@xxxxxxxxx> wrote: >>>>>>> >>>>>>> Hi, >>>>>>> >>>>>>> I upgraded to Nautilus a week or two ago and things had been mostly fine. I was interested in trying the device health stats feature and enabled it. In doing so it created a pool, device_health_metrics, which contained zero bytes. >>>>>>> >>>>>>> Unfortunately this pool developed a PG that could not be repaired with `ceph pg repair`. That's okay, I thought, this pool is empty (zero bytes), so I'll just remove it and discard the PG entirely. >>>>>>> >>>>>>> So I did: `ceph osd pool rm device_health_metrics device_health_metrics --yes-i-really-really-mean-it` >>>>>>> >>>>>>> Within a few seconds three OSDs had gone missing (this pool was size=3) and now crashloop at startup. >>>>>>> >>>>>>> Any assistance in getting these OSDs up (such as by discarding the errant PG) would be appreciated. I'm most concerned about the other pools in the system, as losing three OSDs at once has not been ideal. >>>>>>> >>>>>>> This is made more difficult as these are in the Bluestore configuration and were set up with ceph-deploy to bare metal (using LVM mode). >>>>>>> >>>>>>> Here's the traceback as noted in journalctl: >>>>>>> >>>>>>> Apr 26 11:01:43 databox ceph-osd[1878533]: -5381> 2019-04-26 11:01:08.902 7f8a00866d80 -1 Falling back to public interface >>>>>>> Apr 26 11:01:43 databox ceph-osd[1878533]: -4241> 2019-04-26 11:01:41.835 7f8a00866d80 -1 osd.2 7630 log_to_monitors {default=true} >>>>>>> Apr 26 11:01:43 databox ceph-osd[1878533]: -3> 2019-04-26 11:01:43.203 7f89dee53700 -1 bluestore(/var/lib/ceph/osd/ceph-2) _txc_add_transaction error (39) Directory not empty not handled on operation 21 (op 1, counting from 0) >>>>>>> Apr 26 11:01:43 databox ceph-osd[1878533]: -1> 2019-04-26 11:01:43.209 7f89dee53700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14 >>>>>>> Apr 26 11:01:43 databox ceph-osd[1878533]: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.0/rpm/el7/BUILD/ceph-14.2.0/src/os/bluest >>>>>>> Apr 26 11:01:43 databox ceph-osd[1878533]: ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) nautilus (stable) >>>>>>> Apr 26 11:01:43 databox ceph-osd[1878533]: 1: (ceph::__ceph_abort(char const*, int, char const*, std::string const&)+0xd8) [0xfc63afe40] >>>>>>> Apr 26 11:01:43 databox ceph-osd[1878533]: 2: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x2a85) [0xfc698e5f5] >>>>>>> Apr 26 11:01:43 databox ceph-osd[1878533]: 3: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr >>>>>>> Apr 26 11:01:43 databox ceph-osd[1878533]: 4: (ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ObjectStore::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x7f) [0xfc656b81f >>>>>>> Apr 26 11:01:43 databox ceph-osd[1878533]: 5: (PG::_delete_some(ObjectStore::Transaction*)+0x83d) [0xfc65ce70d] >>>>>>> Apr 26 11:01:43 databox ceph-osd[1878533]: 6: (PG::RecoveryState::Deleting::react(PG::DeleteSome const&)+0x38) [0xfc65cf528] >>>>>>> Apr 26 11:01:43 databox ceph-osd[1878533]: 7: (boost::statechart::simple_state<PG::RecoveryState::Deleting, PG::RecoveryState::ToDelete, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na >>>>>>> Apr 26 11:01:43 databox ceph-osd[1878533]: 8: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost >>>>>>> Apr 26 11:01:43 databox ceph-osd[1878533]: 9: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PG::RecoveryCtx*)+0x119) [0xfc65dac99] >>>>>>> Apr 26 11:01:43 databox ceph-osd[1878533]: 10: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x1b4) [0xfc6515494] >>>>>>> Apr 26 11:01:43 databox ceph-osd[1878533]: 11: (OSD::dequeue_delete(OSDShard*, PG*, unsigned int, ThreadPool::TPHandle&)+0x234) [0xfc65158d4] >>>>>>> Apr 26 11:01:43 databox ceph-osd[1878533]: 12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x9f4) [0xfc6509c14] >>>>>>> Apr 26 11:01:43 databox ceph-osd[1878533]: 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x433) [0xfc6b01f43] >>>>>>> Apr 26 11:01:43 databox ceph-osd[1878533]: 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xfc6b04fe0] >>>>>>> Apr 26 11:01:43 databox ceph-osd[1878533]: 15: (()+0x7dd5) [0x7f89fd4b0dd5] >>>>>>> Apr 26 11:01:43 databox ceph-osd[1878533]: 16: (clone()+0x6d) [0x7f89fc376ead] >>>>>>> Apr 26 11:01:43 databox ceph-osd[1878533]: 0> 2019-04-26 11:01:43.217 7f89dee53700 -1 *** Caught signal (Aborted) ** >>>>>>> Apr 26 11:01:43 databox ceph-osd[1878533]: in thread 7f89dee53700 thread_name:tp_osd_tp >>>>>>> Apr 26 11:01:43 databox ceph-osd[1878533]: ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) nautilus (stable) >>>>>>> Apr 26 11:01:43 databox ceph-osd[1878533]: 1: (()+0xf5d0) [0x7f89fd4b85d0] >>>>>>> Apr 26 11:01:43 databox ceph-osd[1878533]: 2: (gsignal()+0x37) [0x7f89fc2af207] >>>>>>> Apr 26 11:01:43 databox ceph-osd[1878533]: 3: (abort()+0x148) [0x7f89fc2b08f8] >>>>>>> Apr 26 11:01:43 databox ceph-osd[1878533]: 4: (ceph::__ceph_abort(char const*, int, char const*, std::string const&)+0x19c) [0xfc63aff04] >>>>>>> Apr 26 11:01:43 databox ceph-osd[1878533]: 5: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x2a85) [0xfc698e5f5] >>>>>>> Apr 26 11:01:43 databox ceph-osd[1878533]: 6: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr >>>>>>> Apr 26 11:01:43 databox ceph-osd[1878533]: 7: (ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ObjectStore::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x7f) [0xfc656b81f >>>>>>> Apr 26 11:01:43 databox ceph-osd[1878533]: 8: (PG::_delete_some(ObjectStore::Transaction*)+0x83d) [0xfc65ce70d] >>>>>>> Apr 26 11:01:43 databox ceph-osd[1878533]: 9: (PG::RecoveryState::Deleting::react(PG::DeleteSome const&)+0x38) [0xfc65cf528] >>>>>>> Apr 26 11:01:43 databox ceph-osd[1878533]: 10: (boost::statechart::simple_state<PG::RecoveryState::Deleting, PG::RecoveryState::ToDelete, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::n >>>>>>> Apr 26 11:01:43 databox ceph-osd[1878533]: 11: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boos >>>>>>> Apr 26 11:01:43 databox ceph-osd[1878533]: 12: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PG::RecoveryCtx*)+0x119) [0xfc65dac99] >>>>>>> Apr 26 11:01:43 databox ceph-osd[1878533]: 13: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x1b4) [0xfc6515494] >>>>>>> Apr 26 11:01:43 databox ceph-osd[1878533]: 14: (OSD::dequeue_delete(OSDShard*, PG*, unsigned int, ThreadPool::TPHandle&)+0x234) [0xfc65158d4] >>>>>>> Apr 26 11:01:43 databox ceph-osd[1878533]: 15: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x9f4) [0xfc6509c14] >>>>>>> Apr 26 11:01:43 databox ceph-osd[1878533]: 16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x433) [0xfc6b01f43] >>>>>>> Apr 26 11:01:43 databox ceph-osd[1878533]: 17: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xfc6b04fe0] >>>>>>> Apr 26 11:01:43 databox ceph-osd[1878533]: 18: (()+0x7dd5) [0x7f89fd4b0dd5] >>>>>>> Apr 26 11:01:43 databox ceph-osd[1878533]: 19: (clone()+0x6d) [0x7f89fc376ead] >>>>>>> Apr 26 11:01:43 databox ceph-osd[1878533]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. >>>>>>> >>>>>>> Thanks! >>>>>>> >>>>>>> -Elise >>>>>>> _______________________________________________ >>>>>>> ceph-users mailing list >>>>>>> ceph-users@xxxxxxxxxxxxxx >>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> _______________________________________________ >> ceph-users mailing list >> ceph-users@xxxxxxxxxxxxxx >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> > > > > _______________________________________________ > ceph-users mailing list > ceph-users@xxxxxxxxxxxxxx > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > -- _________________________________________ D i e t m a r R i e d e r, Mag.Dr. Innsbruck Medical University Biocenter - Division for Bioinformatics Innrain 80, 6020 Innsbruck Phone: +43 512 9003 71402 Fax: +43 512 9003 73100 Email: dietmar.rieder@xxxxxxxxxxx Web: http://www.icbi.at
Attachment:
signature.asc
Description: OpenPGP digital signature
_______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com