Nautilus (14.2.0) OSDs crashing at startup after removing a pool containing a PG with an unrepairable error

Hi,

I upgraded to Nautilus a week or two ago and things had been mostly fine. I was interested in trying the device health stats feature, so I enabled it; doing so created a pool, device_health_metrics, containing zero bytes.

Unfortunately this pool developed a PG that could not be repaired with `ceph pg repair`. That's okay, I thought: the pool is empty (zero bytes), so I'll just remove it and discard the PG entirely.
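
For reference, the repair attempt looked roughly like this; the pgid below is a placeholder, since I don't have the exact output in front of me:

    # find the inconsistent PG
    ceph health detail | grep inconsistent

    # see which objects scrub flagged (placeholder pgid)
    rados list-inconsistent-obj <pgid> --format=json-pretty

    # ask the primary OSD to repair it
    ceph pg repair <pgid>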

So I did: `ceph osd pool rm device_health_metrics device_health_metrics --yes-i-really-really-mean-it`

Within a few seconds three OSDs had gone down (the pool was size=3), and they now crash-loop at startup.
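
In case it's useful, this is roughly how I confirmed which OSDs were affected (osd.2 is one of them):

    # which OSDs are down
    ceph osd tree down

    # overall cluster state
    ceph -s

    # recent log for one of the dead OSDs
    journalctl -u ceph-osd@2 -e --no-pager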

Any assistance in getting these OSDs up (such as by discarding the errant PG) would be appreciated. I'm most concerned about the other pools in the system, as losing three OSDs at once has not been ideal.

This is made more difficult because these OSDs are BlueStore and were deployed to bare metal with ceph-deploy (LVM mode).
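
My current guess at a way forward is to remove the half-deleted PG from each down OSD offline with ceph-objectstore-tool, along the lines of the sketch below. The pgid is a placeholder (presumably <pool-id>.<pg-num> for the removed pool), and I'm not at all sure this is safe, so corrections welcome:

    # make sure the OSD is stopped
    systemctl stop ceph-osd@2

    # for ceph-volume/LVM OSDs, re-activate without starting the daemon so
    # /var/lib/ceph/osd/ceph-2 is mounted (if I understand --no-systemd right)
    ceph-volume lvm activate --all --no-systemd

    # list the PGs present on this OSD to spot the one from the removed pool
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2 --op list-pgs

    # remove the offending PG (placeholder pgid)
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2 \
        --pgid <pgid> --op remove --force

    # then try starting the OSD again
    systemctl start ceph-osd@2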

Here's the traceback from journalctl:

Apr 26 11:01:43 databox ceph-osd[1878533]: -5381> 2019-04-26 11:01:08.902 7f8a00866d80 -1 Falling back to public interface
Apr 26 11:01:43 databox ceph-osd[1878533]: -4241> 2019-04-26 11:01:41.835 7f8a00866d80 -1 osd.2 7630 log_to_monitors {default=true}
Apr 26 11:01:43 databox ceph-osd[1878533]: -3> 2019-04-26 11:01:43.203 7f89dee53700 -1 bluestore(/var/lib/ceph/osd/ceph-2) _txc_add_transaction error (39) Directory not empty not handled on operation 21 (op 1, counting from 0)
Apr 26 11:01:43 databox ceph-osd[1878533]: -1> 2019-04-26 11:01:43.209 7f89dee53700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14
Apr 26 11:01:43 databox ceph-osd[1878533]: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.0/rpm/el7/BUILD/ceph-14.2.0/src/os/bluest
Apr 26 11:01:43 databox ceph-osd[1878533]: ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) nautilus (stable)
Apr 26 11:01:43 databox ceph-osd[1878533]: 1: (ceph::__ceph_abort(char const*, int, char const*, std::string const&)+0xd8) [0xfc63afe40]
Apr 26 11:01:43 databox ceph-osd[1878533]: 2: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x2a85) [0xfc698e5f5]
Apr 26 11:01:43 databox ceph-osd[1878533]: 3: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr
Apr 26 11:01:43 databox ceph-osd[1878533]: 4: (ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ObjectStore::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x7f) [0xfc656b81f
Apr 26 11:01:43 databox ceph-osd[1878533]: 5: (PG::_delete_some(ObjectStore::Transaction*)+0x83d) [0xfc65ce70d]
Apr 26 11:01:43 databox ceph-osd[1878533]: 6: (PG::RecoveryState::Deleting::react(PG::DeleteSome const&)+0x38) [0xfc65cf528]
Apr 26 11:01:43 databox ceph-osd[1878533]: 7: (boost::statechart::simple_state<PG::RecoveryState::Deleting, PG::RecoveryState::ToDelete, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na
Apr 26 11:01:43 databox ceph-osd[1878533]: 8: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost
Apr 26 11:01:43 databox ceph-osd[1878533]: 9: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PG::RecoveryCtx*)+0x119) [0xfc65dac99]
Apr 26 11:01:43 databox ceph-osd[1878533]: 10: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x1b4) [0xfc6515494]
Apr 26 11:01:43 databox ceph-osd[1878533]: 11: (OSD::dequeue_delete(OSDShard*, PG*, unsigned int, ThreadPool::TPHandle&)+0x234) [0xfc65158d4]
Apr 26 11:01:43 databox ceph-osd[1878533]: 12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x9f4) [0xfc6509c14]
Apr 26 11:01:43 databox ceph-osd[1878533]: 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x433) [0xfc6b01f43]
Apr 26 11:01:43 databox ceph-osd[1878533]: 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xfc6b04fe0]
Apr 26 11:01:43 databox ceph-osd[1878533]: 15: (()+0x7dd5) [0x7f89fd4b0dd5]
Apr 26 11:01:43 databox ceph-osd[1878533]: 16: (clone()+0x6d) [0x7f89fc376ead]
Apr 26 11:01:43 databox ceph-osd[1878533]: 0> 2019-04-26 11:01:43.217 7f89dee53700 -1 *** Caught signal (Aborted) **
Apr 26 11:01:43 databox ceph-osd[1878533]: in thread 7f89dee53700 thread_name:tp_osd_tp
Apr 26 11:01:43 databox ceph-osd[1878533]: ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) nautilus (stable)
Apr 26 11:01:43 databox ceph-osd[1878533]: 1: (()+0xf5d0) [0x7f89fd4b85d0]
Apr 26 11:01:43 databox ceph-osd[1878533]: 2: (gsignal()+0x37) [0x7f89fc2af207]
Apr 26 11:01:43 databox ceph-osd[1878533]: 3: (abort()+0x148) [0x7f89fc2b08f8]
Apr 26 11:01:43 databox ceph-osd[1878533]: 4: (ceph::__ceph_abort(char const*, int, char const*, std::string const&)+0x19c) [0xfc63aff04]
Apr 26 11:01:43 databox ceph-osd[1878533]: 5: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x2a85) [0xfc698e5f5]
Apr 26 11:01:43 databox ceph-osd[1878533]: 6: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr
Apr 26 11:01:43 databox ceph-osd[1878533]: 7: (ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ObjectStore::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x7f) [0xfc656b81f
Apr 26 11:01:43 databox ceph-osd[1878533]: 8: (PG::_delete_some(ObjectStore::Transaction*)+0x83d) [0xfc65ce70d]
Apr 26 11:01:43 databox ceph-osd[1878533]: 9: (PG::RecoveryState::Deleting::react(PG::DeleteSome const&)+0x38) [0xfc65cf528]
Apr 26 11:01:43 databox ceph-osd[1878533]: 10: (boost::statechart::simple_state<PG::RecoveryState::Deleting, PG::RecoveryState::ToDelete, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::n
Apr 26 11:01:43 databox ceph-osd[1878533]: 11: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boos
Apr 26 11:01:43 databox ceph-osd[1878533]: 12: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PG::RecoveryCtx*)+0x119) [0xfc65dac99]
Apr 26 11:01:43 databox ceph-osd[1878533]: 13: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x1b4) [0xfc6515494]
Apr 26 11:01:43 databox ceph-osd[1878533]: 14: (OSD::dequeue_delete(OSDShard*, PG*, unsigned int, ThreadPool::TPHandle&)+0x234) [0xfc65158d4]
Apr 26 11:01:43 databox ceph-osd[1878533]: 15: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x9f4) [0xfc6509c14]
Apr 26 11:01:43 databox ceph-osd[1878533]: 16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x433) [0xfc6b01f43]
Apr 26 11:01:43 databox ceph-osd[1878533]: 17: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xfc6b04fe0]
Apr 26 11:01:43 databox ceph-osd[1878533]: 18: (()+0x7dd5) [0x7f89fd4b0dd5]
Apr 26 11:01:43 databox ceph-osd[1878533]: 19: (clone()+0x6d) [0x7f89fc376ead]
Apr 26 11:01:43 databox ceph-osd[1878533]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
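
If a fuller log would help, I can also run one of the crashing OSDs in the foreground with logging turned up, something like this (assuming command-line config overrides work the way I think they do):

    # run osd.2 in the foreground with verbose osd/bluestore logging
    ceph-osd -f --cluster ceph --id 2 --setuser ceph --setgroup ceph \
        --debug_osd 20 --debug_bluestore 20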

Thanks!

-Elise
