Re: Nautilus (14.2.0) OSDs crashing at startup after removing a pool containing a PG with an unrepairable error

You'll probably want to generate a log with "debug osd = 20" and
"debug bluestore = 20", then share that or upload it with
ceph-post-file, to get more useful info about which PGs are breaking
(is it actually the ones that were supposed to be deleted?).
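
For example, a minimal sketch; the osd.2 id and default log path below
are taken from your quoted report, so adjust both to match your hosts:

    # in /etc/ceph/ceph.conf on the OSD host:
    [osd]
        debug osd = 20
        debug bluestore = 20

    # restart the OSD, let it hit the crash again, then upload the log:
    ceph-post-file -d "nautilus osd.2 crash after pool removal" \
        /var/log/ceph/ceph-osd.2.log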

If there's a particular set of PGs you need to rescue, you can also
look at using the ceph-objectstore-tool to export them off the busted
OSD stores and import them into OSDs that still work.
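
Roughly like this, as a sketch: the osd.2 data path comes from your
log, while osd.5 and <pgid> are placeholders to fill in, and both OSDs
must be stopped before running the tool:

    # export the PG from the broken OSD's store:
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2 \
        --pgid <pgid> --op export --file /tmp/<pgid>.export

    # import it into a healthy (stopped) OSD, then restart that OSD:
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-5 \
        --op import --file /tmp/<pgid>.export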


On Fri, Apr 26, 2019 at 12:01 PM Elise Burke <elise.null@xxxxxxxxx> wrote:
>
> Hi,
>
> I upgraded to Nautilus a week or two ago and things had been mostly fine. I was interested in trying the device health stats feature and enabled it. In doing so it created a pool, device_health_metrics, which contained zero bytes.
>
> Unfortunately this pool developed a PG that could not be repaired with `ceph pg repair`. That's okay, I thought, this pool is empty (zero bytes), so I'll just remove it and discard the PG entirely.
>
> So I did: `ceph osd pool rm device_health_metrics device_health_metrics --yes-i-really-really-mean-it`
>
> Within a few seconds three OSDs had gone missing (this pool was size=3) and they now crash-loop at startup.
>
> Any assistance in getting these OSDs up (such as by discarding the errant PG) would be appreciated. I'm most concerned about the other pools in the system, as losing three OSDs at once has not been ideal.
>
> This is made more difficult because these OSDs use BlueStore and were set up with ceph-deploy on bare metal (using LVM mode).
>
> Here's the traceback as noted in journalctl:
>
> Apr 26 11:01:43 databox ceph-osd[1878533]: -5381> 2019-04-26 11:01:08.902 7f8a00866d80 -1 Falling back to public interface
> Apr 26 11:01:43 databox ceph-osd[1878533]: -4241> 2019-04-26 11:01:41.835 7f8a00866d80 -1 osd.2 7630 log_to_monitors {default=true}
> Apr 26 11:01:43 databox ceph-osd[1878533]: -3> 2019-04-26 11:01:43.203 7f89dee53700 -1 bluestore(/var/lib/ceph/osd/ceph-2) _txc_add_transaction error (39) Directory not empty not handled on operation 21 (op 1, counting from 0)
> Apr 26 11:01:43 databox ceph-osd[1878533]: -1> 2019-04-26 11:01:43.209 7f89dee53700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14
> Apr 26 11:01:43 databox ceph-osd[1878533]: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.0/rpm/el7/BUILD/ceph-14.2.0/src/os/bluest
> Apr 26 11:01:43 databox ceph-osd[1878533]: ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) nautilus (stable)
> Apr 26 11:01:43 databox ceph-osd[1878533]: 1: (ceph::__ceph_abort(char const*, int, char const*, std::string const&)+0xd8) [0xfc63afe40]
> Apr 26 11:01:43 databox ceph-osd[1878533]: 2: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x2a85) [0xfc698e5f5]
> Apr 26 11:01:43 databox ceph-osd[1878533]: 3: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr
> Apr 26 11:01:43 databox ceph-osd[1878533]: 4: (ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ObjectStore::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x7f) [0xfc656b81f
> Apr 26 11:01:43 databox ceph-osd[1878533]: 5: (PG::_delete_some(ObjectStore::Transaction*)+0x83d) [0xfc65ce70d]
> Apr 26 11:01:43 databox ceph-osd[1878533]: 6: (PG::RecoveryState::Deleting::react(PG::DeleteSome const&)+0x38) [0xfc65cf528]
> Apr 26 11:01:43 databox ceph-osd[1878533]: 7: (boost::statechart::simple_state<PG::RecoveryState::Deleting, PG::RecoveryState::ToDelete, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na
> Apr 26 11:01:43 databox ceph-osd[1878533]: 8: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost
> Apr 26 11:01:43 databox ceph-osd[1878533]: 9: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PG::RecoveryCtx*)+0x119) [0xfc65dac99]
> Apr 26 11:01:43 databox ceph-osd[1878533]: 10: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x1b4) [0xfc6515494]
> Apr 26 11:01:43 databox ceph-osd[1878533]: 11: (OSD::dequeue_delete(OSDShard*, PG*, unsigned int, ThreadPool::TPHandle&)+0x234) [0xfc65158d4]
> Apr 26 11:01:43 databox ceph-osd[1878533]: 12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x9f4) [0xfc6509c14]
> Apr 26 11:01:43 databox ceph-osd[1878533]: 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x433) [0xfc6b01f43]
> Apr 26 11:01:43 databox ceph-osd[1878533]: 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xfc6b04fe0]
> Apr 26 11:01:43 databox ceph-osd[1878533]: 15: (()+0x7dd5) [0x7f89fd4b0dd5]
> Apr 26 11:01:43 databox ceph-osd[1878533]: 16: (clone()+0x6d) [0x7f89fc376ead]
> Apr 26 11:01:43 databox ceph-osd[1878533]: 0> 2019-04-26 11:01:43.217 7f89dee53700 -1 *** Caught signal (Aborted) **
> Apr 26 11:01:43 databox ceph-osd[1878533]: in thread 7f89dee53700 thread_name:tp_osd_tp
> Apr 26 11:01:43 databox ceph-osd[1878533]: ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) nautilus (stable)
> Apr 26 11:01:43 databox ceph-osd[1878533]: 1: (()+0xf5d0) [0x7f89fd4b85d0]
> Apr 26 11:01:43 databox ceph-osd[1878533]: 2: (gsignal()+0x37) [0x7f89fc2af207]
> Apr 26 11:01:43 databox ceph-osd[1878533]: 3: (abort()+0x148) [0x7f89fc2b08f8]
> Apr 26 11:01:43 databox ceph-osd[1878533]: 4: (ceph::__ceph_abort(char const*, int, char const*, std::string const&)+0x19c) [0xfc63aff04]
> Apr 26 11:01:43 databox ceph-osd[1878533]: 5: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x2a85) [0xfc698e5f5]
> Apr 26 11:01:43 databox ceph-osd[1878533]: 6: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr
> Apr 26 11:01:43 databox ceph-osd[1878533]: 7: (ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ObjectStore::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x7f) [0xfc656b81f
> Apr 26 11:01:43 databox ceph-osd[1878533]: 8: (PG::_delete_some(ObjectStore::Transaction*)+0x83d) [0xfc65ce70d]
> Apr 26 11:01:43 databox ceph-osd[1878533]: 9: (PG::RecoveryState::Deleting::react(PG::DeleteSome const&)+0x38) [0xfc65cf528]
> Apr 26 11:01:43 databox ceph-osd[1878533]: 10: (boost::statechart::simple_state<PG::RecoveryState::Deleting, PG::RecoveryState::ToDelete, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::n
> Apr 26 11:01:43 databox ceph-osd[1878533]: 11: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boos
> Apr 26 11:01:43 databox ceph-osd[1878533]: 12: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PG::RecoveryCtx*)+0x119) [0xfc65dac99]
> Apr 26 11:01:43 databox ceph-osd[1878533]: 13: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x1b4) [0xfc6515494]
> Apr 26 11:01:43 databox ceph-osd[1878533]: 14: (OSD::dequeue_delete(OSDShard*, PG*, unsigned int, ThreadPool::TPHandle&)+0x234) [0xfc65158d4]
> Apr 26 11:01:43 databox ceph-osd[1878533]: 15: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x9f4) [0xfc6509c14]
> Apr 26 11:01:43 databox ceph-osd[1878533]: 16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x433) [0xfc6b01f43]
> Apr 26 11:01:43 databox ceph-osd[1878533]: 17: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xfc6b04fe0]
> Apr 26 11:01:43 databox ceph-osd[1878533]: 18: (()+0x7dd5) [0x7f89fd4b0dd5]
> Apr 26 11:01:43 databox ceph-osd[1878533]: 19: (clone()+0x6d) [0x7f89fc376ead]
> Apr 26 11:01:43 databox ceph-osd[1878533]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> Thanks!
>
> -Elise
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


