Well, that caused some excitement (either that or the small power disruption did)! One of my OSDs is now down because it keeps crashing on a failed assert (stack traces attached; also, I'm apparently running Mimic, not Luminous).

In the past a failed assert on an OSD has meant removing the disk, wiping it, re-adding it as a new one, and then having Ceph rebuild it from the other copies of the data. I did all of that manually before, but I'm trying to get more familiar with Ceph's own commands. Will the following do the same?

    ceph-volume lvm zap --destroy --osd-id 11
    # Presumably that has to be run from the node with OSD 11, not just
    # any ceph node?
    # Source: http://docs.ceph.com/docs/mimic/ceph-volume/lvm/zap

Do I need to remove the OSD first (ceph osd out 11; wait for stabilization; ceph osd purge 11) before I do this, and run "ceph-deploy osd create" afterwards? (The full sequence I have in mind is sketched below, after the quoted thread.)

Thanks,
Adam

On 6/26/19 6:35 AM, Paul Emmerich wrote:
> Have you tried: ceph osd force-create-pg <pgid>?
>
> If that doesn't work: use objectstore-tool on the OSD (while it's not
> running) and use it to force mark the PG as complete. (Don't know the
> exact command off the top of my head.)
>
> Caution: these are obviously really dangerous commands.
>
> Paul
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
>
>
> On Wed, Jun 26, 2019 at 1:56 AM ☣Adam <adam@xxxxxxxxx> wrote:
>
>     How can I tell ceph to give up on "incomplete" PGs?
>
>     I have 12 PGs which are "inactive, incomplete" and won't recover. I
>     think this is because in the past I have carelessly pulled disks too
>     quickly without letting the system recover. I suspect the disks that
>     held the data for these are long gone.
>
>     Whatever the reason, I want to fix it so I have a clean cluster, even
>     if that means losing data.
>
>     I went through the "troubleshooting pgs" guide[1], which is excellent,
>     but it didn't get me to a fix.
>
>     The output of `ceph pg 2.0 query` includes this:
>
>         "recovery_state": [
>             {
>                 "name": "Started/Primary/Peering/Incomplete",
>                 "enter_time": "2019-06-25 18:35:20.306634",
>                 "comment": "not enough complete instances of this PG"
>             },
>
>     I've already restarted all OSDs in various orders, and I changed
>     min_size to 1 to see if that would allow them to get fixed, but no
>     such luck. These pools are not erasure coded and I'm using the
>     Luminous release.
>
>     How can I tell ceph to give up on these PGs? There's nothing identified
>     as unfound, so mark_unfound_lost doesn't help. I feel like `ceph osd
>     lost` might be it, but at this point the OSD numbers have been reused
>     for new disks, so I'd really like to limit the damage to the 12 PGs
>     which are incomplete, if possible.
>
>     Thanks,
>     Adam
>
>     [1] http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/
>
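For reference, a rough sketch of the full replace-the-OSD sequence I have in mind (assuming OSD 11 lives on the node the commands are run from; the device path and hostname at the end are placeholders, and exact flags may differ between releases):

    # Drain the OSD and wait for the cluster to finish rebalancing
    ceph osd out 11
    ceph -s                                  # repeat until recovery completes

    # On the node hosting OSD 11: stop the daemon, then remove the OSD
    # from the CRUSH map, auth keys and OSD map in one step
    systemctl stop ceph-osd@11
    ceph osd purge 11 --yes-i-really-mean-it

    # Wipe the logical volumes/partitions backing OSD 11 so the disk can be reused
    ceph-volume lvm zap --destroy --osd-id 11

    # Re-create the OSD on the freed device (/dev/sdX and ceph-node1 are placeholders)
    ceph-deploy osd create --data /dev/sdX ceph-node1
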
ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14e) [0x7f7372987b5e]
 2: (()+0x2c4cb7) [0x7f7372987cb7]
 3: (PG::check_past_interval_bounds() const+0xae5) [0x564b8db12f05]
 4: (PG::RecoveryState::Reset::react(PG::AdvMap const&)+0x1bb) [0x564b8db43f5b]
 5: (boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x200) [0x564b8db92430]
 6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x4b) [0x564b8db65a4b]
 7: (PG::handle_advance_map(std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PG::RecoveryCtx*)+0x213) [0x564b8db27ca3]
 8: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*)+0x2b4) [0x564b8da92fa4]
 9: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0xb4) [0x564b8da93704]
 10: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x52) [0x564b8dcee862]
 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x926) [0x564b8daa0c26]
 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3d6) [0x7f737298c666]
 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f737298dce0]
 14: (()+0x76db) [0x7f73710296db]
 15: (clone()+0x3f) [0x7f736fff288f]

 -1143> 2019-06-26 08:56:54.398 7f73529f5700 -1 *** Caught signal (Aborted) **
 in thread 7f73529f5700 thread_name:tp_osd_tp

 ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
 1: (()+0x12890) [0x7f7371034890]
 2: (gsignal()+0xc7) [0x7f736ff0fe97]
 3: (abort()+0x141) [0x7f736ff11801]
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x25f) [0x7f7372987c6f]
 5: (()+0x2c4cb7) [0x7f7372987cb7]
 6: (PG::check_past_interval_bounds() const+0xae5) [0x564b8db12f05]
 7: (PG::RecoveryState::Reset::react(PG::AdvMap const&)+0x1bb) [0x564b8db43f5b]
 8: (boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x200) [0x564b8db92430]
 9: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x4b) [0x564b8db65a4b]
 10: (PG::handle_advance_map(std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PG::RecoveryCtx*)+0x213) [0x564b8db27ca3]
 11: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*)+0x2b4) [0x564b8da92fa4]
 12: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0xb4) [0x564b8da93704]
 13: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x52) [0x564b8dcee862]
 14: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x926) [0x564b8daa0c26]
 15: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3d6) [0x7f737298c666]
 16: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f737298dce0]
 17: (()+0x76db) [0x7f73710296db]
 18: (clone()+0x3f) [0x7f736fff288f]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com