Re: Help! How to recover from total monitor failure in Luminous

Frank Li <frli@xxxxxxxxxxxxxxxxxxxx> · Fri, 2 Feb 2018 17:54:39 +0000

Yes, I was dealing with an issue where OSDs were not peering, and I was trying to see whether force-create-pg could help recover the peering.
Data loss is an accepted possibility.

I hope this is what you are looking for:

    -3> 2018-01-31 22:47:22.942394 7fc641d0b700  5 mon.dl1-kaf101@0(electing) e6 _ms_dispatch setting monitor caps on this connection
    -2> 2018-01-31 22:47:22.942405 7fc641d0b700  5 mon.dl1-kaf101@0(electing).paxos(paxos recovering c 28110997..28111530) is_readable = 0 - now=2018-01-31 22:47:22.942405 lease_expire=0.000000 has v0 lc 28111530
    -1> 2018-01-31 22:47:22.942422 7fc641d0b700  5 mon.dl1-kaf101@0(electing).paxos(paxos recovering c 28110997..28111530) is_readable = 0 - now=2018-01-31 22:47:22.942422 lease_expire=0.000000 has v0 lc 28111530
     0> 2018-01-31 22:47:22.955415 7fc64350e700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/osd/OSDMapMapping.h: In function 'void OSDMapMapping::get(pg_t, std::vector<int>*, int*, std::vector<int>*, int*) const' thread 7fc64350e700 time 2018-01-31 22:47:22.952877
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/osd/OSDMapMapping.h: 288: FAILED assert(pgid.ps() < p->second.pg_num)

 ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)
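
The failed assertion at OSDMapMapping.h:288 compares a PG's placement seed (`pgid.ps()`) with the `pg_num` recorded for its pool. The Python sketch below is a toy model of that invariant, not Ceph's code; the names `PoolMapping` and `check_pg` and the sample pool are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class PoolMapping:
    """Toy stand-in for a pool's cached mapping; only pg_num matters here."""
    pg_num: int


def check_pg(pools: Dict[int, PoolMapping], pool_id: int, ps: int) -> bool:
    """Return True when the placement seed `ps` is valid for `pool_id`.

    Mirrors the shape of the failed check: the seed must be strictly less
    than the pool's pg_num. An unknown pool is also treated as invalid.
    """
    pool = pools.get(pool_id)
    return pool is not None and 0 <= ps < pool.pg_num


# Hypothetical pool 1 with 8 PGs: valid seeds are 0..7.
pools = {1: PoolMapping(pg_num=8)}
print(check_pg(pools, 1, 7))  # seed inside the range
print(check_pg(pools, 1, 8))  # seed equal to pg_num is out of range
```

The log above does not say which PG ID or pool triggered the failure, so this only shows what the check guards against.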

-- 
Efficiency is Intelligent Laziness
On 2/2/18, 9:45 AM, "Sage Weil" <sage@xxxxxxxxxxxx> wrote:

    On Fri, 2 Feb 2018, Frank Li wrote:
    > Hi, I ran the ceph osd force-create-pg command in Luminous 12.2.2 to recover a failed PG, and it
    > instantly caused all of the monitors to crash. Is there any way to revert to an earlier state of the cluster?
    > Right now the monitors refuse to come up; the error message is below.
    > I’ve filed a Ceph ticket for the crash, but I wonder whether there is a way to get the cluster back up.
    > 
    > https://tracker.ceph.com/issues/22847

    Can you include the bit of the log a few lines up that contains the
    assertion and the file and line number that failed?
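
One way to pull out the requested lines is to scan the monitor log for `FAILED assert` and keep a little context around each hit. The Python sketch below assumes a plain-text log already split into lines; the function name `find_failed_asserts`, the context size, and the shortened sample path are illustrative choices, not part of any Ceph tooling.

```python
import re
from typing import List, Tuple

# Matches lines such as ".../OSDMapMapping.h: 288: FAILED assert(...)".
ASSERT_RE = re.compile(
    r"(?P<file>\S+): (?P<line>\d+): FAILED assert\((?P<expr>.*)\)\s*$"
)


def find_failed_asserts(
    lines: List[str], context: int = 3
) -> List[Tuple[str, int, str, List[str]]]:
    """Return (file, line number, expression, preceding lines) per failed assert."""
    hits = []
    for i, text in enumerate(lines):
        m = ASSERT_RE.search(text)
        if m:
            before = lines[max(0, i - context):i]
            hits.append((m["file"], int(m["line"]), m["expr"], before))
    return hits


# Sample input; the build path is shortened for the example.
sample = [
    "mon.a@0(electing) e6 _ms_dispatch setting monitor caps on this connection",
    "/build/ceph-12.2.2/src/osd/OSDMapMapping.h: 288: FAILED assert(pgid.ps() < p->second.pg_num)",
]
for path, lineno, expr, before in find_failed_asserts(sample):
    print(f"{path}:{lineno}: assert({expr})")
```

To run it against a real log, read the file with `open(path).read().splitlines()` and pass the result in place of `sample`.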

    Also, "during the course of trouble-shooting an osd issue" makes me 
    nervous: force-create-pg creates a new, *empty* PG when all copies of the 
    old one have been lost.  Is that what you meant to do?  It essentially
    tells the system to give up and accept that there is data loss.  Is
    that what you meant?

    Thanks!
    sage

    > 
    > --- begin dump of recent events ---
    >      0> 2018-01-31 22:47:22.959665 7fc64350e700 -1 *** Caught signal (Aborted) **
    > in thread 7fc64350e700 thread_name:cpu_tp
    > 
    > ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)
    > 1: (()+0x8eae11) [0x55f1113fae11]
    > 2: (()+0xf5e0) [0x7fc64aafa5e0]
    > 3: (gsignal()+0x37) [0x7fc647fca1f7]
    > 4: (abort()+0x148) [0x7fc647fcb8e8]
    > 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x284) [0x55f1110fa4a4]
    > 6: (()+0x2ccc4e) [0x55f110ddcc4e]
    > 7: (OSDMonitor::update_creating_pgs()+0x98b) [0x55f11102232b]
    > 8: (C_UpdateCreatingPGs::finish(int)+0x79) [0x55f1110777b9]
    > 9: (Context::complete(int)+0x9) [0x55f110ed30c9]
    > 10: (ParallelPGMapper::WQ::_process(ParallelPGMapper::Item*, ThreadPool::TPHandle&)+0x7f) [0x55f111204e1f]
    > 11: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa8e) [0x55f111100f1e]
    > 12: (ThreadPool::WorkThread::entry()+0x10) [0x55f111101e00]
    > 13: (()+0x7e25) [0x7fc64aaf2e25]
    > 14: (clone()+0x6d) [0x7fc64808d34d]
    > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
    > 

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com