Re: mds failing to start 14.2.2

Kenneth Waegeman <kenneth.waegeman@xxxxxxxx> · Tue, 15 Oct 2019 09:29:12 +0200



    Hi Zheng,
    Thanks, that made me realize I had forgotten to remove some
      'temporary-key' entries left over from the inconsistency issue I
      had. Once those were removed, the mds started again.

    
    Thanks again!
    Kenneth

    
    On 12/10/2019 04:26, Yan, Zheng wrote:

    
          On Sat, Oct 12, 2019 at 1:10 AM Kenneth Waegeman <kenneth.waegeman@xxxxxxxx> wrote:

          
          Hi all,

            
            After solving some pg inconsistency problems, my fs is still
            in trouble. My MDSs are crashing with this error:

            
            >     -5> 2019-10-11 19:02:55.375 7f2d39f10700  1 mds.1.564276 rejoin_start
            >     -4> 2019-10-11 19:02:55.385 7f2d3d717700  5 mds.beacon.mds01 received beacon reply up:rejoin seq 5 rtt 1.01
            >     -3> 2019-10-11 19:02:55.495 7f2d39f10700  1 mds.1.564276 rejoin_joint_start
            >     -2> 2019-10-11 19:02:55.505 7f2d39f10700  5 mds.mds01 handle_mds_map old map epoch 564279 <= 564279, discarding
            >     -1> 2019-10-11 19:02:55.695 7f2d33f04700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.2/rpm/el7/BUILD/ceph-14.2.2/src/mds/mdstypes.h: In function 'static void dentry_key_t::decode_helper(std::string_view, std::string&, snapid_t&)' thread 7f2d33f04700 time 2019-10-11 19:02:55.703343
            > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.2/rpm/el7/BUILD/ceph-14.2.2/src/mds/mdstypes.h: 1229: FAILED ceph_assert(i != string::npos)

            >

            >  ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)
            >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x7f2d43393046]
            >  2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x7f2d43393214]
            >  3: (CDir::_omap_fetched(ceph::buffer::v14_2_0::list&, std::map<std::string, ceph::buffer::v14_2_0::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::v14_2_0::list> > >&, bool, int)+0xa68) [0x556a17ecbaa8]
            >  4: (C_IO_Dir_OMAP_Fetched::finish(int)+0x54) [0x556a17ee0034]
            >  5: (MDSContext::complete(int)+0x70) [0x556a17f5e710]
            >  6: (MDSIOContextBase::complete(int)+0x16b) [0x556a17f5e9ab]
            >  7: (Finisher::finisher_thread_entry()+0x156) [0x7f2d433d8386]
            >  8: (()+0x7dd5) [0x7f2d41262dd5]
            >  9: (clone()+0x6d) [0x7f2d3ff1302d]

            >

            >      0> 2019-10-11 19:02:55.695 7f2d33f04700 -1 *** Caught signal (Aborted) **
            >  in thread 7f2d33f04700 thread_name:fn_anonymous
            >
            >  ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)
            >  1: (()+0xf5d0) [0x7f2d4126a5d0]
            >  2: (gsignal()+0x37) [0x7f2d3fe4b2c7]
            >  3: (abort()+0x148) [0x7f2d3fe4c9b8]
            >  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x199) [0x7f2d43393095]
            >  5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x7f2d43393214]
            >  6: (CDir::_omap_fetched(ceph::buffer::v14_2_0::list&, std::map<std::string, ceph::buffer::v14_2_0::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::v14_2_0::list> > >&, bool, int)+0xa68) [0x556a17ecbaa8]
            >  7: (C_IO_Dir_OMAP_Fetched::finish(int)+0x54) [0x556a17ee0034]
            >  8: (MDSContext::complete(int)+0x70) [0x556a17f5e710]
            >  9: (MDSIOContextBase::complete(int)+0x16b) [0x556a17f5e9ab]
            >  10: (Finisher::finisher_thread_entry()+0x156) [0x7f2d433d8386]
            >  11: (()+0x7dd5) [0x7f2d41262dd5]
            >  12: (clone()+0x6d) [0x7f2d3ff1302d]
            >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

            >

            > [root@mds02 ~]# ceph -s
            >   cluster:
            >     id:     92bfcf0a-1d39-43b3-b60f-44f01b630e47
            >     health: HEALTH_WARN
            >             1 filesystem is degraded
            >             insufficient standby MDS daemons available
            >             1 MDSs behind on trimming
            >             1 large omap objects
            >
            >   services:
            >     mon: 3 daemons, quorum mds01,mds02,mds03 (age 4d)
            >     mgr: mds02(active, since 3w), standbys: mds01, mds03
            >     mds: ceph_fs:2/2 {0=mds02=up:rejoin,1=mds01=up:rejoin(laggy or crashed)}
            >     osd: 535 osds: 533 up, 529 in
            >
            >   data:
            >     pools:   3 pools, 3328 pgs
            >     objects: 376.32M objects, 673 TiB
            >     usage:   1.0 PiB used, 2.2 PiB / 3.2 PiB avail
            >     pgs:     3315 active+clean
            >              12   active+clean+scrubbing+deep
            >              1    active+clean+scrubbing

            >

            Does anyone have an idea where to go from here? ☺

            
          It looks like the omap for a dirfrag is corrupted. Please check
            the mds log (debug_mds = 10) to find which omap is corrupted.
            Basically, all omap keys of a dirfrag should be in the format
            xxxx_xxxx.
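
            As a rough illustration of the key format check, here is a
            minimal Python sketch that flags omap keys which would trip
            the `i != string::npos` assert from the log, i.e. keys with
            no '_' separator at all. The function names are hypothetical,
            and the stricter suffix rule ("head" or a hex snapid after
            the last '_') is an assumption about how dentry keys are
            named, not something stated in this thread. The key list
            would have to come from wherever you dump the dirfrag's omap
            keys.

```python
import string

HEX_DIGITS = set(string.hexdigits)


def is_wellformed_dentry_key(key: str) -> bool:
    """Return True if key looks like '<name>_<suffix>'.

    Assumption (not from the thread): the suffix is either the
    literal 'head' or a hexadecimal snapshot id.
    """
    i = key.rfind("_")
    if i == -1:
        # No separator at all: the case the assert in the log complains about.
        return False
    suffix = key[i + 1:]
    if suffix == "head":
        return True
    # Stricter than the assert itself: also require a non-empty hex suffix.
    return len(suffix) > 0 and all(c in HEX_DIGITS for c in suffix)


def find_suspect_keys(keys):
    """Return the keys that do not match the expected format, in input order."""
    return [k for k in keys if not is_wellformed_dentry_key(k)]


if __name__ == "__main__":
    # Hypothetical sample keys, for illustration only.
    sample = ["somefile_head", "otherfile_1a2", "temporary-key"]
    for k in find_suspect_keys(sample):
        print("suspect omap key:", k)
```

            Anything this reports is only a candidate for closer
            inspection; check it against the mds log before removing
            a key.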
          

            Thanks!

            
            K

            
            _______________________________________________

            ceph-users mailing list

            ceph-users@xxxxxxxxxxxxxx

            http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

          