Re: Help with seemingly damaged MDS rank(?)

John Spray <jspray@xxxxxxxxxx> · Mon, 16 May 2016 10:26:38 +0100

On Sun, May 15, 2016 at 3:08 AM, Skaag Argonius <skaag@xxxxxxxxx> wrote:
> One of the issues was a different version of ceph on the nodes. They are now all back to version 9.0.2 and things are looking a bit better.

What was the other version?  We recently encountered someone who had a
mix of MDS versions, and the older MDSs concluded that a journal was
damaged because the newer version had written entries they didn't
understand.
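
The MDS log excerpts quoted further down contain startup banners for
both 9.0.2 and 9.1.0, so the daemon logs are one place to check for a
version mix.  A minimal sketch, assuming each daemon prints a
"ceph version X.Y.Z (...)" banner at startup as in those excerpts; the
helper name versions_seen and the regular expression are illustrative
only, not part of any Ceph tool:

```python
import re

# Matches the startup banner seen in the quoted logs, e.g.
# "ceph version 9.0.2 (be422c8f...)". The pattern is an assumption
# based on those excerpts only.
VERSION_RE = re.compile(r"ceph version (\d+(?:\.\d+)+)")


def versions_seen(log_lines):
    """Return the set of Ceph version strings found in the given lines."""
    found = set()
    for line in log_lines:
        match = VERSION_RE.search(line)
        if match:
            found.add(match.group(1))
    return found


# Two banner lines copied from the MDS logs quoted in this thread.
sample = [
    "2016-05-13 15:54:53.300449 7f1e1f47c7c0  0 ceph version 9.0.2 "
    "(be422c8f5b494c77ebcf0f7b95e5d728ecacb7f0), process ceph-mds, pid 3935832",
    "2016-05-13 19:10:15.609077 7f0e77eab780  0 ceph version 9.1.0 "
    "(3be81ae6cf17fcf689cd6f187c4615249fea4f61), process ceph-mds, pid 4028429",
]

seen = versions_seen(sample)
if len(seen) > 1:
    print("mixed versions:", sorted(seen))
```

Feeding it the lines of each node's MDS log in turn would show which
hosts ran which version.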

> We've had some help from a ceph engineer, and the OSDs are now all up and running:
>
> [root@stor2 ceph]# ceph osd tree
> ID  WEIGHT    TYPE NAME            UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -13 1.00000 root spare
>  -9   1.00000     host stor4-spare
>   4   1.00000         osd.4             up  1.00000          1.00000
> -12   2.00000 root meta
>  -7   1.00000     host stor2-meta
>   7   1.00000         osd.7             up  1.00000          1.00000
>  -8   1.00000     host stor4-meta
>   5   1.00000         osd.5             up  1.00000          1.00000
> -11   2.00000 root ssd
>  -4   1.00000     host stor2-ssd
>   2   1.00000         osd.2             up  1.00000          1.00000
>  -5   1.00000     host stor4-ssd
>   3   1.00000         osd.3             up  1.00000          1.00000
> -10 291.03998 root default
>  -2  72.75999     host stor2
>   0  72.75999         osd.0             up  1.00000          1.00000
>  -3  72.75999     host stor4
>   1  72.75999         osd.1             up  1.00000          1.00000
>  -1  72.75999     host stor1
>   6  72.75999         osd.6             up  1.00000          1.00000
> -14  72.75999     host stor3
>   8  72.75999         osd.8             up  1.00000          1.00000
>
>
> Here's an updated ceph status:
>
> [root@stor2 ceph]# ceph -s
>     cluster 926a1578-48e8-4559-b1b5-1f1a9b8e7566
>      health HEALTH_ERR
>             mds rank 2 is damaged
>             mds cluster is degraded
>             mds stor4 is laggy
>      monmap e23: 1 mons at {0=192.168.202.101:6789/0}
>             election epoch 1, quorum 0 0
>      mdsmap e11400: 1/3/2 up {0=stor4=up:replay(laggy or crashed)}, 1 damaged
>      osdmap e12828: 9 osds: 9 up, 9 in
>       pgmap v22081981: 224 pgs, 4 pools, 101 TB data, 62358 kobjects
>             203 TB used, 90068 GB / 291 TB avail
>                  222 active+clean
>                    2 active+clean+scrubbing+deep
>
>
> And here's a ceph mds dump:
>
> [root@stor2 ceph]# ceph mds dump
> dumped mdsmap epoch 11400
> epoch   11400
> flags   0
> created 2015-04-20 02:39:19.238757
> modified        2016-05-14 17:15:24.482294
> tableserver     0
> root    0
> session_timeout 60
> session_autoclose       300
> max_file_size   1099511627776
> last_failure    11399
> last_failure_osd_epoch  12827
> compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table}
> max_mds 2
> in      0,1,2
> up      {0=10024473}
> failed
> damaged 2
> stopped
> data_pools      1
> metadata_pool   2
> inline_data     disabled
> 10024473:       192.168.202.13:6816/3461545 'stor4' mds.0.276 up:replay seq 1 laggy since 2016-05-14 17:15:24.482177
>
>
>
> But as you can see above, the mds is still damaged and the cluster is degraded, and our files are still inaccessible.
>
> Any ideas how to fix an mds with damaged rank?

Your first step is to work out why the rank was marked damaged in the
first place.  Unfortunately, the version you're running hits "FAILED
assert(mds->mds_lock.is_locked_by_me())" when the MDS tries to report
damage during journal replay (this bug has since been fixed), so you
may not get a nice cluster log message when the rank goes damaged.

I also notice that you're using three active MDSs.  We recommend only
using a single active MDS at the moment, as that's the most tested
configuration (http://docs.ceph.com/docs/hammer/cephfs/early-adopters/).
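
The rank count comes from the "in 0,1,2" line of the quoted
"ceph mds dump" output.  As a rough sketch, the relevant fields can be
pulled out of that text as follows.  It assumes the plain
"key value" layout shown in this thread (other releases may format the
dump differently), and parse_mds_dump and rank_list are hypothetical
helper names:

```python
def parse_mds_dump(text):
    """Split each 'key value' line of the dump into a dict entry.

    Assumes the whitespace-separated layout quoted in this thread;
    keys with no value (e.g. 'failed') map to an empty string.
    """
    fields = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if not parts:
            continue  # skip blank lines
        fields[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return fields


def rank_list(value):
    """Turn a comma-separated rank string such as '0,1,2' into ints."""
    return [int(rank) for rank in value.split(",") if rank]


# Lines copied from the dump quoted above.
dump = """\
max_mds 2
in      0,1,2
up      {0=10024473}
failed
damaged 2
stopped
"""

fields = parse_mds_dump(dump)
in_ranks = rank_list(fields["in"])
damaged = rank_list(fields["damaged"])
max_mds = int(fields["max_mds"])

print("ranks in:", in_ranks, "damaged:", damaged, "max_mds:", max_mds)
if len(in_ranks) > 1:
    print("more than one rank is in the MDS map")
```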

Anyway.  Stop all your MDSs.  Run "ceph mds repaired 2" to clear the
damaged flag (this doesn't fix anything, but it makes the MDS try to
start again so that we can see the error).  Then start your MDSs one
by one until one crashes.  Take the one that crashes and examine its
log to see what caused the crash (not just the backtrace, but what was
going on immediately before it).
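
To pull out what was logged immediately before the crash, something
along these lines may help.  It is only a sketch: context_before is a
hypothetical helper, and the "FAILED assert" marker is taken from the
crash log quoted below rather than from any guaranteed log format.

```python
from collections import deque


def context_before(lines, marker, before=20):
    """Find the first line containing marker.

    Returns (preceding_lines, matching_line), where preceding_lines
    holds up to `before` lines logged just ahead of the match, or
    None if the marker never appears.
    """
    recent = deque(maxlen=before)
    for line in lines:
        if marker in line:
            return list(recent), line
        recent.append(line)
    return None


# Lines shortened from the MDS crash log quoted in this thread.
sample = [
    "2016-05-13 15:19:27.447753 7f7a82580700 -1 log_channel(cluster) log "
    "[ERR] : corrupt journal event at 4194304~472 / 10426264",
    "2016-05-13 15:19:27.448901 7f7a82580700 -1 mds/Beacon.cc: In function "
    "'void Beacon::notify_health(const MDS*)' thread 7f7a82580700",
    "mds/Beacon.cc: 291: FAILED assert(mds->mds_lock.is_locked_by_me())",
]

result = context_before(sample, "FAILED assert", before=2)
if result is not None:
    preceding, hit = result
    for line in preceding:
        print(line)
    print(">>>", hit)
```

For a real log, the lines of the crashed daemon's log file (for
example /var/log/ceph/ceph-mds.stor2.log, as named in the quoted
output) can be passed in place of the sample list.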

John

> Thanks again,
>
> Skaag
>
>
>
>
>> On May 14, 2016, at 9:00 AM, Skaag Argonius <skaag@xxxxxxxxx> wrote:
>>
>> We have a rather urgent situation, and I need help doing one of two things:
>>
>> 1. Fix the MDS and regain a working cluster (ideal)
>> 2. Find a way to extract the contents so I can move it to a new cluster (need to do this anyway)
>>
>> We have 4 physical storage machines: stor1, stor2, stor3, stor4
>> The MDS is on stor2.
>>
>> I'm including as much info as I can below.
>>
>>
>> -------8<-----------------------------------------------------------------------------
>>
>> Cluster status:
>>
>> [root@stor1 /]# ceph -s
>>    cluster 926a1578-48e8-4559-b1b5-1f1a9b8e7566
>>     health HEALTH_ERR
>>            2 pgs backfill
>>            24 pgs backfilling
>>            122 pgs degraded
>>            14 pgs stale
>>            122 pgs stuck degraded
>>            14 pgs stuck stale
>>            122 pgs stuck unclean
>>            122 pgs stuck undersized
>>            122 pgs undersized
>>            recovery 26146044/127710264 objects degraded (20.473%)
>>            recovery 21267236/127710264 objects misplaced (16.653%)
>>            mds ranks 0,2 have failed
>>            mds cluster is degraded
>>            mds gentoo-backup is laggy
>>     monmap e15: 1 mons at {0=192.168.202.101:6789/0}
>>            election epoch 1, quorum 0 0
>>     mdsmap e11373: 1/2/2 up {1=gentoo-backup=up:replay(laggy or crashed)}, 2 failed
>>     osdmap e12755: 9 osds: 6 up, 6 in; 26 remapped pgs
>>      pgmap v22065162: 224 pgs, 4 pools, 101 TB data, 62358 kobjects
>>            169 TB used, 50604 GB / 218 TB avail
>>            26146044/127710264 objects degraded (20.473%)
>>            21267236/127710264 objects misplaced (16.653%)
>>                  96 active+undersized+degraded
>>                  87 active+clean
>>                  24 active+undersized+degraded+remapped+backfilling
>>                  14 stale+active+clean
>>                   2 active+undersized+degraded+remapped+wait_backfill
>>                   1 active+clean+scrubbing+deep
>> recovery io 100993 kB/s, 68 objects/s
>>
>> -------8<-----------------------------------------------------------------------------
>>
>>
>> History:
>>
>> We had a single MDS, things were working fine. One of our sysadmins changed permissions in some sensitive folder, apparently (I don't have all the details).
>>
>> As a result the MDS went down. He then tried to bring a backup mds into the mix (called 'gentoo-backup'), but I think that MDS was an exact replica of the first MDS, so obviously it wasn't going to work well...
>>
>> We tried to kick the backup MDS out, tried to mark it as failed, tried to make it inactive, tried to reduce max_mds to 1, nothing helped.
>>
>>
>> -------8<-----------------------------------------------------------------------------
>>
>> MDS Crash logs:
>>
>> 2016-05-13 15:19:27.447753 7f7a82580700 -1 log_channel(cluster) log [ERR] : corrupt journal event at 4194304~472 / 10426264
>> 2016-05-13 15:19:27.448901 7f7a82580700 -1 mds/Beacon.cc: In function 'void Beacon::notify_health(const MDS*)' thread 7f7a82580700 time 2016-05-13 15:19:27.447829
>> mds/Beacon.cc: 291: FAILED assert(mds->mds_lock.is_locked_by_me())
>>
>> ceph version 9.0.2 (be422c8f5b494c77ebcf0f7b95e5d728ecacb7f0)
>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x98b645]
>> 2: (Beacon::notify_health(MDS const*)+0x74) [0x5d7d54]
>> 3: (MDS::damaged()+0x42) [0x5a9bd2]
>> 4: (MDLog::_replay_thread()+0x1c02) [0x803e62]
>> 5: (MDLog::ReplayThread::entry()+0xd) [0x5c88bd]
>> 6: (()+0x7df5) [0x7f7a8c564df5]
>> 7: (clone()+0x6d) [0x7f7a8b24b1ad]
>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>
>> -------8<-----------------------------------------------------------------------------
>>
>> More from MDS log:
>>
>> 2016-05-13 15:19:27.451896 7f7a82580700 -1 *** Caught signal (Aborted) **
>> in thread 7f7a82580700
>>
>> ceph version 9.0.2 (be422c8f5b494c77ebcf0f7b95e5d728ecacb7f0)
>> 1: /usr/bin/ceph-mds() [0x8916b2]
>> 2: (()+0xf130) [0x7f7a8c56c130]
>> 3: (gsignal()+0x37) [0x7f7a8b18a5d7]
>> 4: (abort()+0x148) [0x7f7a8b18bcc8]
>> 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f7a8ba8e9b5]
>> 6: (()+0x5e926) [0x7f7a8ba8c926]
>> 7: (()+0x5e953) [0x7f7a8ba8c953]
>> 8: (()+0x5eb73) [0x7f7a8ba8cb73]
>> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0x98b83a]
>> 10: (Beacon::notify_health(MDS const*)+0x74) [0x5d7d54]
>> 11: (MDS::damaged()+0x42) [0x5a9bd2]
>> 12: (MDLog::_replay_thread()+0x1c02) [0x803e62]
>> 13: (MDLog::ReplayThread::entry()+0xd) [0x5c88bd]
>> 14: (()+0x7df5) [0x7f7a8c564df5]
>> 15: (clone()+0x6d) [0x7f7a8b24b1ad]
>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>
>> ...
>> ...
>> ...
>>
>>  log_file /var/log/ceph/ceph-mds.stor2.log
>> --- end dump of recent events ---
>> 2016-05-13 15:54:53.300449 7f1e1f47c7c0  0 ceph version 9.0.2 (be422c8f5b494c77ebcf0f7b95e5d728ecacb7f0), process ceph-mds, pid 3935832
>> 2016-05-13 15:54:53.324083 7f1e1f47c7c0 -1 mds.-1.0 log_to_monitors {default=true}
>> 2016-05-13 15:54:53.572405 7f1e19aaf700  1 mds.-1.0 handle_mds_map standby
>> 2016-05-13 15:56:17.247033 7f1e151a5700 -1 mds.-1.0 *** got signal Terminated ***
>> 2016-05-13 15:56:17.250635 7f1e151a5700  1 mds.-1.0 suicide.  wanted down:dne, now up:standby
>> 2016-05-13 15:56:18.405716 7f6ba594c7c0  0 ceph version 9.0.2 (be422c8f5b494c77ebcf0f7b95e5d728ecacb7f0), process ceph-mds, pid 3940158
>> 2016-05-13 15:56:18.428399 7f6ba594c7c0 -1 mds.-1.0 log_to_monitors {default=true}
>> 2016-05-13 15:56:18.984429 7f6b9ff7f700  1 mds.-1.0 handle_mds_map standby
>> 2016-05-13 15:56:59.516357 7f6b9ff7f700  1 mds.-1.0 handle_mds_map standby
>> 2016-05-13 15:57:02.637735 7f6b9ff7f700  1 mds.-1.0 handle_mds_map standby
>> 2016-05-13 15:57:05.774427 7f6b9ff7f700  1 mds.-1.0 handle_mds_map standby
>>
>> -------8<-----------------------------------------------------------------------------
>>
>> Last portion from MDS log on stor2 (from right now):
>>
>> 2016-05-13 19:10:15.609077 7f0e77eab780  0 ceph version 9.1.0 (3be81ae6cf17fcf689cd6f187c4615249fea4f61), process ceph-mds, pid 4028429
>> 2016-05-13 19:10:16.464975 7f0e726d8700  1 mds.stor2 handle_mds_map standby
>> 2016-05-13 19:10:16.467750 7f0e726d8700  1 mds.0.267 handle_mds_map i am now mds.0.267
>> 2016-05-13 19:10:16.467757 7f0e726d8700  1 mds.0.267 handle_mds_map state change up:boot --> up:replay
>> 2016-05-13 19:10:16.467778 7f0e726d8700  1 mds.0.267 replay_start
>> 2016-05-13 19:10:16.467784 7f0e726d8700  1 mds.0.267  recovery set is 1
>> 2016-05-13 19:10:16.467789 7f0e726d8700  1 mds.0.267  waiting for osdmap 12618 (which blacklists prior instance)
>> 2016-05-13 19:10:16.470750 7f0e6d4cc700  0 mds.0.cache creating system inode with ino:100
>> 2016-05-13 19:10:16.470880 7f0e6d4cc700  0 mds.0.cache creating system inode with ino:1
>> 2016-05-13 19:10:16.878435 7f0e6c1c7700  1 mds.0.267 replay_done
>> 2016-05-13 19:10:16.878454 7f0e6c1c7700  1 mds.0.267 making mds journal writeable
>> 2016-05-13 19:10:17.681414 7f0e726d8700  1 mds.0.267 handle_mds_map i am now mds.0.267
>> 2016-05-13 19:10:17.681419 7f0e726d8700  1 mds.0.267 handle_mds_map state change up:replay --> up:resolve
>> 2016-05-13 19:10:17.681431 7f0e726d8700  1 mds.0.267 resolve_start
>> 2016-05-13 19:10:17.681432 7f0e726d8700  1 mds.0.267 reopen_log
>> 2016-05-13 19:11:45.822580 7f0e726d8700  1 mds.0.cache handle_mds_failure mds.1 : recovery peers are 1
>> 2016-05-13 19:12:17.462003 7f0e726d8700  1 mds.0.cache handle_mds_failure mds.1 : recovery peers are 1
>> 2016-05-13 19:12:31.705710 7f0e726d8700  1 mds.0.cache handle_mds_failure mds.1 : recovery peers are 1
>> 2016-05-13 19:12:51.942145 7f0e726d8700  1 mds.0.cache handle_mds_failure mds.1 : recovery peers are 1
>> 2016-05-13 19:13:59.720073 7f0e6edd0700 -1 mds.stor2 *** got signal Terminated ***
>> 2016-05-13 19:13:59.720130 7f0e6edd0700  1 mds.stor2 suicide.  wanted state up:resolve
>> 2016-05-13 19:13:59.722837 7f0e6edd0700  1 mds.0.267 shutdown: shutting down rank 0
>>
>>
>> -------8<-----------------------------------------------------------------------------
>>
>> My own experience with storage systems is limited to NFS, GlusterFS and various exotic Fuse based filesystems. I came into this startup with ceph already installed the way it is (I inherited it), so be gentle with me please :-)
>>
>> Thanks in advance for any help/advice you can give!
>>
>> Skaag
>>
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com