On Sun, Mar 18, 2018 at 9:37 PM, Wyllys Ingersoll
<wyllys.ingersoll@xxxxxxxxxxxxxx> wrote:
> Yes, it does look like https://tracker.ceph.com/issues/21337, but we
> are seeing it on Jewel.
>

Jewel has the same issue.

> Also, I only have a single active MDS, the other 2 are standby - is
> this ok to use with snapshots?
>

Snapshots are still an experimental feature. They mostly work in a
single-active-MDS setup, but you may encounter issues.

> $ ceph mds stat
> e18481: 1/1/1 up {0=mon03=up:active}, 2 up:standby
>
> On Sun, Mar 18, 2018 at 5:44 AM, Yan, Zheng <zyan@xxxxxxxxxx> wrote:
>>
>>> On 18 Mar 2018, at 06:59, Wyllys Ingersoll <wyllys.ingersoll@xxxxxxxxxxxxxx> wrote:
>>>
>>> No, I have 3 MDS and I am taking snapshots pretty regularly (capped at
>>> 360 total).
>>>
>>
>> Snapshot support in a multi-MDS setup is not ready for general use. Please
>> don't use snapshots, or use a single-active-MDS setup.
>>
>>> I managed to recover and restart my MDS (all 3) after using the
>>> cephfs-journal-tool and cephfs-table-tool reset features, but it's
>>> worrisome that it got into that state in the first place.
>>>
>>> On Sat, Mar 17, 2018 at 2:11 PM, Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:
>>>> Hello Wyllys,
>>>>
>>>> On Sat, Mar 17, 2018 at 6:37 AM, Wyllys Ingersoll
>>>> <wyllys.ingersoll@xxxxxxxxxxxxxx> wrote:
>>>>> Ubuntu 16.04.3
>>>>>
>>>>> One of my MDS servers keeps crashing and will not restart. The
>>>>> cluster has 3 MDS; the other 2 are up, but the first one will not
>>>>> restart. The logs are below. Any ideas what is wrong or how to get it
>>>>> back up and running?
>>>>
>>>> You only use one active, correct?
>>>>
>>>>> $ ceph -s
>>>>>     cluster ba0c94fc-1168-11e6-aaea-000c290cc2d4
>>>>>      health HEALTH_WARN
>>>>>             mds cluster is degraded
>>>>>      monmap e1: 3 mons at
>>>>> {mon01=10.16.51.21:6789/0,mon02=10.16.51.22:6789/0,mon03=10.16.51.23:6789/0}
>>>>>             election epoch 352, quorum 0,1,2 mon01,mon02,mon03
>>>>>       fsmap e18460: 1/1/1 up {0=mon03=up:replay}
>>>>>      osdmap e427025: 93 osds: 93 up, 89 in
>>>>>             flags sortbitwise,require_jewel_osds
>>>>>       pgmap v51310487: 18960 pgs, 21 pools, 26329 GB data, 12939 kobjects
>>>>>             80586 GB used, 188 TB / 267 TB avail
>>>>>                18960 active+clean
>>>>>   client io 0 B/s rd, 290 kB/s wr, 40 op/s rd, 87 op/s wr
>>>>>
>>>>> 2018-03-17 09:25:49.846771 7f425e3bf700 -1 *** Caught signal (Aborted) **
>>>>>  in thread 7f425e3bf700 thread_name:md_log_replay
>>>>>
>>>>>  ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
>>>>>  1: (()+0x535e6e) [0x557481697e6e]
>>>>>  2: (()+0x11390) [0x7f426bfb6390]
>>>>>  3: (gsignal()+0x38) [0x7f426a39a428]
>>>>>  4: (abort()+0x16a) [0x7f426a39c02a]
>>>>>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>>>>> const*)+0x26b) [0x5574817a0aab]
>>>>>  6: (EOpen::replay(MDSRank*)+0x75e) [0x55748167da0e]
>>>>>  7: (MDLog::_replay_thread()+0xe38) [0x5574815fa718]
>>>>>  8: (MDLog::ReplayThread::entry()+0xd) [0x5574813ac09d]
>>>>>  9: (()+0x76ba) [0x7f426bfac6ba]
>>>>>  10: (clone()+0x6d) [0x7f426a46c3dd]
>>>>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>>>>> needed to interpret this.
>>>>
>>>> This looks like: https://tracker.ceph.com/issues/21337
>>>>
>>>> Are you using snapshots? The issue above was not backported to Jewel.
>>>>
>>>> --
>>>> Patrick Donnelly
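
For context on why snapshots are called experimental above: in Jewel, CephFS
will not create new snapshots unless the operator explicitly opts in on the
MDS map. A minimal sketch of that opt-in, assuming the Jewel-era command
syntax (check the docs for your release before running it):

# Jewel-era syntax (assumed); new snapshots stay disabled until this is set.
$ ceph mds set allow_new_snaps true --yes-i-really-mean-it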
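The recovery Wyllys mentions (cephfs-journal-tool plus cephfs-table-tool
resets) likely maps to the standard CephFS disaster-recovery sequence. A
hedged sketch of that sequence, assuming all MDS daemons are stopped first
and that a journal reset is actually warranted for the damage at hand; keep
the exported backup and adapt the steps to your own cluster:

# Back up the MDS journal before modifying anything.
$ cephfs-journal-tool journal export backup.bin

# Replay whatever metadata events can still be recovered into the backing pool.
$ cephfs-journal-tool event recover_dentries summary

# Erase the damaged journal (destructive; only after the export above succeeded).
$ cephfs-journal-tool journal reset

# Reset the session table so clients re-establish their sessions cleanly;
# the snap and inode tables can be reset the same way if needed.
$ cephfs-table-tool all reset session

# Restart the MDS daemons and confirm rank 0 gets past up:replay.
$ ceph mds stat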