No, I have 3 MDS and I am taking snapshots pretty regularly (capped
at 360 total). I managed to recover and restart my MDS daemons (all 3)
after using the cephfs-journal-tool and cephfs-table-tool reset
features, but it's worrisome that it got into that state in the first
place.

On Sat, Mar 17, 2018 at 2:11 PM, Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:
> Hello Wyllys,
>
> On Sat, Mar 17, 2018 at 6:37 AM, Wyllys Ingersoll
> <wyllys.ingersoll@xxxxxxxxxxxxxx> wrote:
>> Ubuntu 16.04.3
>>
>> One of my MDS servers keeps crashing and will not restart. The
>> cluster has 3 MDS; the other 2 are up, but the first one will not
>> restart. The logs are below. Any ideas what is wrong or how to get
>> it back up and running?
>
> You only use one active, correct?
>
>> $ ceph -s
>>     cluster ba0c94fc-1168-11e6-aaea-000c290cc2d4
>>      health HEALTH_WARN
>>             mds cluster is degraded
>>      monmap e1: 3 mons at
>> {mon01=10.16.51.21:6789/0,mon02=10.16.51.22:6789/0,mon03=10.16.51.23:6789/0}
>>             election epoch 352, quorum 0,1,2 mon01,mon02,mon03
>>       fsmap e18460: 1/1/1 up {0=mon03=up:replay}
>>      osdmap e427025: 93 osds: 93 up, 89 in
>>             flags sortbitwise,require_jewel_osds
>>       pgmap v51310487: 18960 pgs, 21 pools, 26329 GB data, 12939 kobjects
>>             80586 GB used, 188 TB / 267 TB avail
>>                18960 active+clean
>>   client io 0 B/s rd, 290 kB/s wr, 40 op/s rd, 87 op/s wr
>>
>> 2018-03-17 09:25:49.846771 7f425e3bf700 -1 *** Caught signal (Aborted) **
>>  in thread 7f425e3bf700 thread_name:md_log_replay
>>
>>  ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
>>  1: (()+0x535e6e) [0x557481697e6e]
>>  2: (()+0x11390) [0x7f426bfb6390]
>>  3: (gsignal()+0x38) [0x7f426a39a428]
>>  4: (abort()+0x16a) [0x7f426a39c02a]
>>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x26b) [0x5574817a0aab]
>>  6: (EOpen::replay(MDSRank*)+0x75e) [0x55748167da0e]
>>  7: (MDLog::_replay_thread()+0xe38) [0x5574815fa718]
>>  8: (MDLog::ReplayThread::entry()+0xd) [0x5574813ac09d]
>>  9: (()+0x76ba) [0x7f426bfac6ba]
>>  10: (clone()+0x6d) [0x7f426a46c3dd]
>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>>  needed to interpret this.
>
> This looks like: https://tracker.ceph.com/issues/21337
>
> Are you using snapshots? The issue above was not backported to Jewel.
>
> --
> Patrick Donnelly
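
[Editor's note: the exact recovery commands are not shown in this
thread. The following is a minimal sketch of the standard Jewel-era
(10.2.x) CephFS journal-recovery sequence that the cephfs-journal-tool
and cephfs-table-tool "reset" steps mentioned above typically refer
to, following the upstream disaster-recovery documentation rather
than the poster's actual session.]

  # Stop all MDS daemons before touching the journal, e.g.:
  systemctl stop ceph-mds@<hostname>

  # 1. Export a backup of the journal before any destructive step.
  cephfs-journal-tool journal export backup.bin

  # 2. Replay whatever metadata events are still readable from the
  #    journal back into the metadata pool, so the reset in the next
  #    step discards only what was already unrecoverable.
  cephfs-journal-tool event recover_dentries summary

  # 3. Truncate (reset) the damaged journal.
  cephfs-journal-tool journal reset

  # 4. Reset the session table; "all" applies the reset to every MDS
  #    rank. Depending on the damage, the snap and inode tables can
  #    be reset the same way (reset snap / reset inode), but with
  #    snapshots in active use those steps carry more risk.
  cephfs-table-tool all reset session

  # Then restart the MDS daemons and let them rejoin.

Step 2 is the important one: running recover_dentries before the
journal reset is what makes this a recovery rather than plain data
loss.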