Hi,
have you checked the MDS journal for any damage (replace {CEPHFS} with
the name of your filesystem)?
cephfs-journal-tool --rank={CEPHFS}:all journal inspect
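If the inspect does report damage, the usual next steps from the Ceph
disaster-recovery docs are roughly the following (this is just a sketch,
rank 0 assumed since you run a single active MDS; take a journal backup
first, and note that recover_dentries/journal reset discard unreplayed
journal metadata, so treat them as a last resort):
cephfs-journal-tool --rank={CEPHFS}:0 journal export backup.bin
cephfs-journal-tool --rank={CEPHFS}:0 event recover_dentries summary
cephfs-journal-tool --rank={CEPHFS}:0 journal reset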
Quoting m@xxxxxxxxxxxx:
I'm looking for guidance on how to recover after all MDS daemons
continue to crash with a failed assert during journal replay (no MON
damage).
Context:
I've been working through failed MDS daemons for the past day, likely
triggered by a large snaptrim operation that ground the cluster to a
halt.
After evicting all clients and restarting the MDS daemons (it appears
the clients were overwhelming them), the MDSs are failing to start
with:
debug -1> 2024-07-24T18:44:52.674+0000 7f7878c22700 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.2/rpm/el8/BUILD/ceph-18.2.2/src/osdc/Journaler.cc: In function 'bool Journaler::try_read_entry(ceph::bufferlist&)' thread 7f7878c22700 time
2024-07-24T18:44:52.676027+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.2/rpm/el8/BUILD/ceph-18.2.2/src/osdc/Journaler.cc: 1256: FAILED ceph_assert(start_ptr ==
read_pos)
ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x135) [0x7f788aa32e15]
2: /usr/lib64/ceph/libceph-common.so.2(+0x2a9fdb) [0x7f788aa32fdb]
3: (Journaler::try_read_entry(ceph::buffer::v15_2_0::list&)+0x132)
[0x55555847ef32]
4: (MDLog::_replay_thread()+0xda) [0x555558436bea]
5: (MDLog::ReplayThread::entry()+0x11) [0x5555580e52d1]
6: /lib64/libpthread.so.0(+0x81ca) [0x7f78897d81ca]
7: clone()
debug 0> 2024-07-24T18:44:52.674+0000 7f7878c22700 -1 ***
Caught signal (Aborted) **
in thread 7f7878c22700 thread_name:md_log_replay
ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)
1: /lib64/libpthread.so.0(+0x12d20) [0x7f78897e2d20]
2: gsignal()
3: abort()
4: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x18f) [0x7f788aa32e6f]
5: /usr/lib64/ceph/libceph-common.so.2(+0x2a9fdb) [0x7f788aa32fdb]
6: (Journaler::try_read_entry(ceph::buffer::v15_2_0::list&)+0x132)
[0x55555847ef32]
7: (MDLog::_replay_thread()+0xda) [0x555558436bea]
8: (MDLog::ReplayThread::entry()+0x11) [0x5555580e52d1]
9: /lib64/libpthread.so.0(+0x81ca) [0x7f78897d81ca]
10: clone()
Normally three MDS daemons are deployed: one active and one on hot
standby. The cluster believes any restarted MDS is attempting to
replay, but systemd reports an immediate crash with SIGABRT.
ceph mds stat
cephfs:1/1 {0=cephfs.sm1.esxjag=up:replay(laggy or crashed)}
Redeployed MDS daemons also continue to crash (suggesting a bad journal?)
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx