On Fri, Aug 2, 2024 at 8:27 AM <m@xxxxxxxxxxxx> wrote:
>
> I'm looking for guidance on how to recover after all MDSs continue to
> crash with a failed assert during journal replay (no MON damage).
>
> Context:
>
> I've been working through failed MDSs for the past day, likely caused by
> a large snaptrim operation that caused the cluster to grind to a halt.
>
> After evicting all clients and restarting the MDSs (it appears the
> clients were overwhelming them), the MDSs are failing to start with:
>
> debug -1> 2024-07-24T18:44:52.674+0000 7f7878c22700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.2/rpm/el8/BUILD/ceph-18.2.2/src/osdc/Journaler.cc: In function 'bool Journaler::try_read_entry(ceph::bufferlist&)' thread 7f7878c22700 time 2024-07-24T18:44:52.676027+0000
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.2/rpm/el8/BUILD/ceph-18.2.2/src/osdc/Journaler.cc: 1256: FAILED ceph_assert(start_ptr == read_pos)
> ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x7f788aa32e15]
> 2: /usr/lib64/ceph/libceph-common.so.2(+0x2a9fdb) [0x7f788aa32fdb]
> 3: (Journaler::try_read_entry(ceph::buffer::v15_2_0::list&)+0x132) [0x55555847ef32]
> 4: (MDLog::_replay_thread()+0xda) [0x555558436bea]
> 5: (MDLog::ReplayThread::entry()+0x11) [0x5555580e52d1]
> 6: /lib64/libpthread.so.0(+0x81ca) [0x7f78897d81ca]
> 7: clone()
>
> debug 0> 2024-07-24T18:44:52.674+0000 7f7878c22700 -1 *** Caught signal (Aborted) **
> in thread 7f7878c22700 thread_name:md_log_replay
> ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)
> 1: /lib64/libpthread.so.0(+0x12d20) [0x7f78897e2d20]
> 2: gsignal()
> 3: abort()
> 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x18f) [0x7f788aa32e6f]
> 5: /usr/lib64/ceph/libceph-common.so.2(+0x2a9fdb) [0x7f788aa32fdb]
> 6: (Journaler::try_read_entry(ceph::buffer::v15_2_0::list&)+0x132) [0x55555847ef32]
> 7: (MDLog::_replay_thread()+0xda) [0x555558436bea]
> 8: (MDLog::ReplayThread::entry()+0x11) [0x5555580e52d1]
> 9: /lib64/libpthread.so.0(+0x81ca) [0x7f78897d81ca]
> 10: clone()
>
> Normally three MDSs are deployed: one active, one on hot standby. The
> cluster believes any restarted MDS is attempting to replay, but systemd
> reports an immediate crash with SIGABRT.
>
> ceph mds stat
> cephfs:1/1 {0=cephfs.sm1.esxjag=up:replay(laggy or crashed)}
>
> Redeployed MDSs also continue to crash (suggesting a bad journal?).

Do you have an MDS log to share that could help identify the cause? I
haven't seen this crash backtrace before, and this is in 18.2.2, so
that's making me worried.

> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx

--
Cheers,
Venky
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
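
[Editor's note] For readers hitting the same assert, a minimal sketch of the
usual first-response steps before any log is shared or the journal is touched.
The rank "cephfs:0" and the backup path below are illustrative and should be
adjusted to the affected filesystem; this follows the standard
cephfs-journal-tool workflow, not a fix confirmed in this thread.

    # Raise MDS and Journaler verbosity so the next crash produces a
    # log detailed enough to share on the list
    ceph config set mds debug_mds 20
    ceph config set mds debug_journaler 20

    # Export a backup of the rank 0 journal before attempting any repair
    # (path is an example; pick somewhere with enough free space)
    cephfs-journal-tool --rank=cephfs:0 journal export /root/mds0-journal-backup.bin

    # Inspect journal integrity; whether this reports OK or corruption
    # determines which disaster-recovery step, if any, comes next
    cephfs-journal-tool --rank=cephfs:0 journal inspect

The more invasive steps (event recover_dentries, journal reset) are documented
in the CephFS disaster-recovery guide and are best left until the cause of the
assert is understood.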