Re: All MDS's Crashed, Failed Assert

On Fri, Aug 2, 2024 at 8:27 AM <m@xxxxxxxxxxxx> wrote:
>
> I'm looking for guidance on how to recover after all MDSs continue to crash with a failed assert during journal replay (no MON damage).
>
> Context:
>
> So I've been working through MDS failures for the past day, likely caused by a large snaptrim operation that caused the cluster to grind to a halt.
>
> After evicting all clients and restarting the MDSs (it appears the clients were overwhelming them), the MDSs are failing to start with:
>
> debug     -1> 2024-07-24T18:44:52.674+0000 7f7878c22700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.2/rpm/el8/BUILD/ceph-18.2.2/src/osdc/Journaler.cc: In function 'bool Journaler::try_read_entry(ceph::bufferlist&)' thread 7f7878c22700 time 2024-07-24T18:44:52.676027+0000
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.2/rpm/el8/BUILD/ceph-18.2.2/src/osdc/Journaler.cc: 1256: FAILED ceph_assert(start_ptr == read_pos)
>  ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x7f788aa32e15]
>  2: /usr/lib64/ceph/libceph-common.so.2(+0x2a9fdb) [0x7f788aa32fdb]
>  3: (Journaler::try_read_entry(ceph::buffer::v15_2_0::list&)+0x132) [0x55555847ef32]
>  4: (MDLog::_replay_thread()+0xda) [0x555558436bea]
>  5: (MDLog::ReplayThread::entry()+0x11) [0x5555580e52d1]
>  6: /lib64/libpthread.so.0(+0x81ca) [0x7f78897d81ca]
>  7: clone()
> debug      0> 2024-07-24T18:44:52.674+0000 7f7878c22700 -1 *** Caught signal (Aborted) **
>  in thread 7f7878c22700 thread_name:md_log_replay
>  ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)
>  1: /lib64/libpthread.so.0(+0x12d20) [0x7f78897e2d20]
>  2: gsignal()
>  3: abort()
>  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x18f) [0x7f788aa32e6f]
>  5: /usr/lib64/ceph/libceph-common.so.2(+0x2a9fdb) [0x7f788aa32fdb]
>  6: (Journaler::try_read_entry(ceph::buffer::v15_2_0::list&)+0x132) [0x55555847ef32]
>  7: (MDLog::_replay_thread()+0xda) [0x555558436bea]
>  8: (MDLog::ReplayThread::entry()+0x11) [0x5555580e52d1]
>  9: /lib64/libpthread.so.0(+0x81ca) [0x7f78897d81ca]
>  10: clone()
>
> Normally, three MDSs are deployed: one active, one on hot standby. The cluster seems to believe that any restarted MDS is attempting to replay, but systemd reports an immediate crash with SIGABRT.
>
> ceph mds stat
> cephfs:1/1 {0=cephfs.sm1.esxjag=up:replay(laggy or crashed)}
>
> Redeployed MDSs also continue to crash (suggesting a bad journal?).
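
If the journal is suspect, a non-destructive first step (a sketch, assuming the
filesystem is named "cephfs" and rank 0 is the affected rank, as the mds stat
output above suggests) would be to inspect it and take a backup before
attempting any recovery:

  # report whether the journal header/entries look damaged
  cephfs-journal-tool --rank=cephfs:0 journal inspect

  # export a copy of the journal to a file, just in case
  cephfs-journal-tool --rank=cephfs:0 journal export mds0-journal-backup.bin

The destructive steps from the disaster-recovery docs (event recover_dentries,
journal reset) should only come after that backup, and ideally after advice
from the list.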

Do you have an MDS log to share that could help identify the cause of this?
I haven't seen this crash backtrace before, and this is on 18.2.2, so that's
making me worried.
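
To capture that, one option (a sketch; exact log locations depend on whether
the cluster is containerized or package-based) is to raise the MDS debug
levels before the next restart attempt and pull the resulting log:

  # verbose MDS, journaler and messenger logging
  ceph config set mds debug_mds 20
  ceph config set mds debug_journaler 20
  ceph config set mds debug_ms 1

  # after the MDS aborts again, collect the ceph-mds.*.log from the host
  # and the recorded backtrace:
  ceph crash ls
  ceph crash info <crash-id>

Remember to reset the debug levels afterwards, since they are very chatty.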



-- 
Cheers,
Venky
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



