Hi,

Today we suddenly experienced multiple MDS crashes during the day, with an error we have not seen before. We run Octopus 15.2.13 with 4 ranks, 4 standby-replay MDSes and 1 passive standby. Any input on how to troubleshoot or resolve this would be most welcome.

---

root@hk-cephnode-54:~# ceph crash ls
2021-08-09T08:06:41.573899Z_306a9a10-b9d7-4a68-83a9-f5bd3d700fd7  mds.hk-cephnode-58
2021-08-09T08:09:03.132838Z_9a62b1fc-6069-4576-974d-2e0464169bb5  mds.hk-cephnode-62
2021-08-09T11:20:23.776776Z_5a665d00-9862-4d8f-99b5-323cdf441966  mds.hk-cephnode-54
2021-08-09T11:25:14.213601Z_f47fa398-5582-4da6-8e18-9252bbb52805  mds.hk-cephnode-62
2021-08-09T12:44:34.190128Z_1e163bf2-6ddf-45ef-a80f-0bf42158da31  mds.hk-cephnode-60

---

*All the crash logs have the same assert_condition/file/msg*

root@hk-cephnode-54:~# ceph crash info 2021-08-09T12:44:34.190128Z_1e163bf2-6ddf-45ef-a80f-0bf42158da31
{
    "archived": "2021-08-09 12:53:01.429088",
    "assert_condition": "state == LOCK_XLOCK || state == LOCK_XLOCKDONE",
    "assert_file": "/build/ceph/ceph-15.2.13/src/mds/ScatterLock.h",
    "assert_func": "void ScatterLock::set_xlock_snap_sync(MDSContext*)",
    "assert_line": 59,
    "assert_msg": "/build/ceph/ceph-15.2.13/src/mds/ScatterLock.h: In function 'void ScatterLock::set_xlock_snap_sync(MDSContext*)' thread 7f0f76853700 time 2021-08-09T14:44:34.185861+0200\n/build/ceph/ceph-15.2.13/src/mds/ScatterLock.h: 59: FAILED ceph_assert(state == LOCK_XLOCK || state == LOCK_XLOCKDONE)\n",
    "assert_thread_name": "MR_Finisher",
    "backtrace": [
        "(()+0x12730) [0x7f0f8153d730]",
        "(gsignal()+0x10b) [0x7f0f80e027bb]",
        "(abort()+0x121) [0x7f0f80ded535]",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a5) [0x7f0f81f1d0f5]",
        "(()+0x28127c) [0x7f0f81f1d27c]",
        "(MDCache::truncate_inode(CInode*, LogSegment*)+0x305) [0x55ed3b243aa5]",
        "(C_MDS_inode_update_finish::finish(int)+0x14c) [0x55ed3b219dec]",
        "(MDSContext::complete(int)+0x52) [0x55ed3b4156d2]",
        "(MDSIOContextBase::complete(int)+0x9f) [0x55ed3b4158af]",
        "(MDSLogContextBase::complete(int)+0x40) [0x55ed3b415c30]",
        "(Finisher::finisher_thread_entry()+0x19d) [0x7f0f81fab73d]",
        "(()+0x7fa3) [0x7f0f81532fa3]",
        "(clone()+0x3f) [0x7f0f80ec44cf]"
    ],
    "ceph_version": "15.2.13",
    "crash_id": "2021-08-09T12:44:34.190128Z_1e163bf2-6ddf-45ef-a80f-0bf42158da31",
    "entity_name": "mds.hk-cephnode-60",
    "os_id": "10",
    "os_name": "Debian GNU/Linux 10 (buster)",
    "os_version": "10 (buster)",
    "os_version_id": "10",
    "process_name": "ceph-mds",
    "stack_sig": "5f310d14ffe4b2600195c874fba3761c268218711ee4a449413862bb5553fb4c",
    "timestamp": "2021-08-09T12:44:34.190128Z",
    "utsname_hostname": "hk-cephnode-60",
    "utsname_machine": "x86_64",
    "utsname_release": "5.4.114-1-pve",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PVE 5.4.114-1 (Sun, 09 May 2021 17:13:05 +0200)"
}

---
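As far as I can tell from the report, the assertion means that MDCache::truncate_inode() found the inode's file ScatterLock in a state other than LOCK_XLOCK/LOCK_XLOCKDONE. To make sure I'm reading it right, here is a stripped-down, compilable model of the check my own sketch, not the actual Ceph source; the two expected state names come straight from the crash report, while the follow-up transition to LOCK_XLOCKSNAP is my assumption:

#include <cassert>

// Minimal stand-ins for the MDS lock states involved (sketch only).
enum LockState { LOCK_SYNC, LOCK_XLOCK, LOCK_XLOCKDONE, LOCK_XLOCKSNAP };

struct ScatterLockSketch {
  LockState state = LOCK_SYNC;

  // Mirrors the failed check from ScatterLock.h:59 in 15.2.13:
  // the caller expects the lock to still be exclusively locked.
  void set_xlock_snap_sync() {
    assert(state == LOCK_XLOCK || state == LOCK_XLOCKDONE);
    state = LOCK_XLOCKSNAP;  // assumed follow-up transition, not verified
  }
};

int main() {
  ScatterLockSketch lock;      // state is LOCK_SYNC, i.e. not xlocked
  lock.set_xlock_snap_sync();  // aborts, like the MDS does in the backtrace
}

Running this aborts the same way the MDS does; what I can't work out is why the real lock would be in an unexpected state while a truncate is still in flight, hence this mail.

---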
root@hk-cephnode-54:~# ceph health detail
HEALTH_WARN 1 daemons have recently crashed
[WRN] RECENT_CRASH: 1 daemons have recently crashed
    mds.hk-cephnode-54 crashed on host hk-cephnode-54 at 2021-08-09T11:20:23.776776Z

root@hk-cephnode-54:~# ceph status
  cluster:
    id:     xxxx
    health: HEALTH_WARN
            1 daemons have recently crashed

  services:
    mon: 3 daemons, quorum hk-cephnode-60,hk-cephnode-61,hk-cephnode-62 (age 4w)
    mgr: hk-cephnode-53(active, since 4h), standbys: hk-cephnode-51, hk-cephnode-52
    mds: cephfs:4 {0=hk-cephnode-60=up:active,1=hk-cephnode-61=up:active,2=hk-cephnode-55=up:active,3=hk-cephnode-57=up:active} 4 up:standby-replay 1 up:standby
    osd: 180 osds: 180 up (since 5d), 180 in (since 2w)

  data:
    pools:   9 pools, 2433 pgs
    objects: 118.22M objects, 331 TiB
    usage:   935 TiB used, 990 TiB / 1.9 PiB avail
    pgs:     2433 active+clean

  io:
    client: 231 MiB/s rd, 146 MiB/s wr, 900 op/s rd, 4.07k op/s wr

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx