Dear Xiubo,

thanks for that link. It seems like it's a harmless issue. I believe I have seen a blocked OP in the ceph warnings for this MDS and was quite happy it restarted by itself. Looks like it's a very rare race condition and does not lead to data loss or corruption.

In a situation like this, is it normal that the MDS host is blacklisted? The MDS reconnected just fine. Is it the MDS client ID of the crashed MDS that is blocked? I can't see anything that is denied access.

Thanks for your reply and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Xiubo Li <xiubli@xxxxxxxxxx>
Sent: 22 March 2023 07:27:08
To: Frank Schilder; ceph-users@xxxxxxx
Subject: Re: MDS host in OSD blacklist

Hi Frank,

This should be the same issue as https://tracker.ceph.com/issues/49132, which has been fixed.

Thanks
- Xiubo

On 21/03/2023 23:32, Frank Schilder wrote:
> Hi all,
>
> we have an octopus v15.2.17 cluster and observe that one of our MDS hosts showed up in the OSD blacklist:
>
> # ceph osd blacklist ls
> 192.168.32.87:6801/3841823949 2023-03-22T10:08:02.589698+0100
> 192.168.32.87:6800/3841823949 2023-03-22T10:08:02.589698+0100
>
> I see an MDS restart that might be related; see log snippets below. There are no clients running on this host, only OSDs and one MDS. What could be the reason for the blacklist entries?
>
> Thanks!
> > Log snippets: > > Mar 21 10:07:54 ceph-23 journal: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.17/rpm/el8/BUILD/ceph-15.2.17/src/mds/ScatterLock.h: In function 'void ScatterLock::set_xlock_snap_sync(MDSContext*)' thread 7f99e63d5700 time 2023-03-21T10:07:54.967936+0100 > Mar 21 10:07:54 ceph-23 journal: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.17/rpm/el8/BUILD/ceph-15.2.17/src/mds/ScatterLock.h: 59: FAILED ceph_assert(state == LOCK_XLOCK || state == LOCK_XLOCKDONE) > Mar 21 10:07:54 ceph-23 journal: ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable) > Mar 21 10:07:54 ceph-23 journal: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7f99f4a25b92] > Mar 21 10:07:54 ceph-23 journal: 2: (()+0x27ddac) [0x7f99f4a25dac] > Mar 21 10:07:54 ceph-23 journal: 3: (MDCache::truncate_inode(CInode*, LogSegment*)+0x32c) [0x561bd623962c] > Mar 21 10:07:54 ceph-23 journal: 4: (C_MDS_inode_update_finish::finish(int)+0x133) [0x561bd6210a83] > Mar 21 10:07:54 ceph-23 journal: 5: (MDSContext::complete(int)+0x56) [0x561bd6422656] > Mar 21 10:07:54 ceph-23 journal: 6: (MDSIOContextBase::complete(int)+0x39c) [0x561bd6422b5c] > Mar 21 10:07:54 ceph-23 journal: 7: (MDSLogContextBase::complete(int)+0x44) [0x561bd6422cb4] > Mar 21 10:07:54 ceph-23 journal: 8: (Finisher::finisher_thread_entry()+0x1a5) [0x7f99f4ab6a95] > Mar 21 10:07:54 ceph-23 journal: 9: (()+0x81ca) [0x7f99f35fb1ca] > Mar 21 10:07:54 ceph-23 journal: 10: (clone()+0x43) [0x7f99f204ddd3] > Mar 21 10:07:54 ceph-23 journal: *** Caught signal (Aborted) ** > Mar 21 10:07:54 ceph-23 journal: in thread 7f99e63d5700 thread_name:MR_Finisher > Mar 21 10:07:54 ceph-23 journal: 2023-03-21T10:07:54.980+0100 7f99e63d5700 -1 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.17/rpm/el8/BUILD/ceph-15.2.17/src/mds/ScatterLock.h: In function 'void ScatterLock::set_xlock_snap_sync(MDSContext*)' thread 7f99e63d5700 time 2023-03-21T10:07:54.967936+0100 > Mar 21 10:07:54 ceph-23 journal: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.17/rpm/el8/BUILD/ceph-15.2.17/src/mds/ScatterLock.h: 59: FAILED ceph_assert(state == LOCK_XLOCK || state == LOCK_XLOCKDONE) > Mar 21 10:07:54 ceph-23 journal: > Mar 21 10:07:54 ceph-23 journal: ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable) > Mar 21 10:07:54 ceph-23 journal: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7f99f4a25b92] > Mar 21 10:07:54 ceph-23 journal: 2: (()+0x27ddac) [0x7f99f4a25dac] > Mar 21 10:07:54 ceph-23 journal: 3: (MDCache::truncate_inode(CInode*, LogSegment*)+0x32c) [0x561bd623962c] > Mar 21 10:07:54 ceph-23 journal: 4: (C_MDS_inode_update_finish::finish(int)+0x133) [0x561bd6210a83] > Mar 21 10:07:54 ceph-23 journal: 5: (MDSContext::complete(int)+0x56) [0x561bd6422656] > Mar 21 10:07:54 ceph-23 journal: 6: (MDSIOContextBase::complete(int)+0x39c) [0x561bd6422b5c] > Mar 21 10:07:54 ceph-23 journal: 7: (MDSLogContextBase::complete(int)+0x44) [0x561bd6422cb4] > Mar 21 10:07:54 ceph-23 journal: 8: (Finisher::finisher_thread_entry()+0x1a5) [0x7f99f4ab6a95] > Mar 21 10:07:54 ceph-23 journal: 9: (()+0x81ca) [0x7f99f35fb1ca] > Mar 21 10:07:54 ceph-23 journal: 10: (clone()+0x43) [0x7f99f204ddd3] > Mar 21 10:07:54 ceph-23 journal: > Mar 21 10:07:54 ceph-23 journal: ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable) > Mar 21 10:07:54 ceph-23 journal: 1: (()+0x12ce0) [0x7f99f3605ce0] > Mar 21 10:07:54 ceph-23 journal: 2: (gsignal()+0x10f) 
[0x7f99f2062a9f] > Mar 21 10:07:54 ceph-23 journal: 3: (abort()+0x127) [0x7f99f2035e05] > Mar 21 10:07:54 ceph-23 journal: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7f99f4a25be3] > Mar 21 10:07:54 ceph-23 journal: 5: (()+0x27ddac) [0x7f99f4a25dac] > Mar 21 10:07:54 ceph-23 journal: 6: (MDCache::truncate_inode(CInode*, LogSegment*)+0x32c) [0x561bd623962c] > Mar 21 10:07:54 ceph-23 journal: 7: (C_MDS_inode_update_finish::finish(int)+0x133) [0x561bd6210a83] > Mar 21 10:07:54 ceph-23 journal: 8: (MDSContext::complete(int)+0x56) [0x561bd6422656] > Mar 21 10:07:54 ceph-23 journal: 9: (MDSIOContextBase::complete(int)+0x39c) [0x561bd6422b5c] > Mar 21 10:07:54 ceph-23 journal: 10: (MDSLogContextBase::complete(int)+0x44) [0x561bd6422cb4] > Mar 21 10:07:54 ceph-23 journal: 11: (Finisher::finisher_thread_entry()+0x1a5) [0x7f99f4ab6a95] > Mar 21 10:07:54 ceph-23 journal: 12: (()+0x81ca) [0x7f99f35fb1ca] > Mar 21 10:07:54 ceph-23 journal: 13: (clone()+0x43) [0x7f99f204ddd3] > Mar 21 10:07:54 ceph-23 journal: 2023-03-21T10:07:54.982+0100 7f99e63d5700 -1 *** Caught signal (Aborted) ** > Mar 21 10:07:54 ceph-23 journal: in thread 7f99e63d5700 thread_name:MR_Finisher > Mar 21 10:07:54 ceph-23 journal: > Mar 21 10:07:54 ceph-23 journal: ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable) > Mar 21 10:07:54 ceph-23 journal: 1: (()+0x12ce0) [0x7f99f3605ce0] > Mar 21 10:07:54 ceph-23 journal: 2: (gsignal()+0x10f) [0x7f99f2062a9f] > Mar 21 10:07:54 ceph-23 journal: 3: (abort()+0x127) [0x7f99f2035e05] > Mar 21 10:07:54 ceph-23 journal: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7f99f4a25be3] > Mar 21 10:07:54 ceph-23 journal: 5: (()+0x27ddac) [0x7f99f4a25dac] > Mar 21 10:07:54 ceph-23 journal: 6: (MDCache::truncate_inode(CInode*, LogSegment*)+0x32c) [0x561bd623962c] > Mar 21 10:07:54 ceph-23 journal: 7: (C_MDS_inode_update_finish::finish(int)+0x133) [0x561bd6210a83] > Mar 21 
10:07:54 ceph-23 journal: 8: (MDSContext::complete(int)+0x56) [0x561bd6422656] > Mar 21 10:07:54 ceph-23 journal: 9: (MDSIOContextBase::complete(int)+0x39c) [0x561bd6422b5c] > Mar 21 10:07:54 ceph-23 journal: 10: (MDSLogContextBase::complete(int)+0x44) [0x561bd6422cb4] > Mar 21 10:07:54 ceph-23 journal: 11: (Finisher::finisher_thread_entry()+0x1a5) [0x7f99f4ab6a95] > Mar 21 10:07:54 ceph-23 journal: 12: (()+0x81ca) [0x7f99f35fb1ca] > Mar 21 10:07:54 ceph-23 journal: 13: (clone()+0x43) [0x7f99f204ddd3] > Mar 21 10:07:54 ceph-23 journal: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. > Mar 21 10:07:54 ceph-23 journal: > Mar 21 10:07:55 ceph-23 journal: -1> 2023-03-21T10:07:54.980+0100 7f99e63d5700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.17/rpm/el8/BUILD/ceph-15.2.17/src/mds/ScatterLock.h: In function 'void ScatterLock::set_xlock_snap_sync(MDSContext*)' thread 7f99e63d5700 time 2023-03-21T10:07:54.967936+0100 > Mar 21 10:07:55 ceph-23 journal: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.17/rpm/el8/BUILD/ceph-15.2.17/src/mds/ScatterLock.h: 59: FAILED ceph_assert(state == LOCK_XLOCK || state == LOCK_XLOCKDONE) > Mar 21 10:07:55 ceph-23 journal: > Mar 21 10:07:55 ceph-23 journal: ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable) > Mar 21 10:07:55 ceph-23 journal: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7f99f4a25b92] > Mar 21 10:07:55 ceph-23 journal: 2: (()+0x27ddac) [0x7f99f4a25dac] > Mar 21 10:07:55 ceph-23 journal: 3: (MDCache::truncate_inode(CInode*, LogSegment*)+0x32c) [0x561bd623962c] > Mar 21 10:07:55 ceph-23 journal: 4: (C_MDS_inode_update_finish::finish(int)+0x133) [0x561bd6210a83] > Mar 21 10:07:55 ceph-23 journal: 5: 
(MDSContext::complete(int)+0x56) [0x561bd6422656] > Mar 21 10:07:55 ceph-23 journal: 6: (MDSIOContextBase::complete(int)+0x39c) [0x561bd6422b5c] > Mar 21 10:07:55 ceph-23 journal: 7: (MDSLogContextBase::complete(int)+0x44) [0x561bd6422cb4] > Mar 21 10:07:55 ceph-23 journal: 8: (Finisher::finisher_thread_entry()+0x1a5) [0x7f99f4ab6a95] > Mar 21 10:07:55 ceph-23 journal: 9: (()+0x81ca) [0x7f99f35fb1ca] > Mar 21 10:07:55 ceph-23 journal: 10: (clone()+0x43) [0x7f99f204ddd3] > Mar 21 10:07:55 ceph-23 journal: > Mar 21 10:07:55 ceph-23 journal: 0> 2023-03-21T10:07:54.982+0100 7f99e63d5700 -1 *** Caught signal (Aborted) ** > Mar 21 10:07:55 ceph-23 journal: in thread 7f99e63d5700 thread_name:MR_Finisher > Mar 21 10:07:55 ceph-23 journal: > Mar 21 10:07:55 ceph-23 journal: ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable) > Mar 21 10:07:55 ceph-23 journal: 1: (()+0x12ce0) [0x7f99f3605ce0] > Mar 21 10:07:55 ceph-23 journal: 2: (gsignal()+0x10f) [0x7f99f2062a9f] > Mar 21 10:07:55 ceph-23 journal: 3: (abort()+0x127) [0x7f99f2035e05] > Mar 21 10:07:55 ceph-23 journal: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7f99f4a25be3] > Mar 21 10:07:55 ceph-23 journal: 5: (()+0x27ddac) [0x7f99f4a25dac] > Mar 21 10:07:55 ceph-23 journal: 6: (MDCache::truncate_inode(CInode*, LogSegment*)+0x32c) [0x561bd623962c] > Mar 21 10:07:55 ceph-23 journal: 7: (C_MDS_inode_update_finish::finish(int)+0x133) [0x561bd6210a83] > Mar 21 10:07:55 ceph-23 journal: 8: (MDSContext::complete(int)+0x56) [0x561bd6422656] > Mar 21 10:07:55 ceph-23 journal: 9: (MDSIOContextBase::complete(int)+0x39c) [0x561bd6422b5c] > Mar 21 10:07:55 ceph-23 journal: 10: (MDSLogContextBase::complete(int)+0x44) [0x561bd6422cb4] > Mar 21 10:07:55 ceph-23 journal: 11: (Finisher::finisher_thread_entry()+0x1a5) [0x7f99f4ab6a95] > Mar 21 10:07:55 ceph-23 journal: 12: (()+0x81ca) [0x7f99f35fb1ca] > Mar 21 10:07:55 ceph-23 journal: 13: (clone()+0x43) [0x7f99f204ddd3] > 
Mar 21 10:07:55 ceph-23 journal: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. > Mar 21 10:07:55 ceph-23 journal: > Mar 21 10:07:55 ceph-23 journal: -9999> 2023-03-21T10:07:54.980+0100 7f99e63d5700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.17/rpm/el8/BUILD/ceph-15.2.17/src/mds/ScatterLock.h: In function 'void ScatterLock::set_xlock_snap_sync(MDSContext*)' thread 7f99e63d5700 time 2023-03-21T10:07:54.967936+0100 > Mar 21 10:07:55 ceph-23 journal: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.17/rpm/el8/BUILD/ceph-15.2.17/src/mds/ScatterLock.h: 59: FAILED ceph_assert(state == LOCK_XLOCK || state == LOCK_XLOCKDONE) > Mar 21 10:07:55 ceph-23 journal: > Mar 21 10:07:55 ceph-23 journal: ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable) > Mar 21 10:07:55 ceph-23 journal: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7f99f4a25b92] > Mar 21 10:07:55 ceph-23 journal: 2: (()+0x27ddac) [0x7f99f4a25dac] > Mar 21 10:07:55 ceph-23 journal: 3: (MDCache::truncate_inode(CInode*, LogSegment*)+0x32c) [0x561bd623962c] > Mar 21 10:07:55 ceph-23 journal: 4: (C_MDS_inode_update_finish::finish(int)+0x133) [0x561bd6210a83] > Mar 21 10:07:55 ceph-23 journal: 5: (MDSContext::complete(int)+0x56) [0x561bd6422656] > Mar 21 10:07:55 ceph-23 journal: 6: (MDSIOContextBase::complete(int)+0x39c) [0x561bd6422b5c] > Mar 21 10:07:55 ceph-23 journal: 7: (MDSLogContextBase::complete(int)+0x44) [0x561bd6422cb4] > Mar 21 10:07:55 ceph-23 journal: 8: (Finisher::finisher_thread_entry()+0x1a5) [0x7f99f4ab6a95] > Mar 21 10:07:55 ceph-23 journal: 9: (()+0x81ca) [0x7f99f35fb1ca] > Mar 21 10:07:55 ceph-23 journal: 10: (clone()+0x43) [0x7f99f204ddd3] > Mar 21 10:07:55 ceph-23 journal: > 
Mar 21 10:07:55 ceph-23 journal: -9998> 2023-03-21T10:07:54.982+0100 7f99e63d5700 -1 *** Caught signal (Aborted) ** > Mar 21 10:07:55 ceph-23 journal: in thread 7f99e63d5700 thread_name:MR_Finisher > Mar 21 10:07:55 ceph-23 journal: > Mar 21 10:07:55 ceph-23 journal: ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable) > Mar 21 10:07:55 ceph-23 journal: 1: (()+0x12ce0) [0x7f99f3605ce0] > Mar 21 10:07:55 ceph-23 journal: 2: (gsignal()+0x10f) [0x7f99f2062a9f] > Mar 21 10:07:55 ceph-23 journal: 3: (abort()+0x127) [0x7f99f2035e05] > Mar 21 10:07:55 ceph-23 journal: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7f99f4a25be3] > Mar 21 10:07:55 ceph-23 journal: 5: (()+0x27ddac) [0x7f99f4a25dac] > Mar 21 10:07:55 ceph-23 journal: 6: (MDCache::truncate_inode(CInode*, LogSegment*)+0x32c) [0x561bd623962c] > Mar 21 10:07:55 ceph-23 journal: 7: (C_MDS_inode_update_finish::finish(int)+0x133) [0x561bd6210a83] > Mar 21 10:07:55 ceph-23 journal: 8: (MDSContext::complete(int)+0x56) [0x561bd6422656] > Mar 21 10:07:55 ceph-23 journal: 9: (MDSIOContextBase::complete(int)+0x39c) [0x561bd6422b5c] > Mar 21 10:07:55 ceph-23 journal: 10: (MDSLogContextBase::complete(int)+0x44) [0x561bd6422cb4] > Mar 21 10:07:55 ceph-23 journal: 11: (Finisher::finisher_thread_entry()+0x1a5) [0x7f99f4ab6a95] > Mar 21 10:07:55 ceph-23 journal: 12: (()+0x81ca) [0x7f99f35fb1ca] > Mar 21 10:07:55 ceph-23 journal: 13: (clone()+0x43) [0x7f99f204ddd3] > Mar 21 10:07:55 ceph-23 journal: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. > Mar 21 10:07:55 ceph-23 journal: > Mar 21 10:07:55 ceph-23 journal: reraise_fatal: default handler for signal 6 didn't terminate the process? 
> Mar 21 10:07:58 ceph-23 dockerd-current: time="2023-03-21T10:07:58.119559277+01:00" level=warning msg="040c1e98a0669204e0e98bdbcdde893f8acf63444f3827358e663a13a2869478 cleanup: failed to unmount secrets: invalid argument"
> Mar 21 10:07:58 ceph-23 kernel: overlayfs: upperdir is in-use as upperdir/workdir of another mount, accessing files from both mounts will result in undefined behavior.
> Mar 21 10:07:58 ceph-23 kernel: overlayfs: workdir is in-use as upperdir/workdir of another mount, accessing files from both mounts will result in undefined behavior.
> Mar 21 10:07:58 ceph-23 journal: 118 get_config /opt/ceph-container/bin/config.static.sh
> Mar 21 10:07:58 ceph-23 journal: 5 start_mds /opt/ceph-container/bin/start_mds.sh
> Mar 21 10:07:58 ceph-23 journal: 120 main /opt/ceph-container/bin/entrypoint.sh
> Mar 21 10:07:58 ceph-23 journal: 2023-03-21 10:07:58 /opt/ceph-container/bin/entrypoint.sh: static: does not generate config
> Mar 21 10:07:58 ceph-23 journal: 58 start_mds /opt/ceph-container/bin/start_mds.sh
> Mar 21 10:07:58 ceph-23 journal: 120 main /opt/ceph-container/bin/entrypoint.sh
> Mar 21 10:07:58 ceph-23 journal: 2023-03-21 10:07:58 /opt/ceph-container/bin/entrypoint.sh: SUCCESS
> Mar 21 10:07:58 ceph-23 journal: starting mds.ceph-23 at
>
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>

--
Best Regards,

Xiubo Li (李秀波)

Email: xiubli@xxxxxxxxxx/xiubli@xxxxxxx
Slack: @Xiubo Li
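
One rough way to check whether the blacklist entries belong to the crashed MDS is to parse the `ceph osd blacklist ls` output quoted earlier in this thread and estimate when the entries were created. The sketch below is only an illustration under stated assumptions: it reads each entry as `IP:port/nonce` followed by an expiry timestamp, and it assumes a 24-hour blacklist duration, which this thread does not confirm for this cluster. The helper name `parse_blacklist` and the constant `ASSUMED_INTERVAL` are hypothetical.

```python
from datetime import datetime, timedelta

# Sample copied from the `ceph osd blacklist ls` output quoted in this thread.
SAMPLE = """\
192.168.32.87:6801/3841823949 2023-03-22T10:08:02.589698+0100
192.168.32.87:6800/3841823949 2023-03-22T10:08:02.589698+0100
"""

# ASSUMPTION: blacklist entries last 24 hours on this cluster.
# Check the actual monitor setting before relying on this value.
ASSUMED_INTERVAL = timedelta(hours=24)


def parse_blacklist(text):
    """Parse lines of the form 'IP:port/nonce expiry' into dictionaries."""
    entries = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank lines and anything that is not an entry
        addr, expiry = parts
        host_port, _, nonce = addr.partition("/")
        ip, _, port = host_port.rpartition(":")
        entries.append({
            "ip": ip,
            "port": int(port),
            "nonce": int(nonce),
            "expires": datetime.strptime(expiry, "%Y-%m-%dT%H:%M:%S.%f%z"),
        })
    return entries


entries = parse_blacklist(SAMPLE)
for entry in entries:
    # Estimated creation time = expiry minus the assumed interval.
    created = entry["expires"] - ASSUMED_INTERVAL
    print(entry["ip"], entry["port"], entry["nonce"], created.isoformat())
```

Under the 24-hour assumption, both entries would have been created at about 10:08:02 on 21 March, a few seconds after the assert at 10:07:54 in the quoted log. Both entries also carry the same trailing number. That is consistent with, but does not prove, the reading that the entries refer to the addresses of the crashed MDS instance rather than to the host as a whole.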