I can't interpret it, some help is appreciated.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: Wednesday, May 10, 2023 3:36 PM
To: ceph-users@xxxxxxx
Subject: Re: mds dump inode crashes file system
For the "mds dump inode" command I could find the crash in the log; see below. Most of the log contents is the past OPS dump from the 3 MDS restarts that happened. It contains the 10000 last OPS before the crash and I can upload the log if someone can use it. The crash stack trace somewhat truncated for readability:
2023-05-10T12:54:53.142+0200 7fe971ca6700 1 mds.ceph-23 Updating MDS map to version 892464 from mon.4
2023-05-10T13:39:50.962+0200 7fe96fca2700 0 log_channel(cluster) log [WRN] : client.205899841 isn't responding to mclientcaps(revoke), ino 0x20011d3e5cb pending pAsLsXsFscr issued pAsLsXsFscr, sent 61.705410 seconds ago
2023-05-10T13:39:52.550+0200 7fe971ca6700 1 mds.ceph-23 Updating MDS map to version 892465 from mon.4
2023-05-10T13:40:50.963+0200 7fe96fca2700 0 log_channel(cluster) log [WRN] : client.205899841 isn't responding to mclientcaps(revoke), ino 0x20011d3e5cb pending pAsLsXsFscr issued pAsLsXsFscr, sent 121.706193 seconds ago
2023-05-10T13:42:50.966+0200 7fe96fca2700 0 log_channel(cluster) log [WRN] : client.205899841 isn't responding to mclientcaps(revoke), ino 0x20011d3e5cb pending pAsLsXsFscr issued pAsLsXsFscr, sent 241.709072 seconds ago
2023-05-10T13:44:00.506+0200 7fe972ca8700 1 mds.ceph-23 asok_command: dump inode {number=2199322355147,prefix=dump inode} (starting...)
2023-05-10T13:44:00.520+0200 7fe972ca8700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.17/rpm/el8/BUILD/ceph-15.2.17/src/common/buffer.cc: In function 'const char* ceph::buffer::v15_2_0::ptr::c_str() const' thread 7fe972ca8700 time 2023-05-10T13:44:00.507652+0200
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.17/rpm/el8/BUILD/ceph-15.2.17/src/common/buffer.cc: 501: FAILED ceph_assert(_raw)
ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7fe979ae9b92]
2: (()+0x27ddac) [0x7fe979ae9dac]
3: (()+0x5ce831) [0x7fe979e3a831]
4: (InodeStoreBase::dump(ceph::Formatter*) const+0x153) [0x55c08c59b543]
5: (CInode::dump(ceph::Formatter*, int) const+0x144) [0x55c08c59b8d4]
6: (MDCache::dump_inode(ceph::Formatter*, unsigned long)+0x7c) [0x55c08c41e00c]
7: (MDSRank::command_dump_inode(ceph::Formatter*, ..., std::ostream&)+0xb5) [0x55c08c353e75]
8: (MDSRankDispatcher::handle_asok_command(std::basic_string_view<char, std::char_traits<char> >, ..., ceph::buffer::v15_2_0::list&)>)+0x2296) [0x55c08c36c5f6]
9: (MDSDaemon::asok_command(std::basic_string_view<char, ..., ceph::buffer::v15_2_0::list&)>)+0x75b) [0x55c08c340eab]
10: (MDSSocketHook::call_async(std::basic_string_view<char, std::char_traits<char> >, ..., ceph::buffer::v15_2_0::list&)>)+0x6a) [0x55c08c34f9ca]
11: (AdminSocket::execute_command(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, ..., ceph::buffer::v15_2_0::list&)>)+0x6f9) [0x7fe979bece59]
12: (AdminSocket::do_tell_queue()+0x289) [0x7fe979bed809]
13: (AdminSocket::entry()+0x4d3) [0x7fe979beefd3]
14: (()+0xc2ba3) [0x7fe977afaba3]
15: (()+0x81ca) [0x7fe9786bf1ca]
16: (clone()+0x43) [0x7fe977111dd3]
2023-05-10T13:44:00.522+0200 7fe972ca8700 -1 *** Caught signal (Aborted) **
in thread 7fe972ca8700 thread_name:admin_socket
ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)
1: (()+0x12ce0) [0x7fe9786c9ce0]
2: (gsignal()+0x10f) [0x7fe977126a9f]
3: (abort()+0x127) [0x7fe9770f9e05]
4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7fe979ae9be3]
5: (()+0x27ddac) [0x7fe979ae9dac]
6: (()+0x5ce831) [0x7fe979e3a831]
7: (InodeStoreBase::dump(ceph::Formatter*) const+0x153) [0x55c08c59b543]
8: (CInode::dump(ceph::Formatter*, int) const+0x144) [0x55c08c59b8d4]
9: (MDCache::dump_inode(ceph::Formatter*, unsigned long)+0x7c) [0x55c08c41e00c]
10: (MDSRank::command_dump_inode(ceph::Formatter*, ..., std::ostream&)+0xb5) [0x55c08c353e75]
11: (MDSRankDispatcher::handle_asok_command(std::basic_string_view<char, std::char_traits<char> >, ..., ceph::buffer::v15_2_0::list&)>)+0x2296) [0x55c08c36c5f6]
12: (MDSDaemon::asok_command(std::basic_string_view<char, std::char_traits<char> >, ..., ceph::buffer::v15_2_0::list&)>)+0x75b) [0x55c08c340eab]
13: (MDSSocketHook::call_async(std::basic_string_view<char, std::char_traits<char> >, ..., ceph::buffer::v15_2_0::list&)>)+0x6a) [0x55c08c34f9ca]
14: (AdminSocket::execute_command(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, ..., ceph::buffer::v15_2_0::list&)>)+0x6f9) [0x7fe979bece59]
15: (AdminSocket::do_tell_queue()+0x289) [0x7fe979bed809]
16: (AdminSocket::entry()+0x4d3) [0x7fe979beefd3]
17: (()+0xc2ba3) [0x7fe977afaba3]
18: (()+0x81ca) [0x7fe9786bf1ca]
19: (clone()+0x43) [0x7fe977111dd3]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: Wednesday, May 10, 2023 2:33 PM
To: ceph-users@xxxxxxx
Subject: mds dump inode crashes file system
Hi all,
I have an annoying problem with a specific ceph fs client. We have a file server on which we re-export kernel mounts via samba (all mounts with noshare option). On one of these re-exports we have recurring problems. Today I caught it with
2023-05-10T13:39:50.963685+0200 mds.ceph-23 (mds.1) 1761 : cluster [WRN] client.205899841 isn't responding to mclientcaps(revoke), ino 0x20011d3e5cb pending pAsLsXsFscr issued pAsLsXsFscr, sent 61.705410 seconds ago
and I wanted to look up what path the inode 0x20011d3e5cb points to. Unfortunately, the command
ceph tell "mds.*" dump inode 0x20011d3e5cb
crashes an MDS in a way that it restarts itself, but doesn't seem to come back clean (it does not fail over to a stand-by). If I repeat the command above, it crashes the MDS again. Execution on other MDS daemons succeeds, for example:
# ceph tell "mds.ceph-24" dump inode 0x20011d3e5cb
2023-05-10T14:14:37.091+0200 7fa47ffff700 0 client.210149523 ms_handle_reset on v2:192.168.32.88:6800/3216233914
2023-05-10T14:14:37.124+0200 7fa4857fa700 0 client.210374440 ms_handle_reset on v2:192.168.32.88:6800/3216233914
dump inode failed, wrong inode number or the inode is not cached
The caps recall gets the client evicted at some point but it doesn't manage to come back clean. On a single ceph fs mount point I see this
# ls /shares/samba/rit-oil
ls: cannot access '/shares/samba/rit-oil': Stale file handle
All other mount points are fine, just this one acts up. A "mount -o remount /shares/samba/rit-oil" crashed the entire server and I had to do a cold reboot. On reboot I see this message: https://imgur.com/a/bOSLxBb , which only occurs on this one file server (we are running a few of those). Does this point to a more serious problem, like a file system corruption? Should I try an fs scrub on the corresponding path?
Some info about the system:
The file server's kernel version is quite recent, updated two weeks ago:
$ uname -r
4.18.0-486.el8.x86_64
# cat /etc/redhat-release
CentOS Stream release 8
Our ceph cluster is octopus latest and we use the packages from the octopus el8 repo on this server.
We have several such shares and they all work fine. It is only on one share where we have persistent problems with the mount point hanging or the server freezing and crashing.
After working hours I will try a proper fail of the "broken" MDS to see if I can execute the dump inode command without it crashing everything.
In the mean time, any hints would be appreciated. I see that we have an exceptionally large MDS log for the problematic one. Any hint what to look for would be appreciated, it contains a lot from the recovery operations:
# pdsh -w ceph-[08-17,23-24] ls -lh "/var/log/ceph/ceph-mds.ceph-??.log"
ceph-23: -rw-r--r--. 1 ceph ceph 15M May 10 14:28 /var/log/ceph/ceph-mds.ceph-23.log *** huge ***
ceph-24: -rw-r--r--. 1 ceph ceph 14K May 10 14:28 /var/log/ceph/ceph-mds.ceph-24.log
ceph-10: -rw-r--r--. 1 ceph ceph 394 May 10 14:02 /var/log/ceph/ceph-mds.ceph-10.log
ceph-13: -rw-r--r--. 1 ceph ceph 394 May 10 14:02 /var/log/ceph/ceph-mds.ceph-13.log
ceph-08: -rw-r--r--. 1 ceph ceph 394 May 10 14:02 /var/log/ceph/ceph-mds.ceph-08.log
ceph-15: -rw-r--r--. 1 ceph ceph 14K May 10 14:28 /var/log/ceph/ceph-mds.ceph-15.log
ceph-17: -rw-r--r--. 1 ceph ceph 14K May 10 14:28 /var/log/ceph/ceph-mds.ceph-17.log
ceph-14: -rw-r--r--. 1 ceph ceph 16K May 10 14:28 /var/log/ceph/ceph-mds.ceph-14.log
ceph-09: -rw-r--r--. 1 ceph ceph 16K May 10 14:28 /var/log/ceph/ceph-mds.ceph-09.log
ceph-16: -rw-r--r--. 1 ceph ceph 15K May 10 14:28 /var/log/ceph/ceph-mds.ceph-16.log
ceph-11: -rw-r--r--. 1 ceph ceph 14K May 10 14:28 /var/log/ceph/ceph-mds.ceph-11.log
ceph-12: -rw-r--r--. 1 ceph ceph 394 May 10 14:02 /var/log/ceph/ceph-mds.ceph-12.log
Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx